The Zen of CI

On running tests that are expected to fail

TL;DR

A colleague recently suggested a change that I thought wouldn't work, so my first response was 'you can waste your time trying that'. But then I realised that the change itself was trivial, and I didn't need to spend time setting up a test environment: I could just raise a pull request and wait for the continuous integration (CI) tests to fail.

Only the tests didn't fail. My colleague's approach was valid. Best of all, neither of us had spent much time on it; the existing test suite did all the work.

Background

For each pull request to at_server, the core implementation of The atPlatform, we run a test suite that consists of unit tests, functional tests and end-to-end tests. The end-to-end tests had started failing, which was weird, as the only recent changes were to docs, so nothing had changed in the implementation.

A dive into the logs on the CI virtual machines showed certificate errors, specifically that the certificates were out of date. This was also weird, as the certificates are automatically renewed by Certbot, and were quite clearly in date. I could even validate that on the command line.
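For anyone wanting to script that kind of "is it really in date?" check, here is a minimal Python sketch. It assumes you already have the certificate's notAfter string in the format OpenSSL prints (for example from openssl x509 -enddate -noout); the function name days_until_expiry and the sample dates are illustrative only, not something from our CI setup.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return days left before a certificate expires (negative if expired).

    not_after uses OpenSSL's notAfter format, e.g. 'Sep 30 14:01:15 2021 GMT'.
    """
    # ssl.cert_time_to_seconds converts the notAfter string to epoch seconds.
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


# Illustrative check date, a few hours after the sample expiry time.
check_time = datetime(2021, 10, 1, tzinfo=timezone.utc)
print(days_until_expiry("Sep 30 14:01:15 2021 GMT", check_time) < 0)  # True
```

A check like this only looks at one certificate's own dates, which is exactly why our leaf certs looked fine while the chain as a whole did not.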

Then I remembered that Let's Encrypt had sent out a notification about an expiring legacy root Certificate Authority (CA) certificate. But surely that wouldn't affect us... we definitely had the latest CA certs in place.

It turns out the expiration of DST Root CA X3 did affect us though, because Let's Encrypt had chosen defaults that were meant to help with older systems, but which had the unintended consequence of causing a bunch of collateral damage.

Picking through the chain of trust in fullchain.pem: our nice in-date certs for the CI infrastructure were issued by the Let's Encrypt R3 intermediate CA, and R3 was issued by ISRG Root X1. We had ISRG Root X1 in our cacerts, so all should be good. Well, no actually, because the copy of ISRG Root X1 in fullchain.pem was signed by DST Root CA X3, which expired on 30 Sep 2021. So the Dart HTTP package was walking along the chain of trust, getting to the expired root, and returning an error.

Our immediate workaround was to cut the ISRG Root X1 certificate (the one signed by DST Root CA X3) out of fullchain.pem, which can be achieved by invoking Certbot with an optional flag: certbot renew --force-renewal --preferred-chain "ISRG Root X1". But workarounds can come back to bite later, and that particular one is likely to cause trouble when Let's Encrypt switch to their X2 CA in due course.
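To make the effect of that workaround concrete, here is a rough Python sketch of what "cutting a certificate out of fullchain.pem" amounts to. It is not what Certbot does internally; it simply treats the file as a sequence of PEM blocks and assumes the cross-signed certificate is the last block. The helper names split_chain and drop_last_cert are hypothetical, and the PEM bodies are dummy placeholders.

```python
import re
from typing import List

# Matches one PEM certificate block, including its trailing newline if present.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----\n?", re.DOTALL
)


def split_chain(pem_text: str) -> List[str]:
    """Split a fullchain-style PEM file into individual certificate blocks."""
    return PEM_BLOCK.findall(pem_text)


def drop_last_cert(pem_text: str) -> str:
    """Return the chain without its final block (assumed to be the cross-signed cert)."""
    blocks = split_chain(pem_text)
    if len(blocks) < 2:
        raise ValueError("expected at least a leaf and one issuer certificate")
    return "".join(b if b.endswith("\n") else b + "\n" for b in blocks[:-1])


def dummy_block(body: str) -> str:
    # Placeholder content only; a real block holds base64-encoded DER data.
    return f"-----BEGIN CERTIFICATE-----\n{body}\n-----END CERTIFICATE-----\n"


chain = dummy_block("LEAF") + dummy_block("R3") + dummy_block("CROSSSIGNED")
shorter = drop_last_cert(chain)
print(len(split_chain(chain)), len(split_chain(shorter)))  # 3 2
```

With the shorter chain the server no longer hands clients a path that leads to DST Root CA X3, so the client has to find the issuer of R3 in its own cacerts.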

If you've an appetite for more gory details on why things weren't working then check out this comment on the GitHub issue for "HandshakeException: Handshake error in client (OS Error: CERTIFICATE_VERIFY_FAILED: unable to get local issuer certificate".

My colleague's suggestion

What if we just delete DST X3 from cacerts?

I totally didn't expect this to work. What I thought would happen is Dart HTTP would walk the chain of trust to the DST root CA X3, find it missing from cacerts and return an error based on that.

I also thought about how long it would take me to set up a test environment to prove that it wouldn't work.

But I already had a test environment

All I needed to do was implement my colleague's (doomed to fail) suggestion, and let the CI take care of it.

I had to back out the fullchain.pem workaround on the CI VMs, but that took seconds (just like making the change and submitting the PR that would send it into CI).

It would only be a few minutes before I could watch the tests fail again, then smugly tell my colleague that his suggestion didn't work.

Except the tests passed. It turns out that Dart HTTP is smart enough to figure out a correct chain of trust to a valid CA in cacerts.
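One way to reconcile the earlier failure with this success is the toy model below. It is a deliberately simplified Python sketch, not the algorithm Dart or its TLS library actually uses: there are no signatures, names stand in for keys, and every date except the DST Root CA X3 expiry is made up. The verifier first follows the chain the server sent; only if that path ends at an issuer it doesn't know does it look for a shorter path to something in cacerts.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str
    not_after: date


def verify(served: List[Cert], trust_store: Dict[str, Cert], today: date) -> str:
    """Toy path check: served is leaf-first, trust_store is keyed by subject."""
    for cert in served:
        if cert.not_after < today:
            return "expired: " + cert.subject
    # First try: follow the served chain to its end and look up that issuer.
    anchor = trust_store.get(served[-1].issuer)
    if anchor is not None:
        if anchor.not_after < today:
            return "expired: " + anchor.subject
        return "ok via " + anchor.subject
    # Fallback: the served path ends at an unknown issuer, so look for a
    # shorter path whose issuer is a valid entry in the trust store.
    for cert in reversed(served[:-1]):
        anchor = trust_store.get(cert.issuer)
        if anchor is not None and anchor.not_after >= today:
            return "ok via " + anchor.subject
    return "no trusted issuer found"


today = date(2021, 10, 5)  # illustrative "after the expiry" date
future = date(2030, 1, 1)  # illustrative validity for everything else
served = [
    Cert("ci.example.test", "R3", future),        # hypothetical leaf name
    Cert("R3", "ISRG Root X1", future),
    Cert("ISRG Root X1", "DST Root CA X3", future),  # cross-signed copy
]
isrg = Cert("ISRG Root X1", "ISRG Root X1", future)
dst = Cert("DST Root CA X3", "DST Root CA X3", date(2021, 9, 30))

store_with_dst = {"ISRG Root X1": isrg, "DST Root CA X3": dst}
store_without_dst = {"ISRG Root X1": isrg}

print(verify(served, store_with_dst, today))     # expired: DST Root CA X3
print(verify(served, store_without_dst, today))  # ok via ISRG Root X1
```

In this model, deleting the expired root is what lets the verifier reach the fallback, which matches what the CI run showed, whatever the real implementation does under the hood.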

Hurray though, no need for a workaround any more.

This is making me rethink my testing philosophy

How often have I been building local test harnesses when I should just send stuff through CI (even if it means adding more tests to the CI, which is generally a good thing)?

How often have I not even tried something in the expectation that it will fail, when it just might work, and the CI means I don't have to expend much effort to find out?