On Friday, February 28th, Let’s Encrypt made the tough decision to revoke over 3 million certificates it had issued, due to a bug in the software it uses to validate CAA records. This gave companies relying on Let’s Encrypt under a week to replace these certificates on their endpoints. While this procedure did not necessarily require downtime (depending on the specific server configuration), it did require administrator intervention. Let’s Encrypt and, more broadly, infrastructure-as-code (IaC) solve a lot of classic problems while introducing new, uncharted ones such as this incident. How can companies determine whether Let’s Encrypt is right for them?
We should back up for a second for those unfamiliar with Let’s Encrypt. It is a free service which provides publicly trusted SSL/TLS certificates to endpoints. It relies on the ACME protocol to standardise and automate the process of submitting a certificate signing request (CSR) and receiving the signed certificate in response. Automating this process makes shorter validity periods more palatable, so Let’s Encrypt certificates are valid for 90 days, as opposed to the 1-2 years typical of most other Certificate Authorities.
The ability to issue unlimited publicly trusted certificates for free is a truly awe-inspiring thing for the Internet-of-Things (IoT) community. It allows skilled engineers to deliver products to users that would otherwise have been cost-prohibitive at scale. It enables administrators with advanced scripting skills to remove a time-consuming manual process from their infrastructure. But this does not mean that it is always the appropriate choice.
For many more cautious companies, jumping on the Let’s Encrypt bandwagon presupposes a much larger risk appetite than they would otherwise be comfortable with. For companies like these, this event is a rude awakening. They may have implemented Let’s Encrypt as a short-sighted cost-saving measure. While there are many good reasons to integrate Let’s Encrypt into your environment, cost saving is not one of them. The relatively low price of SSL/TLS certificates is simply a cost of doing business. For companies without centralised monitoring infrastructure and round-the-clock engineers to address problems, it can often be safer to purchase longer-lived certificates in order to minimise risk.
How then can such a serious bug occur? If you’re a regular reader of this blog, you already know that CAA records are a special type of DNS record that you can use to whitelist particular signing authorities for your domain. Without them, any publicly trusted CA can issue a certificate for any domain. Publicly trusted CAs have collectively agreed to check for such records as part of the validation process before issuing a certificate for a domain. This is why Let’s Encrypt was under so much pressure to resolve this issue: their very existence as a publicly trusted Certificate Authority depended on a prompt response.
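To make the whitelisting idea concrete, here is a minimal Python sketch of the decision a CA makes once it has fetched the CAA records for a domain. It is a simplification under stated assumptions, not Boulder's implementation: the function name `may_issue` and the representation of records as `(tag, value)` pairs are hypothetical, and the sketch ignores wildcard (`issuewild`) handling, critical flags, and the walk up the DNS tree that a real CA performs.

```python
def may_issue(caa_records, ca_domain):
    """Return True if the CAA record set permits ca_domain to issue.

    caa_records is a list of (tag, value) pairs, for example
    [("issue", "letsencrypt.org")]. This representation is an
    assumption made for the sketch.
    """
    # Only "issue" properties restrict ordinary (non-wildcard) issuance here.
    issue_values = [value for tag, value in caa_records if tag.lower() == "issue"]

    # No restricting records: any publicly trusted CA may issue.
    if not issue_values:
        return True

    for value in issue_values:
        # Drop optional parameters after ';' and normalise the issuer name.
        issuer = value.split(";", 1)[0].strip().lower()
        if issuer == ca_domain.lower():
            return True

    # Restricting records exist, but none of them name this CA.
    return False
```

With a record set of `[("issue", "letsencrypt.org")]`, the sketch allows `letsencrypt.org` and refuses every other CA. An empty record set allows everyone, which is exactly the default behaviour that CAA records exist to tighten.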
Boulder, Let’s Encrypt’s server-side software, is written in Go and released under an open-source license. We can actually see the changes made (https://github.com/letsencrypt/boulder/pull/4690/files) in order to correctly re-check for the presence of CAA records in the 8 hours before certificate issuance. Such is the nature of software development in the modern world, where a handful of lines of code can affect thousands or millions of organisations around the globe.
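The re-check step can be sketched in a few lines of Python. This is an illustration only, not a port of Boulder's Go code: the helper names `names_needing_recheck` and `recheck_caa`, the dictionary of validation timestamps, and the `check_one` callback are all hypothetical, and the re-check window is passed in as a parameter. Let’s Encrypt’s announcement describes the bug as checking one name repeatedly instead of checking each name once, so the sketch makes a point of visiting every name that needs a re-check.

```python
from datetime import datetime, timedelta


def names_needing_recheck(validated_at, now, window):
    """Return, sorted, the names whose validation is older than `window`.

    validated_at maps each domain name to the datetime at which it was
    last validated (a hypothetical structure for this sketch).
    """
    return sorted(name for name, ts in validated_at.items() if now - ts > window)


def recheck_caa(names, check_one):
    """Run check_one on every name exactly once and return the names that fail.

    check_one is a caller-supplied function that returns True when the
    CAA records for a name still permit issuance.
    """
    return [name for name in names if not check_one(name)]
```

A caller would refuse to issue the certificate if `recheck_caa` returns a non-empty list. The important property is that each name on the certificate is examined individually, so that a restrictive CAA record on any one of them blocks issuance.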
Ultimately, it is important that companies take stock of their threshold for risk. There is no one-size-fits-all guideline here, but as a general recommendation, any company without a 24/7 engineering team should think twice before relying heavily on a third party that has no fiduciary responsibility to it.
You can read more about the incident in Let’s Encrypt’s own announcement: https://community.letsencrypt.org/t/revoking-certain-certificates-on-march-4/114864