Having worked in the tech industry for over twenty years, I’ve been in the thick of incident response and analysis even when they were not direct roles and responsibilities for me. One of the most common contributing factors for outages is the expiration of SSL/TLS certificates connected to mission critical services. To compound the problem, certificate expiration periods are getting shorter. As laid out by Google in this post on chromium.org, they anticipate shorter certificate expiration periods across the spectrum in the range of 90 days. In this article, we’ll:
- Dive into some stories of incidents caused by certificate expiration
- Review some of the contributing factors for why certificates expire
- Explore mitigations and approaches to renew certificates in an automated fashion
A History of SSL/TLS Certificate Incidents
Way back in the wild west days of the burgeoning commercial internet of the 1990s, security was an afterthought for a good majority of transactions. One of the biggest and fastest growing companies during the dotcom boom was Netscape, releasing the most popular web browser of its time. Along with this came the need for securing communication. Efforts had been made to codify standards for secure communication over TCP connections, but with Netscape’s vast reach and market penetration, it provided the widest platform for adoption of what was dubbed the secure socket layer (SSL). There were…issues with the first release, but the second one stuck, and was soon after followed by SSLv3 that became widely adopted.
With SSLv3 catching on in the early 2000s, certificate expiration became a common problem. I worked for Sendmail, Inc. at the time, and the company had recently released a popular UI for the message transfer agent (MTA) which incorporated certificates for secure communication both between the browser and the UI as well as between UI installations over a custom port and protocol. Self-signed SSL certificates were set to expire after one year. Invariably on day 366, we would receive customer calls in the technical support department about the certificates expiring, even though multiple warnings existed throughout the UI and product documentation. Without automatic regeneration of these certificates, many customers required help to correct the issue. Fortunately, this was limited in scope since the MTA would keep delivering mail.
The same can’t be said for more well known outages over the years. Even though SSLv3 has largely been replaced by a newer standard, TLS (Transport Layer Security), the basic structure of problems related to renewal has remained intact. When critical services in an infrastructure require certificates for communication, whole sites and sometimes even physical infrastructure can go down when they expire. Just a few examples:
- Mac Store Apps Stopped Working Due To Expired Security Certificate (2015)
- All Oculus Rift headsets down! (2018)
- Epic Games - Expiration Date 4-6-2021 (2021)
- Shopify - That Old Certificate Expired and Started an Outage. Here’s What Happened Next. (2021)
- SpaceX Starlink outage (2023)
As with all incidents, the number of contributing factors for SSL/TLS certificate expiration incidents is vast. Using just one example, if you have a certificate expiration interval of 365 days (relatively lengthy), you are (a) less likely to automate the process, and (b) more susceptible to forget to rotate altogether. In a way, certificate expiration intervals can be likened to software development lifecycles. The general consensus in software development has landed on shorter and more frequent release cycles to help reduce risk. The same concept applies to certificate expiration. If, for example, your certificate expires in 365 days, it is entirely understandable that potentially less time will be invested to automate the rotation of the certificate. However, having to spend effort manually to do this every 90 days incurs a significant quarterly penalty, quadrupling the amount of work. This makes it a more attractive proposition to automate this work, thus mitigating at least one avenue of risk related to a manual rotation process.
Another common contributing factor is organizational friction. This can manifest in a number of different ways, including being short staffed and the prioritization of new features over resiliency development. As well, there may be acknowledgement of the need for mitigation efforts, but there might be other resiliency efforts that take precedence due to a perceived notion or quantitative efforts to rationalize the deprioritization of certificate expiration resilience work. As a concrete example, let’s say an organization has the following resiliency initiatives:
- Disaster preparedness and recovery
- Certificate expiration management
- Closing gaps in monitoring
Most organizations do not have a good way to assess the dollar impact of each of these initiatives, so prioritization is typically a conversation amongst stakeholders (with the ever looming specter of new feature work). An estimate of work might be undertaken, and things like disaster recovery might take a back seat to closing gaps in monitoring, since the latter is seen as a more urgently rectified problem that will produce almost immediate value, whereas disaster preparedness or certificate expiration management are seen as future problems that do not have to be dealt with in the immediate term.
Prioritization challenges were seen in an Epic Games outage in 2021, as noted in their excellent publicly available writeup:
Automatic renewals were not enabled for this internal certificate, and the work needed to accomplish this had not been prioritized when identified earlier this year.
A Survey of Remedies
One piece that is typically missing in the search for practices to achieve better resiliency is the intersection of quality practices and learning from incidents. Certificate expirations are a perfect example of justification for utilizing testing methodologies such as chaos testing. To quote again from the writeup of the Epic Games outage:
The TLS certificate that expired created an outage that triggered unexpected behavior in our launcher client. Our investigation revealed our client retry was using linear retry logic instead of the exponential backoff we expected. An additional unexpected bug also caused the request pattern from millions of Epic Games Launcher clients to continuously and endlessly retry until a successful response was received. These two bugs across our client install base created an unintended and unforeseen call pattern. We were effectively DDoSed by our own clients….
In many instances, chaos testing practices will take components down from a system, but they typically won’t touch certificates. Consider a scenario where a regular gameday chaos test is utilized to use an expired certificate in the course of pulling out working components of a system to see how systems behave. This can quickly achieve a better understanding of how the system will react to failure.
In addition to enhanced testing practices, there are many vendors in the space such as Venafi, Sectigo, and AWS (Certificate Manager) that offer automated certificate management solutions (I do not have any relationship with these or other commercial vendors). If you work at a company with sufficient budget and scope to have to manage thousands of (or more!) certificates, then vendor solutions are worth exploring to help wrangle the certificate renewal process. If you just have a few certificates that are publicly accessible, and you would like to set up automated checks that don’t cost much, https://ismycertexpired.com/ and https://github.com/drwetter/testssl.sh offer some good checks.
Certification expiration continues to be an issue across many different industries and software stacks. In conclusion:
- Surprises in complex technology systems happen all the time, and certificate expiration issues have cropped up more and more with the centralization of platforms and services.
- For larger environments, vendors do offer automated certificate management. Likewise, open source alternatives exist to help manage certificates.
- Give consideration to testing strategies to uncover what happens in your system when certificates expire. These learnings can give a good gauge of your system’s resilience.
Best of luck in managing your certificates!