Unless you were in a cave (or a deep slumber) on the morning of July 19, 2024, you either heard about or were impacted directly by the CrowdStrike outage: the largest single-vendor outage in recent memory, estimated to cost Fortune 500 companies alone more than $5 billion in direct losses. The outage lasted several hours and impacted some of the most critical systems around the globe, including airports, hospitals, banks, retailers, and many other mission-critical services.

It took CrowdStrike five days to release its preliminary post-mortem, titled “Preliminary Post Incident Review (PIR): Content Configuration Update Impacting the Falcon Sensor and the Windows Operating System (BSOD)”. The first thing that jumped out at me from that headline, as a long-standing and proud DevOps practitioner, was two business-critical words: “Configuration Update”. In other words, a configuration update brought the entire world to its knees.

Given the scale of the outage, some initially assumed it was a cybersecurity attack. It quickly became clear, however, that the paralysis was an operational issue: an update CrowdStrike rolled out at 04:09 UTC on July 19 affected Windows systems at the kernel level, causing BSODs worldwide. In essence, a small misconfiguration of CrowdStrike’s agents.

Airports were in mayhem, with many flights canceled or missed; Delta, American, and United Airlines suffered the greatest losses. Beyond air travel, payments and business transactions were unavailable or stalled as banking applications were taken offline, and healthcare services were impacted directly.

Scaremongering and hype cycles often lead us to believe that cyber threats are the most dangerous, but the truth is that they statistically tend to impact single vendors and result in monetary damage you read about in the next day’s newspaper. Systems operations, DevOps, and infrastructure management, often considered the less glamorous side of engineering, impact your daily life directly when critical systems go down; they won’t be some anecdotal outage or breach story that happened to someone else. Operations failures will ALWAYS hurt. You’ll miss an important flight. You won’t be able to take out money or pay for a service when you really need it. They may impact your healthcare directly.

Operations are central to running critical infrastructure, and despite all of this, they are still largely considered the plumbing of engineering. Until today.

CrowdStrike Outage as the Ops Turning Point

In enterprise security questionnaires, we often boast about our disaster recovery (DR) and business continuity plans. But are we truly prepared? The reality may be far from what we claim.

It takes a single major fiasco to reshuffle the deck. Just as supply chain security was a mere afterthought until the SolarWinds breach, the CrowdStrike outage has pushed DevOps, and specifically Infrastructure-as-Code (IaC) and cloud configurations, to center stage. This is the moment we understand that IaC and cloud configurations cannot be an afterthought: they power your entire business.

This outage taught us all how indebted we should be to our Ops teams, and how little appreciation and respect they really get. Take the recent UniSuper outage, where the pension fund was saved from a Google Cloud goof that deleted its entire cloud subscription, thanks to the DevOps team’s foresight in keeping an additional backup in a completely separate cloud. Without that backup, the company would likely have become a footnote in history. Other companies have been destroyed by such Ops mistakes; the Knight Capital horror story is the classic example.

What can we learn?

Well, the data was saved, because everyone cares about data. Virtual machine images were also backed up, because management cares about backend developers. But the cloud infrastructure itself? All of the VPCs, IAM roles, gateways, queues, DNS records, and the list goes on… they WERE NOT backed up.

Repaving, rewiring, and reproducing the infrastructure code and configurations took a full week, during which the application and services were down the entire time. That is a painful blow to ANY business, let alone a multi-billion-dollar operation.

We’ve now learned the hard way that Ops keeps the world going ’round. So why aren’t Ops and cloud configurations treated the same way as code, despite IaC being largely ubiquitous?

What are we still getting wrong?

Let’s reset our software-centric mindset and stop ignoring cloud infrastructure. Ultimately, it is the mission-critical platform that powers ALL of your business services. Remember, the cloud itself is software-defined infrastructure and should be treated as an extremely important piece of software.

Your Business is Your Cloud Configurations

One interesting thing to note: although UniSuper’s backup in a separate cloud largely saved them from utter ruin, the one thing that was not backed up, their configurations, is what dragged out their time to recovery.

Plenty of companies lead the market in every kind of backup, from code (repositories) to clouds, storage, and data and everything in between, but not a single one offers cloud configuration backup. These configurations, custom to each cloud and environment setup, comprise user-specific access control and privileges, cloud cost optimizations, performance and load-balancing settings, and network configurations, all specific to each organization. It’s no surprise that it took UniSuper an entire week to recover even though all of their data was backed up: the backups were not operational and production-ready from a configuration perspective, and all of that had to be restored manually, largely via oral tradition.
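To make “cloud configuration backup” concrete, here is a minimal sketch of what a configuration snapshot could look like, assuming an AWS environment and the boto3 SDK; the resources selected and the output file name are illustrative assumptions, not a complete backup strategy.

```python
"""Minimal sketch of a cloud configuration snapshot (assumes AWS + boto3).

This is not a full backup tool: it dumps a handful of configuration-bearing
resources (VPCs, route tables, security groups, IAM roles, hosted zones) to a
timestamped JSON file, so that what the environment looked like survives
alongside the data backups. Extend the selection to whatever your recovery
actually depends on; pagination is omitted for brevity.
"""
import json
from datetime import datetime, timezone

import boto3


def snapshot_configuration() -> dict:
    ec2 = boto3.client("ec2")
    iam = boto3.client("iam")
    r53 = boto3.client("route53")

    return {
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "vpcs": ec2.describe_vpcs()["Vpcs"],
        "route_tables": ec2.describe_route_tables()["RouteTables"],
        "security_groups": ec2.describe_security_groups()["SecurityGroups"],
        "iam_roles": iam.list_roles()["Roles"],
        "hosted_zones": r53.list_hosted_zones()["HostedZones"],
    }


if __name__ == "__main__":
    snap = snapshot_configuration()
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_path = f"config-snapshot-{stamp}.json"  # illustrative file name
    with open(out_path, "w") as f:
        # default=str handles datetime objects embedded in the API responses
        json.dump(snap, f, indent=2, default=str)
    print(f"wrote {out_path}")
```

A raw snapshot like this is a stopgap rather than a substitute for fully codifying the environment, but it at least captures state that would otherwise live only in the console and in people’s heads.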

What can we learn from the CrowdStrike and UniSuper outages combined?

Credit: xkcd

We all learned the hard way on July 19th that this popular xkcd meme is pretty close to reality.

Unlike data, which typically has backup protocols, IaC often does not. Yet these configurations are the nuts and bolts that hold the world together, the crown jewels, IT’s most strategic assets. This oversight can have dire consequences, as we’ve learned from UniSuper’s recovery time and the fallout from CrowdStrike. Even if you’re using IaC, if you’re not applying the same rigorous standards as your typical SDLC, you’re missing the true value that managing your infrastructure, literally as code, can deliver.

Developers follow Software Development Life Cycle (SDLC) practices diligently, but when it comes to changing configurations or provisioning infrastructure, those practices are often ignored or considered out of scope. This gap needs to be bridged, starting with ensuring that all configuration management is codified, not just for new deployments but also for legacy systems. Next, the same guardrails of testing, governance, backup, and consistency need to be applied to infrastructure configurations as they are to application code.
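As one concrete illustration of such a guardrail, here is a minimal sketch of a CI policy gate, assuming a Terraform-based stack whose plan has been exported to JSON with `terraform show -json`; the protected resource types and the plan.json file name are illustrative assumptions, not a prescribed list.

```python
#!/usr/bin/env python3
"""Minimal sketch of a CI guardrail for infrastructure changes.

Assumes a Terraform plan exported as JSON, e.g.:
    terraform plan -out=plan.out
    terraform show -json plan.out > plan.json
"""
import json
import sys

# Hypothetical list of resource types we never want an unreviewed change to delete.
PROTECTED_TYPES = {"aws_vpc", "aws_route53_zone", "aws_iam_role", "aws_db_instance"}


def main(plan_path: str = "plan.json") -> int:
    with open(plan_path) as f:
        plan = json.load(f)

    violations = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions and rc.get("type") in PROTECTED_TYPES:
            violations.append(f"{rc['address']}: plan deletes a protected resource")

    for v in violations:
        print(f"POLICY VIOLATION: {v}", file=sys.stderr)
    return 1 if violations else 0  # non-zero exit fails the pipeline


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```

The same pattern extends to tagging rules, allowed regions, or any other policy your organization treats as non-negotiable.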

This means in practice that:

  1. Production-ready backups should not be limited to your data. To minimize your time to recover from a disaster, make sure your cloud and system configurations are also backed up and production-ready. TO BE CLEAR: only a 100% codified cloud is a “Disaster Recovery ready” cloud.
  2. If you haven’t tested it, it’s not production-ready. Test your backups as regularly as you back up your critical cloud assets.
  3. It’s not enough to be doing Infrastructure-as-Code if you’re not applying policy and governance, which are the backbone of consistent operations. That brings us to the next point.
  4. Consistency is key. Cloud is software-driven infrastructure, and if you’re running the same application across clouds and regions, there’s no reason your cloud configurations, infrastructure service versions, or anything else should differ across your environments (see the drift-check sketch after this list). That consistency can be the entire difference between reliability and stability on one side and failure on the other.
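
To make the consistency point tangible, below is a minimal sketch of a cross-environment drift check, assuming each environment’s configuration has already been exported to a JSON document (for example, by a snapshot script like the one above); the file names are illustrative assumptions.

```python
"""Minimal sketch of a cross-environment consistency (drift) check.

Assumes each environment's configuration has been exported to JSON.
In practice you would normalize or exclude fields that legitimately differ
per environment (IDs, ARNs, timestamps) before treating a difference as drift.
"""
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into dotted key paths for easy diffing."""
    items = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            items.update(flatten(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            items.update(flatten(v, f"{prefix}{i}."))
    else:
        items[prefix.rstrip(".")] = obj
    return items


def diff_environments(path_a, path_b):
    with open(path_a) as fa, open(path_b) as fb:
        a = flatten(json.load(fa))
        b = flatten(json.load(fb))
    # Any key present in one environment but not the other, or with a
    # different value, is reported as drift.
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


if __name__ == "__main__":
    # "prod-us.json" and "prod-eu.json" are illustrative file names.
    for key in diff_environments("prod-us.json", "prod-eu.json"):
        print(f"DRIFT: {key} differs between environments")
```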

All of these together can’t guarantee that outages won’t happen, but they will guarantee that when they do (and they always do), your mean time to recovery will be shorter and the impact on business continuity smaller. Disaster recovery will be less daunting, and these good practices will essentially bake resilience into your infrastructure code as much as your application code.

(if your IaC coverage bar looks like this in Firefly, your configurations are not disaster-ready)

Your DevOps Teams are Doing More Than Keeping the Lights On

Disaster Recovery in the cloud is more critical than ever. Yet, how many companies can really guarantee that they are disaster-proof?  

Supply chain security has been a concern for over a decade, but it took the SolarWinds incident for many to realize its critical importance. The CrowdStrike outage provided a similar moment of clarity, highlighting the necessity of not ignoring infrastructure, cloud configurations, and IaC, and even more so, of not taking your Ops teams for granted.

If you’re using IaC but not applying the same rigorous standards as your typical SDLC, you are missing out on its full value and the true benefit of codifying your infrastructure. Beyond ensuring your systems are fully codified, the next critical piece is managing your IaC like any other code: with testing, governance, backups, and consistency across clouds and systems.

Security breaches might make headlines, but operational failures have immediate, tangible impacts on daily life, and Ops failures will always be more painful and far-reaching. Your DevOps & Platform teams are doing more than keeping your lights on; they’re keeping you in business. Empower & celebrate them!