Disaster recovery (DR) often lives in the shadows of DevOps, where resilience is a goal but is seldom treated with the same rigor as other operational priorities. However, recent high-profile failures have proven once again how crucial it is to reimagine DR strategies: not just as data backups, which is what most organizations imagine DR to be, but as codified systems encompassing infrastructure, configurations, and operational readiness.
The Evolution of Disaster Recovery
In the era of on-premises IT, disaster recovery hinged on mirrored data centers. These “hot-hot” sites offered real-time failover, but they were resource-intensive. The shift to cloud computing promised built-in resilience, with providers like AWS and Azure offering availability zones and multi-region redundancy. Yet, as the CrowdStrike and UniSuper incidents show, relying solely on cloud providers for DR introduces vulnerabilities, especially in today’s multi-cloud, hybrid architectures.
The promise of cloud simplicity often blinds organizations to the complex web of dependencies spanning Kubernetes, serverless functions, third-party SaaS, and infrastructure-as-code (IaC). While cloud providers handle physical resilience, the shared responsibility model leaves gaps in business continuity planning.
Beyond Data Backups: The Role of IaC
Traditional DR practices focus on data: replicating databases and virtual machines. It’s no surprise, then, that many companies have sprung up to answer this need, from Clumio to HYCU, Cohesity, and Rubrik (you get the idea), all of which back up your data. However, as the UniSuper incident demonstrated, data backups alone are not enough to avoid severe downtime. These solutions often support three to seven core cloud services, but modern organizations rely on dozens, if not hundreds, of interconnected services. What happens to the Kubernetes configurations? What about infrastructure as code?
Backing up data is essential but insufficient. Without the supporting infrastructure (IAM roles, networking configurations, storage buckets, etc.), recovered data remains unusable. Organizations that do not also back up these critical elements leave significant gaps in their disaster recovery plans.
Having already revolutionized the way we configure, deploy, and manage infrastructure, infrastructure-as-code is now emerging as the linchpin of modern DR strategies. By codifying cloud configurations with Terraform, Helm charts, or similar tools, organizations can:
- Version and back up entire infrastructure states.
- Redeploy systems rapidly, minimizing downtime.
- Align configurations with business continuity goals.
Take UniSuper’s week-long recovery from a Google Cloud glitch. While their data was secure, the lack of a production-ready infrastructure backup prolonged the downtime, which makes the case for backing up IaC alongside data.
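To make "backing up IaC alongside data" a little more concrete, here is a minimal sketch of a versioned snapshot manifest for a set of IaC files. It is an illustration under stated assumptions, not a description of any particular backup product: the function names (`build_manifest`, `changed_files`) are hypothetical, and the file contents are passed in as strings rather than read from a real repository.

```python
import hashlib
import json


def file_digest(content: str) -> str:
    """Return the SHA-256 hex digest of one file's text content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def build_manifest(files: dict[str, str]) -> dict:
    """Build a manifest mapping each IaC file path to its digest.

    `files` maps a path to file content. Reading these from a repository
    checkout is left out; that part is an assumption of this sketch.
    """
    entries = {path: file_digest(content) for path, content in sorted(files.items())}
    # The snapshot ID is derived from the sorted entries, so the same
    # set of files always yields the same ID regardless of input order.
    canonical = json.dumps(entries, sort_keys=True).encode("utf-8")
    return {"snapshot_id": hashlib.sha256(canonical).hexdigest(), "files": entries}


def changed_files(old: dict, new: dict) -> list[str]:
    """List paths added, removed, or modified between two manifests."""
    paths = set(old["files"]) | set(new["files"])
    return sorted(p for p in paths if old["files"].get(p) != new["files"].get(p))
```

Storing a manifest like this next to each data backup gives a simple way to tell which infrastructure definition a given data snapshot belongs to, and what changed between two snapshots.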
Building Resilience Through Codification
So what does a modern, end-to-end, operationally ready DR plan actually look like? It includes several elements that are core to restoring rapidly when failure happens.
Below is a shortlist of good practices for your disaster recovery strategy:
- Codify Everything: Ensure all cloud resources—across AWS, Azure, Kubernetes, and SaaS integrations—are codified. Tools like Terraform for cloud infrastructure and Helm for Kubernetes simplify this process.
- Monitor for Drift: Regularly compare IaC configurations with live environments to detect and address deviations.
- Modularize IaC: Use modules and variables to make configurations reusable and agnostic to specific accounts or regions. This approach supports rapid redeployment in different zones or clouds.
- Back Up Continuously: Treat IaC backups like data backups, with regular snapshots stored securely and versioned in repositories like Git.
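The drift-monitoring practice in the list above can be sketched as a comparison between what the code declares and what is actually running. This is a simplified illustration: both inputs are plain dictionaries keyed by a resource ID, and how they are obtained (parsing IaC state, querying a cloud API) is deliberately left out. The function name `detect_drift` and the sample resource IDs are hypothetical.

```python
def detect_drift(declared: dict, live: dict) -> dict:
    """Compare declared resources against live resources.

    Both arguments map a resource ID to a dict of attributes.
    Returns resources missing from the live environment, live resources
    that are not managed in code, and per-attribute differences.
    """
    declared_ids = set(declared)
    live_ids = set(live)

    modified = {}
    for rid in sorted(declared_ids & live_ids):
        want, have = declared[rid], live[rid]
        diffs = {
            key: {"declared": want.get(key), "live": have.get(key)}
            for key in sorted(set(want) | set(have))
            if want.get(key) != have.get(key)
        }
        if diffs:
            modified[rid] = diffs

    return {
        "missing": sorted(declared_ids - live_ids),    # in code, not deployed
        "unmanaged": sorted(live_ids - declared_ids),  # deployed, not in code
        "modified": modified,                          # attribute-level drift
    }


if __name__ == "__main__":
    declared = {
        "bucket/logs": {"versioning": True, "region": "eu-west-1"},
        "role/app": {"policy": "read-only"},
    }
    live = {
        "bucket/logs": {"versioning": False, "region": "eu-west-1"},
        "queue/adhoc": {"retention_days": 4},
    }
    print(detect_drift(declared, live))
```

The "unmanaged" bucket matters most for DR: those are the resources that would not come back if the environment had to be rebuilt from code alone.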
DR and SRE: The Intersection of Metrics, Trust, and Compliance
Let’s think about disaster recovery in the context of Site Reliability Engineering (SRE) practices, or through the lens of DORA metrics like Mean Time to Recovery (MTTR), which have become the backbone of an evolved DevOps practice. MTTR is a critical measure of a team’s ability to respond to incidents and restore services, and it directly reflects how effective a disaster recovery plan (DRP) is.
Modern SRE practices leverage error budgets to balance innovation with reliability. Error budgets allocate acceptable downtime thresholds, and exceeding these budgets can erode client trust, breach compliance requirements, and negatively impact service-level agreements (SLAs). These frameworks can serve as great guiding tools for DR strategies to ensure that recovery plans align with the broader operational goals of maintaining uptime and meeting compliance mandates.
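As a rough illustration of the arithmetic behind error budgets and MTTR, the sketch below derives a downtime budget from an availability target and checks recorded incidents against it. The 30-day window and the function names are assumptions made for this example; real SLO definitions vary by organization.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over a window.

    `slo` is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: list[float], window_days: int = 30) -> float:
    """Budget left after subtracting recorded downtime; negative means exceeded."""
    return error_budget_minutes(slo, window_days) - sum(downtime_minutes)


def mean_time_to_recovery(downtime_minutes: list[float]) -> float:
    """Average time to restore service across the recorded incidents."""
    if not downtime_minutes:
        raise ValueError("at least one incident is required")
    return sum(downtime_minutes) / len(downtime_minutes)
```

With a 99.9% target over 30 days, the budget works out to roughly 43.2 minutes. A single recovery that takes hours, let alone a week, exhausts it many times over, which is why recovery speed belongs in the same conversation as the SLO.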
For example, a robust DRP that codifies infrastructure as code not only accelerates MTTR but also strengthens an organization’s ability to meet regulatory requirements such as GDPR, SOC 2, and ISO 27001. Proactively maintaining DR readiness ensures that organizations can handle incidents without violating compliance obligations or damaging customer trust.
From Fails to Frameworks
Recent large-scale failures and drills offer valuable insights into why codified DR is essential. These examples demonstrate the risk of relying on traditional approaches to resilience and illustrate the need for a modern, IaC-driven strategy:
- CrowdStrike: A flawed software release highlighted the fragility of interconnected stacks. A small dependency on a single vendor led to cascading failures of Microsoft-based systems around the world at an unprecedented scale.
- UniSuper: Although Google had deleted all of their backups on its platform, the DevOps team had fortunately not trusted a single cloud and had also backed up to another cloud vendor. While that backup saved the day, the lack of codified infrastructure significantly prolonged the recovery from the Google Cloud failure, as porting configuration across clouds manually is still no easy feat.
- Hospitality Industry Drill: In September 2023, MGM Resorts suffered a major breach by a well-known hacking ring, largely through a social engineering attack aimed at taking over its Okta services. To prevent a repeat and to raise awareness, a hospitality industry drill was held recently. One outcome of this drill, which simulated an MGM-style breach, was to reveal the necessity of immutable, modular IaC for rapid recovery without account-specific dependencies.
These incidents drive home the importance of preparing for the unexpected and of continuously refreshing our strategies as technology and hackers evolve. They highlight the gaps in conventional DR planning and show that decision-makers need to reassess, and at times hard-reset, their existing DR strategies to align with current requirements and technology practices.
Practical Steps to Get Started
While this may all sound daunting, and easier said than done, there are ways to dispel the FUD around updating and revisiting your DR strategies to align with today’s good practices.
To implement a robust DR strategy:
- Evaluate IaC Coverage: Luckily, we live in an age with plenty of tools built to address these challenges. It’s now easier than ever to assess your current infrastructure for codification gaps (and Firefly is just one of the tools available to help with this).
- Codify Incrementally: You don’t need to build the entire fortress in a single day. Like all technology migrations, start with critical resources and expand to cover SaaS products in your supply chain, Kubernetes, and hybrid setups.
- Adopt Monitoring Tools: Use tools that help you detect drift in real time to ensure live configurations align with IaC.
- Train Teams: Empower DevOps and platform engineering teams to integrate codification into daily workflows and to regularly test backups, from the data to the configs, for operational readiness.
- Test and Iterate: Once you’ve validated your backups, conduct regular, full DR drills to verify recovery timelines and uncover gaps in existing practices and backups.
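For the first step in the list above, evaluating IaC coverage, the underlying calculation can be sketched in a few lines. This is not how any specific tool works; it simply assumes you already have two sets of resource IDs, one from a cloud inventory and one from your IaC, and the name `iac_coverage` is hypothetical.

```python
def iac_coverage(live_ids: set[str], codified_ids: set[str]) -> tuple[float, list[str]]:
    """Return the share of live resources that are codified, plus the gaps.

    `live_ids` are resources discovered in the environment;
    `codified_ids` are resources defined in IaC.
    """
    if not live_ids:
        # Nothing is deployed, so there is nothing left uncodified.
        return 1.0, []
    uncodified = sorted(live_ids - codified_ids)
    covered = len(live_ids) - len(uncodified)
    return covered / len(live_ids), uncodified


if __name__ == "__main__":
    live = {"vpc/main", "db/orders", "bucket/logs", "lambda/resize"}
    codified = {"vpc/main", "db/orders", "bucket/logs"}
    ratio, gaps = iac_coverage(live, codified)
    print(f"coverage: {ratio:.0%}, uncodified: {gaps}")
```

Tracking this ratio over time, and starting with the most critical entries in the gap list, is one simple way to approach incremental codification.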
Once you have a unified, cloud-agnostic backup with a validated ability to restore, along with visibility into your overall cloud asset coverage and codification, you can have far more confidence in your ability to reduce downtime and recovery costs, while also ensuring compliance and stronger security protections.
DR as a Competitive Advantage
Operational downtime today can cost millions and damage reputations, sometimes in ways that aren’t recoverable. DR is no longer optional, but DR practices themselves also need to be revisited and updated from time to time to ensure their operational readiness.
IaC is the backbone of modern resilience, enabling organizations to shift from reactive recovery to proactive readiness. Codifying configurations ensures not only rapid recovery but also the agility to adapt to new challenges, turning DR into a strategic asset.
Infrastructure-as-code is the key to modernizing disaster recovery practices, enabling businesses to achieve faster recovery times, stronger compliance, and greater operational confidence. Proactively codifying, backing up, and testing infrastructure ensures readiness for the unexpected and positions organizations to better survive incidents, which are a question not of if but of when, and of how operationally ready you are to handle them. The future of disaster recovery is codified; it can no longer rest on oral tradition that doesn’t withstand the test of scale or time.