From Oral Tradition to Codifying Disaster Recovery in DevOps: Lessons from the Field

By Firefly

Disaster recovery is no longer just about backups—it's about codifying resilience to recover faster, adapt smarter, and stay ahead of the unexpected.

Published Jun 17, 2025

Disaster recovery (DR) often lives in the shadows of DevOps, where resilience is a goal but seldom treated with the same rigor as other operational priorities. However, recent high-profile failures have proven once again how crucial it is to reimagine DR strategies: not as data backups alone, which is what most organizations take DR to mean, but as codified systems encompassing infrastructure, configurations, and operational readiness.

In partnership with DZone, Firefly recently hosted a webinar in which Co-Founder and CEO Ido Neeman talked all things disaster recovery, from major fails to little-known best practices. Here's a look at what he covered.

The Evolution of Disaster Recovery

In the era of on-premises IT, disaster recovery hinged on mirrored data centers. These “hot-hot” sites offered real-time failover, but they were resource-intensive. The shift to cloud computing promised built-in resilience, with providers like AWS and Azure offering availability zones and multi-region redundancy. Yet, as the CrowdStrike and UniSuper incidents show, relying solely on cloud providers for DR introduces vulnerabilities, especially in today’s multi-cloud, hybrid architectures.

The promise of cloud simplicity often blinds organizations to the complex web of dependencies spanning Kubernetes, serverless functions, third-party SaaS, and infrastructure-as-code (IaC). While cloud providers handle physical resilience, the shared responsibility model leaves gaps in business continuity planning.

Beyond Data Backups: The Role of IaC

Traditional DR practices focus on data: replicating databases and virtual machines. It’s no surprise, then, that many companies have sprung up to answer this need, from Clumio to HYCU, Cohesity, and Rubrik (you get the idea), all of which back up your data. However, as the UniSuper incident demonstrated, data backups alone are not enough to avoid severe downtime. These solutions often support three to seven core cloud services, but modern organizations rely on dozens, if not hundreds, of interconnected services. What happens to the Kubernetes configurations? What about infrastructure as code?

Backing up data is therefore essential but insufficient. Without the supporting infrastructure (IAM roles, networking configurations, storage buckets, etc.), recovered data remains unusable. Organizations that do not back up these critical elements as well leave significant gaps in their disaster recovery plans.

Infrastructure-as-code, having revolutionized the way we configure, deploy, and manage our infrastructure, has done it again, emerging as the linchpin of modern DR strategies. By codifying cloud configurations with Terraform, Helm charts, or similar tools, organizations can:

Version and back up entire infrastructure states.
Redeploy systems rapidly, minimizing downtime.
Align configurations with business continuity goals.

Take UniSuper’s week-long recovery from a Google Cloud glitch. While their data was secure, the lack of a production-ready infrastructure backup prolonged downtime, which makes the case for backing up IaC alongside data.

Building Resilience Through Codification

So what does a modern, end-to-end, operationally ready DR plan actually look like? It includes several elements that are core to restoring rapidly when failure happens.

Below is a shortlist of good practices for your disaster recovery strategy:

Codify Everything: Ensure all cloud resources—across AWS, Azure, Kubernetes, and SaaS integrations—are codified. Tools like Terraform for cloud infrastructure and Helm for Kubernetes simplify this process.
Monitor for Drift: Regularly compare IaC configurations with live environments to detect and address deviations.
Modularize IaC: Use modules and variables to make configurations reusable and agnostic to specific accounts or regions. This approach supports rapid redeployment in different zones or clouds.
Back Up Continuously: Treat IaC backups like data backups, with regular snapshots stored securely and versioned in repositories like Git.
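
To make the drift check concrete, here is a minimal sketch in Python. It assumes both the declared (IaC) state and the live state have already been exported as plain dictionaries keyed by resource name. The resource names, attributes, and the detect_drift helper are hypothetical and only illustrate the comparison. Real tools such as Terraform compute this difference against provider APIs rather than against hand-built dictionaries.

```python
from typing import Any, Dict, List


def detect_drift(declared: Dict[str, Dict[str, Any]],
                 live: Dict[str, Dict[str, Any]]) -> Dict[str, List[str]]:
    """Compare declared (IaC) resources with live ones and list deviations."""
    report: Dict[str, List[str]] = {"missing": [], "unmanaged": [], "modified": []}
    for name, attrs in declared.items():
        if name not in live:
            report["missing"].append(name)    # declared in code, absent in the cloud
        elif live[name] != attrs:
            report["modified"].append(name)   # present, but attributes differ
    for name in live:
        if name not in declared:
            report["unmanaged"].append(name)  # exists in the cloud, never codified
    for names in report.values():
        names.sort()                          # stable output for reviews and alerts
    return report


# Hypothetical example data, not taken from any real environment.
declared = {
    "bucket.logs": {"versioning": True, "region": "us-east-1"},
    "role.app": {"policy": "read-only"},
}
live = {
    "bucket.logs": {"versioning": False, "region": "us-east-1"},
    "sg.manual": {"port": 22},
}

print(detect_drift(declared, live))
# {'missing': ['role.app'], 'unmanaged': ['sg.manual'], 'modified': ['bucket.logs']}
```

The three buckets map to different follow-ups: "missing" resources need redeploying, "modified" ones need reconciling with the code, and "unmanaged" ones are candidates for codification.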

DR and SRE: The Intersection of Metrics, Trust, and Compliance

Let’s think about disaster recovery in the context of Site Reliability Engineering (SRE) practices, or through the lens of DORA metrics like Mean Time to Recovery (MTTR), which have become the backbone of an evolved DevOps practice. MTTR serves as a critical measure of a team’s ability to respond to incidents and restore services, and it directly reflects the effectiveness of a disaster recovery plan (DRP).

Modern SRE practices leverage error budgets to balance innovation with reliability. Error budgets allocate acceptable downtime thresholds, and exceeding these budgets can erode client trust, breach compliance requirements, and negatively impact service-level agreements (SLAs). These frameworks can serve as great guiding tools for DR strategies to ensure that recovery plans align with the broader operational goals of maintaining uptime and meeting compliance mandates.
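
As a rough illustration of how these two measures relate, the sketch below computes MTTR from a list of incident recovery times and the downtime allowed by an availability target over a 30-day window. The incident durations, the 99.9% target, and the function names are assumptions chosen for the example, not figures from the webinar.

```python
from typing import List


def mean_time_to_recovery(recovery_minutes: List[float]) -> float:
    """Average time, in minutes, taken to restore service per incident."""
    return sum(recovery_minutes) / len(recovery_minutes)


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime, in minutes, that an availability SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


# Hypothetical incidents in one 30-day window (minutes of downtime each).
incidents = [42.0, 18.0, 75.0]

mttr = mean_time_to_recovery(incidents)
budget = error_budget_minutes(99.9)
remaining = budget - sum(incidents)

print(f"MTTR: {mttr:.1f} min")                  # MTTR: 45.0 min
print(f"Error budget: {budget:.1f} min")        # Error budget: 43.2 min
print(f"Budget remaining: {remaining:.1f} min") # Budget remaining: -91.8 min
```

In this made-up window, a single average incident already exceeds the whole monthly budget, which is the kind of signal that would justify investing in faster, codified recovery.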

For example, a robust DRP that codifies infrastructure as code not only accelerates MTTR but also strengthens an organization’s ability to meet regulatory requirements such as GDPR, SOC 2, and ISO 27001. Proactively maintaining DR readiness ensures that organizations can handle incidents without violating compliance obligations or damaging customer trust.

From Fails to Frameworks

Recent mega-failures and drills offer valuable insights into why codified DR is essential. These examples demonstrate the risk of relying on traditional approaches to resilience and illustrate the need for a modern, IaC-driven strategy:

CrowdStrike: A flawed software release highlighted the fragility of interconnected stacks. A small dependency from a single vendor led to cascading failures of Microsoft-based systems around the world at a scale never witnessed before.
UniSuper: Even though Google had deleted all of their backups, the DevOps team fortunately didn’t trust a single cloud and had also backed up to another cloud vendor. While that backup saved the day, the lack of codified infrastructure significantly prolonged their recovery from a Google Cloud failure, because porting configuration across clouds manually is still no easy feat.
Hospitality Industry Drill: In September 2023, MGM Resorts suffered a major breach by a well-known hacking ring, largely a social engineering attack aimed at taking over their Okta services. To raise awareness and prevent a repeat, a hospitality industry drill was held recently. The drill, which simulated an MGM-style breach, revealed the necessity of immutable, modular IaC for rapid recovery without account-specific dependencies.

These incidents drive home the importance of preparing for the unexpected and of continuously refreshing our strategies as technology and hackers evolve. They also highlight the gaps in conventional DR planning, and show that decision-makers need to reassess, and at times hard reset, their existing DR strategies to align with current requirements and technology practices.

Practical Steps to Get Started

While this may all sound daunting, and easier said than done, there are ways to dispel the FUD around updating and revisiting your DR strategies to align with today’s good practices.

To implement a robust DR strategy:

Evaluate IaC Coverage: Luckily, we’re in an age where there are plenty of tools built to address these challenges. It’s now easier than ever to assess your current infrastructure for codification gaps (and Firefly is just one of the tools available to help with this).
Codify Incrementally: You don’t need to build the entire fortress in a single day. Like all technology migrations, start with critical resources and expand to cover SaaS products in your supply chain, Kubernetes, and hybrid setups.
Adopt Monitoring Tools: Use tools that help you detect drift in real time to ensure live configurations align with IaC.
Train Teams: Empower DevOps and platform engineering teams to integrate codification into daily workflows, and to regularly test backups, from the data to the configs, for operational readiness.
Test and Iterate: Once you’ve validated your backups, conduct regular, full DR drills to verify recovery timelines and uncover gaps in existing practices and backups.
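
One way to approach the first two steps is to measure how much of each resource type is already codified and start with the biggest gaps. The Python sketch below assumes you already have an inventory of cloud assets flagged as codified or not. The inventory contents, the tuple layout, and the coverage_by_type helper are hypothetical and stand in for whatever your asset inventory tool exports.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def coverage_by_type(assets: Iterable[Tuple[str, str, bool]]) -> Dict[str, float]:
    """Return the fraction of codified assets per resource type.

    Each asset is a (resource_type, name, is_codified) tuple.
    """
    totals: Dict[str, int] = defaultdict(int)
    codified: Dict[str, int] = defaultdict(int)
    for resource_type, _name, is_codified in assets:
        totals[resource_type] += 1
        if is_codified:
            codified[resource_type] += 1
    return {t: codified[t] / totals[t] for t in totals}


# Hypothetical inventory, for illustration only.
inventory = [
    ("iam_role", "app", True),
    ("iam_role", "ci", False),
    ("bucket", "logs", True),
    ("bucket", "backups", True),
    ("k8s_deployment", "api", False),
]

coverage = coverage_by_type(inventory)

# List the least-covered resource types first, as candidates to codify next.
for resource_type, fraction in sorted(coverage.items(), key=lambda kv: kv[1]):
    print(f"{resource_type}: {fraction:.0%} codified")
# k8s_deployment: 0% codified
# iam_role: 50% codified
# bucket: 100% codified
```

Coverage alone does not say which resources are critical, so in practice the ordering would be weighted by business impact rather than by percentage only.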

Once you have a unified, cloud-agnostic backup, a validated ability to restore, and visibility into your overall cloud asset coverage and codification, you can have far more confidence in your ability to reduce downtime and recovery costs, while also ensuring compliance and stronger security protections.

DR as a Competitive Advantage

Operational downtime today can cost millions and damage reputations, sometimes in ways that aren’t recoverable. DR is no longer optional, but DR practices themselves also need to be revisited and updated from time to time to ensure their operational readiness.

IaC is the backbone of modern resilience, enabling organizations to shift from reactive recovery to proactive readiness. Codifying configurations ensures not only rapid recovery but also the agility to adapt to new challenges, turning DR into a strategic asset.

Infrastructure-as-code is the key to modernizing disaster recovery practices, enabling businesses to achieve faster recovery times, stronger compliance, and greater operational confidence. Proactively codifying, backing up, and testing infrastructure ensures readiness for the unexpected and positions organizations to better survive incidents, which are a question not of if but of when, and of how ready you are to handle them. The future of disaster recovery is codified; it can no longer rest on oral tradition that doesn’t withstand the test of scale or time.

For an even deeper dive, watch the "Cloud Disaster Recovery: Lessons in Resilience from Recent Fails" webinar, on demand and in full.