Over many decades of provisioning infrastructure and cloud resources, we’ve learned that doing so manually is tedious and error-prone. This was the impetus for the evolution of tooling, from configuration automation tools like Chef, Puppet and Ansible to infrastructure-as-code (IaC) frameworks like CloudFormation, Terraform, Pulumi and others.

IaC has been the backbone for delivering best practices and guardrails that let engineering teams manage modern, complex infrastructure much as they manage software. It has brought version control, peer review, CI/CD tooling, security vulnerability scanning, immutability and cost projection into infrastructure management. IaC also introduced the practice of deploying environments from a single template with variables, which reduces errors and simplifies operations. This is particularly useful in disaster recovery scenarios: because everything is versioned and managed consistently, environments can be redeployed and recovered quickly.

In this post I’d like to dive into some of the lessons I’ve learned over nearly a decade, starting with managing infrastructure by hand and writing scripts and continuing through the emergence of IaC, which changed the way we all think about and manage infrastructure at scale. I’ve found that a few guiding practices make it possible to manage IaC at modern cloud fleet scale while delivering efficiency and security benefits for your engineering organization.

Use the DRY Pattern

The DRY (Don't Repeat Yourself) pattern has become very popular in software engineering over the last few decades, supported by IDEs, linters, templates and boilerplate generators that help enforce code policies and formatting. Adopting the same pattern for infrastructure as code helps avoid repeated code by modularizing components, which significantly improves maintainability.

As infrastructure scales, managing a large codebase with repeating components becomes cumbersome and error-prone. By using modules, the infrastructure codebase remains clean, organized, and efficient, much like an application codebase. Changes to infrastructure configurations are more straightforward, as a modification to a module is reflected wherever that module is used. This leads to more efficient development cycles, faster deployment times, and a reduced risk of introducing errors during updates.

For instance, a Virtual Private Cloud (VPC) module can be reused across projects, so that each team does not have to write and maintain its own VPC definition. Terraform modules facilitate this reuse, streamlining the management of shared components like VPCs, EC2 instances, and their associated resources.
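Terraform expresses modules in HCL, but the idea can be sketched in a few lines of plain Python to keep the example self-contained. This is only an illustration: the function name `vpc_module`, the tag keys, and the choice to carve subnets four prefix bits narrower than the VPC range are all assumptions made for the sketch, not part of any real module.

```python
from ipaddress import ip_network


def vpc_module(name: str, cidr: str, subnet_count: int = 2) -> dict:
    """Hypothetical 'module': build one VPC definition from a few inputs."""
    network = ip_network(cidr)
    # Assumption for this sketch: each subnet is 4 prefix bits narrower
    # than the VPC range (for example, a /16 VPC yields /20 subnets).
    subnets = list(network.subnets(prefixlen_diff=4))[:subnet_count]
    return {
        "name": f"{name}-vpc",
        "cidr": str(network),
        "subnets": [str(subnet) for subnet in subnets],
        "tags": {"managed-by": "iac", "environment": name},
    }


# Two environments reuse the same definition; only the inputs differ.
staging = vpc_module("staging", "10.0.0.0/16")
production = vpc_module("production", "10.1.0.0/16", subnet_count=3)
```

A fix to the subnet layout or tagging made once inside `vpc_module` reaches both environments, which is the maintainability gain that modules provide.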

The DRY principle in IaC also promotes better version control and peer review. Each module can be versioned independently, allowing for precise tracking of changes and easier rollback if issues arise. Peer review processes are enhanced as team members can focus on specific modules, ensuring higher quality and adherence to best practices. This modular approach also facilitates collaboration, as developers can work on different modules simultaneously without interfering with each other's work.

Utilize the Registry

One of the common features across infrastructure-as-code tools is the registry that comes with each of them. These registries are central component repositories where you can find, share and publish modules and packages for the community to use. They include everything from the most common modules for cloud providers like AWS, Azure and Google Cloud to custom modules and tool-specific components. Nearly all modern IaC platforms provide their own dedicated registries that work seamlessly with the platform. Examples include Terraform, Pulumi, CloudFormation, and even Helm in the Kubernetes ecosystem.

If we take Terraform as an example (it remains the most popular IaC platform to date, even with the license changes and the recent acquisition announcement), utilizing the Terraform Registry can save development time by providing pre-built modules. These modules encapsulate reusable components of infrastructure, ranging from simple configurations like setting up a VPC to complex deployments involving multiple interconnected resources. The registry offers a vast collection of modules that have been tested and validated by other users, providing a reliable foundation for building infrastructure.

Leveraging registries is important because it accelerates the development process by allowing users to adopt proven solutions rather than starting from scratch. By using pre-built modules, teams can quickly implement infrastructure components that adhere to best practices and are optimized for performance and security. This not only reduces the time and effort required to deploy infrastructure but also helps ensure consistency across different environments.

However, it is crucial to scan these public modules for vulnerabilities and misconfigurations before use. Despite the benefits of using shared modules, there is a risk of introducing security issues if the modules contain malicious code or unintended misconfigurations. For instance, a module that creates IAM roles could inadvertently grant excessive permissions, leading to unauthorized access. Therefore, it is essential to conduct thorough security reviews and vulnerability scans of any modules sourced from the Terraform Registry to mitigate these risks.
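As a rough illustration of the kind of check such a review might automate, the sketch below flags overly broad statements in an IAM-style policy document. It assumes the policy has already been extracted from the module as a Python dictionary; the function name, the sample policy, and the rule itself (flag any Allow statement with a bare `*` resource, a bare `*` action, or a service-wide action such as `iam:*`) are illustrative assumptions and no substitute for a dedicated scanner.

```python
def find_risky_statements(policy: dict) -> list:
    """Return Allow statements that use wildcard actions or resources."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]

    def as_list(value):
        return value if isinstance(value, list) else [value]

    risky = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = as_list(statement.get("Action", []))
        resources = as_list(statement.get("Resource", []))
        broad_action = any(a == "*" or a.endswith(":*") for a in actions)
        broad_resource = "*" in resources  # exact match on a bare wildcard only
        if broad_action or broad_resource:
            risky.append(statement)
    return risky


# Made-up policy for illustration only.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "ReadLogs", "Effect": "Allow",
         "Action": ["s3:GetObject"], "Resource": "arn:aws:s3:::example-logs/*"},
        {"Sid": "TooBroad", "Effect": "Allow",
         "Action": "iam:*", "Resource": "*"},
    ],
}

flagged = find_risky_statements(example_policy)
```

Here only the `TooBroad` statement is flagged; the scoped read-only statement passes.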

Maintain Consistency

Maintaining consistency is crucial for managing large-scale infrastructure effectively. Consistent naming conventions and practices not only make the codebase easier to understand and maintain but also facilitate collaboration among team members. This is because, as infrastructure grows, maintaining clarity and organization in the codebase becomes increasingly important. 

Standardizing naming conventions for resources, modules, and variables helps team members understand the purpose and scope of each component, facilitating easier maintenance and collaboration. This consistency reduces confusion and errors, making it easier for new team members to get up to speed and for existing members to manage and update the infrastructure.

Consistent naming conventions will also enable the definition of better processes and practices that can help to track changes and the evolution of the IaC codebase.  For example, once the naming convention is practiced and enforced, it becomes easier to document the system and its changes, automate linters and validators, conduct effective code reviews, modularize, and maintain consistent directory structures and resource tagging.
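Once a convention is written down, it can be enforced mechanically. The following is a minimal sketch of such a validator, assuming a hypothetical convention of `<team>-<environment>-<resource>` in lowercase with hyphens, three allowed environment names, and two required tags. None of these specifics come from a standard; substitute whatever your organization has agreed on.

```python
import re

# Assumed convention: <team>-<environment>-<resource>, lowercase, hyphen-separated.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9]*-(dev|staging|prod)-[a-z0-9]+(-[a-z0-9]+)*")
# Assumed set of tags every resource must carry.
REQUIRED_TAGS = {"owner", "environment"}


def validate_resource(name: str, tags: dict) -> list:
    """Return a list of human-readable problems; an empty list means compliant."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append(f"name '{name}' does not match <team>-<env>-<resource>")
    missing = REQUIRED_TAGS - set(tags)
    if missing:
        problems.append(f"missing tags: {', '.join(sorted(missing))}")
    return problems


ok = validate_resource("payments-prod-vpc", {"owner": "payments", "environment": "prod"})
bad = validate_resource("PaymentsVPC", {"owner": "payments"})
```

A check like this can run in CI against the planned resources, so that naming problems surface in code review rather than in production.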

By implementing these naming conventions and good practices, teams can maintain a clean, organized, and understandable infrastructure codebase. This consistency not only enhances collaboration and reduces errors but also makes the infrastructure more scalable and easier to manage as it grows.

Manage State Files Properly

One aspect that has gained attention recently is state file encryption, following its release in OpenTofu, the open-source fork of Terraform. This has been a long-standing feature request from the Terraform community (with code contributions going as far back as 2016).

The reason for this is that managing Terraform state files properly is a critical aspect of Infrastructure as Code (IaC) best practices. The state file represents the current state of the infrastructure and is essential for tracking and applying changes. Proper management ensures consistency, prevents data corruption, and supports collaborative workflows.

The importance of managing state files well cannot be overstated. Centralized state management gives every developer a shared, up-to-date view of the infrastructure, which allows multiple people to work on the same infrastructure without conflicts.

By keeping the state file consistent and uncorrupted, it’s possible to prevent the issues that arise from concurrent modifications and manual edits, maintaining the integrity of the infrastructure. Proper state management also includes regular backups and versioning, enabling quick recovery in case of accidental deletions, corruption, or other disasters. This minimizes downtime and data loss.

Good Practices for Managing (Terraform/OpenTofu) State Properly

  1. Use Remote State Storage: Instead of storing the state file locally, use remote storage solutions such as AWS S3, Google Cloud Storage, or Azure Blob Storage. Remote storage centralizes the state file, making it accessible to all team members and CI/CD pipelines. This approach ensures everyone works with the same state, preventing conflicts and inconsistencies.
  2. Implement Locking Mechanisms: To prevent concurrent modifications, use a locking mechanism. For instance, when state is stored in AWS S3, a DynamoDB table can be used to lock the state file during updates. Locking ensures that only one process can modify the state at a time, preventing race conditions and data corruption.
  3. Avoid Manual Edits: Although the state file is plain JSON and therefore human-readable, manual edits can lead to corruption. Always use Terraform commands, such as terraform state mv, terraform state rm or terraform import, to make any changes to the state file. This practice maintains the file’s integrity and ensures that changes are applied correctly.
  4. Regular Backups and Versioning: Regularly back up the state file to prevent data loss. Enable versioning on the storage bucket to keep previous versions of the state file automatically. This allows for easy recovery in case of accidental deletions or corruption.
  5. Secure State Files: Ensure the state file is encrypted and access is restricted to authorized users and services. Encrypting the state file protects sensitive information, such as access keys and credentials, from unauthorized access. Implement strict access controls to limit who can read and modify the state file.
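To make the locking practice above concrete, here is a small in-memory sketch of the idea behind a lock table: a lock is written only if none exists for that state path, and it can be released only by whoever holds its ID. This is a simplified model for illustration. The `LockTable` class and its method names are hypothetical, and it is not how Terraform or DynamoDB are implemented.

```python
import uuid
from typing import Optional


class LockTable:
    """In-memory stand-in for a lock table that supports conditional writes."""

    def __init__(self) -> None:
        self._locks = {}  # state path -> (lock ID, owner)

    def acquire(self, state_path: str, owner: str) -> Optional[str]:
        # Conditional write: succeed only if no lock exists for this state path.
        if state_path in self._locks:
            return None
        lock_id = str(uuid.uuid4())
        self._locks[state_path] = (lock_id, owner)
        return lock_id

    def release(self, state_path: str, lock_id: str) -> bool:
        # Only the holder of the matching lock ID may release the lock.
        held = self._locks.get(state_path)
        if held is None or held[0] != lock_id:
            return False
        del self._locks[state_path]
        return True


table = LockTable()
first = table.acquire("prod/network.tfstate", owner="ci-pipeline")
second = table.acquire("prod/network.tfstate", owner="alice")  # refused while locked
```

The second caller gets `None` and has to wait or retry, which is what stops two applies from writing the same state at once.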

By following these high-level best practices, organizations can manage Terraform state files effectively, ensuring the consistency, security, and availability of their infrastructure. Proper state management supports robust and scalable infrastructure deployments, facilitates collaboration, and enhances overall infrastructure integrity. 

While the specific implementation details and tools for state management vary across IaC platforms, the underlying principles of maintaining a consistent, reliable, and up-to-date infrastructure state are universally important. This ensures that the infrastructure remains robust, scalable, and aligned with the defined configurations.

Leverage Data Sources

Leveraging data sources is a powerful strategy in Infrastructure as Code (IaC) management, applicable not only to Terraform but to other IaC tools as well. Data sources allow IaC configurations to dynamically query and retrieve information from cloud providers and APIs, which enhances the flexibility, adaptability, and maintainability of the infrastructure. This approach minimizes the hardcoding of values such as AMI IDs or network configurations, ensuring that the infrastructure always uses the most current and accurate data. The result is fewer errors and simpler updates, which makes the codebase more efficient to manage.

The benefits of using data sources are evident across different IaC tools. Whether using Terraform, Pulumi, AWS CloudFormation, or Azure Resource Manager, incorporating data sources helps create more dynamic and reusable configurations. These configurations can be adapted to various environments without modification, maintaining consistency and promoting best practices. For example, by querying for the latest VM images or network IDs, configurations stay current with minimal manual intervention, supporting both development and production environments seamlessly.
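The lookup that a data source performs can be sketched as a query over provider metadata instead of a hardcoded ID. In the sketch below the list of images stands in for what a cloud API would return. The image IDs, names, and the `latest_image` helper are all made up for illustration.

```python
from datetime import datetime


def latest_image(images: list, name_prefix: str) -> dict:
    """Pick the most recently created image whose name starts with name_prefix."""
    candidates = [image for image in images if image["name"].startswith(name_prefix)]
    if not candidates:
        raise LookupError(f"no image found with prefix '{name_prefix}'")
    return max(candidates, key=lambda image: datetime.fromisoformat(image["created"]))


# Stand-in for an API response; every value here is hypothetical.
available_images = [
    {"id": "ami-aaa111", "name": "base-linux-2024-01", "created": "2024-01-15T08:00:00"},
    {"id": "ami-bbb222", "name": "base-linux-2024-03", "created": "2024-03-01T08:00:00"},
    {"id": "ami-ccc333", "name": "gpu-linux-2024-04", "created": "2024-04-10T08:00:00"},
]

selected = latest_image(available_images, "base-linux-")
```

Because the configuration asks for "the newest base-linux image" rather than naming `ami-aaa111`, it picks up `ami-bbb222` without any change to the code once that image is published.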

Additionally, leveraging data sources aids in mitigating infrastructure drift, a common challenge in IaC. Drift occurs when the actual state of the infrastructure diverges from the state defined in the IaC configuration, leading to inconsistencies and potential security risks. Regular drift detection through tools and integrated checks in CI/CD pipelines ensures that any changes are promptly identified and rectified, maintaining the integrity and reliability of the infrastructure. This universal approach to IaC management through data sources ensures consistency and security of deployments across various platforms and tools.
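At its core, a drift check is a comparison between what the configuration declares and what the provider reports. The sketch below shows that comparison for a single resource, assuming both sides have already been flattened into attribute dictionaries. The attribute names, the sample values, and the `detect_drift` helper are hypothetical.

```python
ABSENT = "<absent>"  # marker for an attribute that exists on only one side


def detect_drift(declared: dict, actual: dict) -> dict:
    """Return the attributes whose declared and actual values differ."""
    drift = {}
    for key in sorted(declared.keys() | actual.keys()):
        want = declared.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = {"declared": want, "actual": have}
    return drift


# Hypothetical attributes of one security group.
declared_state = {"ingress_port": 443, "cidr": "10.0.0.0/16", "owner": "platform"}
actual_state = {"ingress_port": 443, "cidr": "0.0.0.0/0", "debug": True}

report = detect_drift(declared_state, actual_state)
```

In this example the report lists three drifted attributes: a `cidr` that was widened outside of IaC, a `debug` flag that was never declared, and an `owner` attribute that is missing from the live resource. A CI job could fail whenever such a report is non-empty.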

Don’t Reinvent the Wheel: A Decade of IaC Lessons

Managing infrastructure as code (IaC) has transformed cloud operations at scale over many years. The journey from manual provisioning to advanced IaC tools like Terraform, Pulumi, and CloudFormation has brought about a shift in how modern infrastructure is managed. Through much experience in managing IaC at scale, we have learned that adopting best practices such as the DRY pattern, utilizing registries, maintaining consistency, managing state files properly, and leveraging data sources helps engineering teams achieve greater efficiency, security, and scalability.

By implementing these practices, engineering teams can effectively manage complex infrastructure, streamline operations, and enhance the overall robustness of their deployments. These lessons, learned from writing thousands of lines of IaC, provide a foundation for building resilient, scalable, and secure cloud environments. They ultimately enable organizations to operate more efficiently, respond more swiftly to changes and incidents, and recover more rapidly from downtime, at a time when digital services are becoming ever more critical to our day-to-day lives.