Large software engineering organizations face monitoring challenges that add friction and complexity when managing systems at scale. As companies grow, monitoring runs into the same problems as managing code at scale: multiple engineers, or even entire engineering organizations, work on the same configurations and dashboards, which makes it difficult to manage ownership and track changes. The result can be inconsistent priorities, performance objectives, and measurements.

Engineering and infrastructure went through a similar evolution, which brought us into the era of Git, collaborative development, infrastructure as code (IaC), and GitOps, where infrastructure is managed in much the same way as application code. Monitoring, a discipline that is a subset of DevOps and production operations, can derive similar benefits from being managed as code, and this should become the norm. By managing your monitoring and observability through GitOps, you’ll gain improved governance, change management, and proper synchronization between development and production.

In this post, we’ll dive into where monitoring breaks down when it is not managed as code, and provide a concrete example of how you can manage your monitoring as code.

What is Monitoring-as-Code?

Before diving into the benefits and challenges of Monitoring-as-Code, let’s first level-set. What is Monitoring-as-Code? It’s an approach to managing and implementing monitoring and observability practices by treating monitoring configurations and processes as code. This method aligns monitoring with modern DevOps and infrastructure-as-code practices.

Some key aspects of Monitoring-as-Code:

  • Definition and scope: Codification of the entire observability lifecycle, including data collection, diagnosis, alerting, processing, and even automated remediation, all defined as code.
  • Version control: As with other "as code" practices, monitoring configurations can be versioned, shared, and reused, improving visibility, reliability, and repeatability.
  • Integration with CI/CD: Monitoring can be deployed alongside applications via a unified pipeline, allowing early detection of issues during testing and deployment.
  • Declarative approach: Monitoring configurations are specified using high-level syntax, describing what to monitor, when to alert, and how to process data.
  • Tooling: Monitoring-as-Code can be implemented using various tools, including infrastructure-as-code platforms like Terraform, or specialized monitoring tools with code-based configuration options.
  • Scalability and consistency: By codifying monitoring practices, organizations can more easily scale their observability efforts across multiple products and teams while maintaining consistency.
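To make the declarative and version-control aspects more concrete, here is a minimal sketch of an alert definition expressed as data. The `AlertRule` fields, the query string, and the `team-checkout` route are hypothetical and not tied to any specific monitoring tool; the point is only that the rule states what to monitor and when to alert, and that it serializes to stable text that can be reviewed in Git.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AlertRule:
    """A hypothetical, tool-agnostic alert definition kept in version control."""
    name: str
    query: str        # what to monitor
    threshold: float  # when to alert
    for_minutes: int  # how long the condition must hold before alerting
    notify: str       # where the alert is routed


def render(rule: AlertRule) -> str:
    """Serialize with sorted keys so identical rules always produce identical
    text, which keeps diffs in code review small and meaningful."""
    return json.dumps(asdict(rule), indent=2, sort_keys=True)


# Illustrative values only; the query syntax is made up for this example.
high_latency = AlertRule(
    name="checkout-high-latency",
    query="p99_latency_ms{service='checkout'}",
    threshold=500.0,
    for_minutes=5,
    notify="team-checkout",
)

if __name__ == "__main__":
    print(render(high_latency))
```

In practice the rendered definition would be committed to a repository and translated into the format your monitoring tool expects by whatever pipeline you use.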

Common Monitoring Challenges at Scale

As organizations grow, the number of applications, systems, and services they manage can increase significantly. All of these applications, particularly those with direct user or business impact, need to be monitored properly to ensure consistent operation and adherence to SLAs. It can be challenging to collect and analyze data from all of these sources and to identify issues quickly.

That’s why in large engineering organizations, it's common for multiple teams to share monitoring dashboards to get a holistic view of system performance. However, this can create complexity around ownership and access controls. Teams may have different priorities and may need different views of the same data, which can make it challenging to create shared dashboards that meet everyone's needs. Additionally, managing permissions and ensuring that only authorized personnel can access sensitive data can be a complex process.

Another very common issue is the sheer number of monitoring tools and technologies available. It can be challenging to select and implement the right tools for an organization's needs, and different engineers often prefer different tools based on their own domain expertise and comfort level. This can lead to tool sprawl and make it difficult to integrate data from different sources.

As organizations grow, this leaves them with disparate monitoring tools and services, which makes it difficult to maintain a cohesive monitoring strategy and to get a comprehensive view of system performance.

Different teams often use different monitoring tools or follow different monitoring practices, which can create conflicts when it comes to maintaining monitoring systems. For example, one team may prefer one tool or process, while another team prefers a different one. When these tools or processes overlap or conflict, maintenance and troubleshooting become more complex.

All of this is exacerbated when monitoring is managed manually, with little oversight or governance and no change management or change history. Engineers are understandably frustrated when dashboards they were working to optimize are suddenly changed by another engineer, and all of their previous work is lost and cannot be recovered.

Monitoring-as-Code to the Rescue

This is where monitoring as code comes in. It enables engineering organizations to configure, define, and manage monitoring infrastructure and processes as code, just as they already do with infrastructure. This approach provides a number of benefits that can help ensure consistent monitoring operations, including:

  • Standardization
  • Version Control
  • Automation
  • Collaboration

When you configure and manage your monitoring as code, policies and configurations are defined and maintained in version-controlled files. This ensures that all systems and applications are monitored consistently and according to the same standards, reducing the likelihood of errors and inconsistencies. This is also where you can apply engineering-wide GitOps practices, and leverage additional tooling to do so at scale, so that only monitoring that abides by these policies and guidelines can be deployed to production.
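One way to picture such a policy gate is a small check that runs in CI before any monitoring change is merged. The sketch below is illustrative only: the required fields (`owner`, `runbook_url`, and so on) and the allowed severities are assumed policies invented for this example, not rules from any particular tool.

```python
from typing import Dict, List

# Assumed organization-wide policy; adjust to your own standards.
ALLOWED_SEVERITIES = {"info", "warning", "critical"}
REQUIRED_FIELDS = ("name", "owner", "severity", "runbook_url")


def policy_violations(rule: Dict[str, str]) -> List[str]:
    """Return human-readable policy violations for one alert definition."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not rule.get(field):
            problems.append(f"missing required field '{field}'")
    severity = rule.get("severity")
    if severity and severity not in ALLOWED_SEVERITIES:
        problems.append(f"severity '{severity}' is not allowed")
    return problems


def check_all(rules: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map rule name to its violations, listing only the rules that fail."""
    report = {}
    for index, rule in enumerate(rules):
        problems = policy_violations(rule)
        if problems:
            report[rule.get("name") or f"<rule #{index}>"] = problems
    return report


if __name__ == "__main__":
    rules = [
        {"name": "checkout-high-latency", "owner": "team-checkout",
         "severity": "critical", "runbook_url": "https://example.com/runbook"},
        {"name": "orphaned-alert", "severity": "urgent"},
    ]
    for name, problems in check_all(rules).items():
        print(name, problems)
```

A CI job could fail the build whenever `check_all` returns a non-empty report, so that non-compliant definitions never reach production.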

Another benefit of monitoring as code is version control and change management: changes are tracked and versioned over time, with a full history of every change. This makes it easier to understand how dashboards and policies have evolved, to roll back changes if necessary, and to ensure that no dashboards are lost when multiple engineers work on the same dashboards.
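Once dashboards live in a repository as structured files, reviewing a change can go beyond a raw text diff. As a rough sketch, the function below compares two versions of a simplified dashboard dictionary and reports which panels were added, removed, or modified. The structure (a `panels` list whose entries carry a `title`) is a simplification loosely modeled on dashboard JSON, and matching panels by title is an assumption made for brevity.

```python
from typing import Dict, List


def panel_changes(old: dict, new: dict) -> Dict[str, List[str]]:
    """Summarize panel-level differences between two dashboard versions.

    Panels are matched by title, which assumes titles are unique
    within a dashboard.
    """
    old_panels = {panel["title"]: panel for panel in old.get("panels", [])}
    new_panels = {panel["title"]: panel for panel in new.get("panels", [])}
    shared = old_panels.keys() & new_panels.keys()
    return {
        "added": sorted(new_panels.keys() - old_panels.keys()),
        "removed": sorted(old_panels.keys() - new_panels.keys()),
        "modified": sorted(t for t in shared if old_panels[t] != new_panels[t]),
    }


if __name__ == "__main__":
    previous = {"panels": [{"title": "Latency", "unit": "ms"},
                           {"title": "Errors", "unit": "percent"}]}
    current = {"panels": [{"title": "Latency", "unit": "s"},
                          {"title": "Throughput", "unit": "rps"}]}
    print(panel_changes(previous, current))
```

A summary like this can be posted on a pull request so reviewers see at a glance that a panel they own was changed or removed.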

It also makes it possible for multiple team members to work together on monitoring policies and configurations using version control systems and collaboration tools. This can lead to more efficient and effective monitoring processes, as well as better communication and knowledge sharing among team members.

Above all, though, managing monitoring as code unlocks one of the greatest benefits: automation of the many tasks involved in managing monitoring at scale. These include the creation of new monitoring policies, the deployment of monitoring agents, and the collection and analysis of monitoring data. This reduces the burden on IT teams and ensures that monitoring processes can be executed quickly and reliably, with little human error. With platform engineering gaining popularity, it also becomes possible for platform engineering teams to create IaC modules for monitoring, empowering developers to self-serve their monitoring needs, with best practices already incorporated into their services.
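The self-service idea can be sketched as a small reusable "module": a function a platform team maintains that stamps out a standard set of alerts for any service. Everything here is hypothetical, including the function name, the default thresholds, and the shape of the generated definitions; a real implementation would more likely be an IaC module in the tool your organization already uses.

```python
from typing import Dict, List


def standard_alerts(service: str, owner: str, latency_ms: int = 500,
                    error_rate_pct: float = 1.0) -> List[Dict[str, str]]:
    """Generate the baseline alerts every service gets by default.

    The thresholds are illustrative defaults that a team can override.
    """
    return [
        {
            "name": f"{service}-high-latency",
            "owner": owner,
            "severity": "warning",
            "condition": f"p99 latency > {latency_ms} ms",
        },
        {
            "name": f"{service}-high-error-rate",
            "owner": owner,
            "severity": "critical",
            "condition": f"error rate > {error_rate_pct}%",
        },
    ]


if __name__ == "__main__":
    # A developer onboarding a new service only supplies a name and an owner.
    for alert in standard_alerts("checkout", "team-checkout", latency_ms=250):
        print(alert["name"], "->", alert["condition"])
```

The design choice is that best practices live in one place: when the platform team tightens a default, every service that uses the module picks it up on the next deployment.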

Converting Monitoring to As-Code

All of this sounds great, and you’re probably thinking, “Okay, where do I sign up?” One of the greater challenges, however, is converting existing operations to as-code infrastructure. Going forward, it is relatively easy to ensure that all new monitoring and dashboards are created only as code, but what do you do if you already have monitoring systems running in production that you’d like to convert to code?

Below, we’ll demonstrate through a real-world Grafana example how you can convert monitoring to code, leveraging Terraform and Flux for the automation.
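As a rough illustration of one early step in such a conversion, the sketch below takes a dashboard that has already been exported as JSON and prepares it for storage in Git. It is not the Terraform and Flux workflow itself. It assumes the exported JSON carries `id`, `version`, and `title` fields, and that `id` and `version` are specific to the instance the dashboard came from; the helper names and the file-naming scheme are hypothetical.

```python
import json
import re

# Fields assumed to be instance-specific, so they would only add diff noise.
VOLATILE_FIELDS = ("id", "version")


def normalize(dashboard: dict) -> dict:
    """Drop instance-specific fields from an exported dashboard."""
    return {k: v for k, v in dashboard.items() if k not in VOLATILE_FIELDS}


def file_name(dashboard: dict) -> str:
    """Derive a stable, readable file name from the dashboard title."""
    slug = re.sub(r"[^a-z0-9]+", "-", dashboard["title"].lower()).strip("-")
    return f"{slug}.json"


def to_file_content(dashboard: dict) -> str:
    """Render deterministic JSON so repeated exports produce minimal diffs."""
    return json.dumps(normalize(dashboard), indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    exported = {"id": 42, "uid": "abc123", "version": 7,
                "title": "Checkout / Latency", "panels": []}
    print(file_name(exported))
    print(to_file_content(exported))
```

The resulting files can then be committed and referenced from whatever IaC tool deploys your dashboards.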

What We’ve Learned

Ultimately, your monitoring and dashboards are the backbone of your business, and they need to be continuously available. Applying version control and keeping a change history ensures that even when things do go wrong, restoring operations will be quicker and less painful.

Monitoring-as-code provides a more consistent, automated, and collaborative approach to monitoring operations that can help organizations maintain the SLAs, reliability, and availability users expect from today’s systems and applications. You can gain the same benefits from managing monitoring as code as you do from managing any other part of your infrastructure as code. The migration is worth investing in and can even be automated with the right tooling.