Why Observability Matters to Site Reliability Engineers
Helen Beal | July 21, 2020

This is the first in a three-post series themed around Ops-led DevOps, where I’ll explore the relationship between observability and a set of software delivery lifecycle (SDLC) practices that support the adoption of DevOps and the transition from project-centric to product-centric ways of working. I’ll start with Site Reliability Engineering, move on to Value Stream Management and finish with Continuous Delivery.

Defining Observability

Let’s start by defining observability. A key challenge when working with software is that it’s invisible. Observability acknowledges that, and demands that engineers consciously code their product to emit metrics and logs that allow them to observe the invisible. This aligns with the DevOps goal to have ‘telemetry everywhere’; that is, the active collection of remote data from all parts of a system. It sounds like monitoring, but it’s more than that: it doesn’t just tell you whether a service is working, it gives you the data to discover root causes and potential solutions. Traditional monitoring alerts the operator to potential problems, but the onus is on the operator to go and figure out what they are. Observability and AIOps collate and analyze the data (from multiple monitoring systems that likely don’t share information with each other effectively) on behalf of the operator, contextualize the problem and provide guidance on how to act.
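As a rough illustration of what ‘consciously coding a product to emit metrics and logs’ can look like, here is a minimal Python sketch that uses only the standard library. The service name `checkout`, the helper `observed` and the field names are hypothetical choices for this example, not part of any particular observability product; a real system would more likely send these events to a metrics or logging backend.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("checkout")  # hypothetical service name


@contextmanager
def observed(operation, **context):
    """Emit one structured event per operation: name, outcome, duration, context."""
    start = time.perf_counter()
    event = {"operation": operation, **context}
    try:
        yield event  # the caller may attach extra fields to the event
        event["outcome"] = "success"
    except Exception as exc:
        event["outcome"] = "error"
        event["error_type"] = type(exc).__name__
        raise  # observing a failure must not swallow it
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        logger.info(json.dumps(event, sort_keys=True))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    with observed("charge_card", order_id="A-123") as event:
        event["amount_cents"] = 1999  # illustrative field
```

Because every event carries the same keys (operation, outcome, duration), downstream tooling can aggregate and correlate them instead of leaving an operator to read free-text log lines. The context values are assumed to be JSON-serializable.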

Defining Site Reliability Engineering

Site Reliability Engineering (SRE) started at Google, which has generously shared its learning and models with the community. It has a complex relationship with DevOps, and practitioners are frequently called upon to define the differences between the two. DevOps has evolved to be more than agile system administration and to be concerned with the optimization of flow from idea to realization in a value stream whilst balancing throughput and stability. SRE is primarily concerned with the reliability of the product and has specific practices around toil reduction, Service Level Objectives (SLOs), error budgets and the ‘wisdom of production’ that improve the Ops part of DevOps. SRE and DevOps, therefore, are complementary and, rather than getting hung up on the jargon, teams should be encouraged to discover their own constraints and run improvement experiments based on the industry’s learning.

Culture and Humans

Sidney Dekker introduced the world to the concept of safety culture in his work ‘The Field Guide to Understanding Human Error’. A safety culture is one where people are not fearful of sharing bad news, where it’s accepted that systems fail and where every incident is seen as a learning opportunity. An organization can have an observability culture too, where everyone knows that the systems they build or use are built and maintained with observability in mind - it’s a cultural norm and it’s not just for the IT Operations people.

Safety culture, observability culture and SRE culture all share the attitude that incidents are an investment. Organizations should analyze their posture on three things: tolerating failure as a source of learning, transparency and visibility into both work and systems, and building resilience in. In addition, teams should consider how the organizational design, the assignment of roles and job titles, and the construction of small, autonomous, multi-functional teams affect cognitive load and the flow of value.

SRE introduces us to the concept of toil, which Google defines as: “... the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

Toil is demotivating and can cause burnout. Eliminating toil, therefore, should also be a cultural improvement imperative.

SRE also centers on Service Level Objectives (SLOs) for target reliability levels and associates error budgets with them, which define what happens when SLOs are not met. It’s essential that all stakeholders agree on both the SLOs and the error budgets. Not only does the process of gaining agreement provide a platform for conversation and collaboration, but the ongoing inspection also drives adaptation, and all of this activity builds trust. Trust is the key factor in a healthy DevOps culture since high trust promotes low friction which, in turn, lowers the cost of operation.
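To make the relationship between an SLO and its error budget concrete, here is a minimal Python sketch. It assumes the common request-based reading, in which the budget is the share of requests allowed to fail (one minus the SLO target) over an agreed window. The class name `ErrorBudget`, its fields and the sample numbers are illustrative assumptions rather than anything prescribed by the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served in the agreed window
    failed_requests: int  # requests that violated the SLO in that window

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates over the window."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent (negative once overspent)."""
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / self.allowed_failures

    @property
    def exhausted(self) -> bool:
        return self.failed_requests > self.allowed_failures


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"budget remaining: {budget.remaining_fraction:.0%}")
    print(f"exhausted: {budget.exhausted}")
```

What a team does when `exhausted` becomes true (for example, pausing feature releases in favor of reliability work) is exactly the kind of policy the stakeholders need to agree on up front.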

In order to build trust, teams need to collaborate effectively within themselves and with each other; value stream-centric thinking and tools such as value stream mapping and management can support collaborative outcomes, as can ChatOps (the integration of instant group messaging with the DevOps toolchain). Organizations should think of ‘NoOps’ (where all IT Operations work is automated away through developer access to cloud technology) as an organizational antipattern and instead focus on forming multi-functional people and teams where ‘they build it, they own it’.

Value Streams and Processes

It’s tempting to think that observability sits at the end of the value stream, in the live or production environments, and that SRE is only concerned with production. But, as established, DevOps is concerned with the end-to-end value stream, from ideation to value realization. Additionally, a key tenet of DevOps is to ‘shift left’; that is, to perform checks on quality as early as possible in the delivery cycle. Quality isn’t just about functional requirements, but also non-functional requirements such as performance, security and compliance. What we draw from this is that neither observability nor SRE is concerned only with the production environments: to properly ensure reliability in the live systems, reliability must be considered from the earliest parts of the value stream. Preproduction must be observed too.

Toil is time hungry. Toil causes queues, and queues both create wait time and indicate hand-offs between teams. Wait time and hand-offs are both causes of delay in value streams.

At the process level, if we focus on incident management, SREs want to both minimize the volume of incidents and also fix them as quickly as possible. They are likely to have SLOs specific to downtime and also Mean Time to Recovery (MTTR).
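As a small worked example of the MTTR measure mentioned above, the following Python sketch averages recovery time over a set of incidents. It assumes each incident is recorded as a (detected, resolved) pair of timestamps; the function name and the sample incidents are hypothetical, and real incident tooling may define the start and end of an incident differently.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average recovery time over (detected_at, resolved_at) datetime pairs.

    Returns None when there are no incidents, since a mean is undefined.
    """
    durations = [resolved - detected for detected, resolved in incidents]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    # Illustrative incident records, not real data.
    incidents = [
        (datetime(2020, 7, 1, 9, 0), datetime(2020, 7, 1, 9, 30)),
        (datetime(2020, 7, 8, 14, 0), datetime(2020, 7, 8, 15, 30)),
    ]
    print(f"MTTR: {mean_time_to_recovery(incidents)}")
```

Tracking this value per service and per window gives teams a number to compare against an MTTR-specific SLO.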

Automation and Tooling

Toil is dangerous. Manual tasks are onerous and repetitive and frequently error-prone. Google states that:

“Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time. At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features.”

That engineering work is automation. It makes sense not to build things that have already been built, so if tools are available for, say, monitoring, particularly open source ones such as Prometheus or Grafana, teams should use them.

The SRE goal of reducing toil, in the context of observability, can be met by reducing the effort involved in consuming and interpreting the data that comes from monitoring system performance and reliability - something AIOps resolves. The goal of adding service features can be met both by injecting observability capability into existing products and by providing capability-rich observability platforms, such as AIOps, to the teams. These services then also help the teams meet their SLOs and avoid using their error budgets.

The service desk needs to be part of the DevOps pipeline and toolchain. Data discovered here, that affects value realization, should be fed back into the product backlog to influence the next iteration of innovation.

How Observability Helps Organizations Adopt DevOps and SRE Practices

Embracing observability supports SREs’ goals by:

  • Reducing the toil associated with incident management - particularly around cause analysis - improving uptime and MTTR
  • Providing a platform for inspecting and adapting according to SLOs and ultimately improving teams’ ability to meet them
  • Offering a route to improvement when SLOs are not met and error budgets are overspent
  • Relieving team cognitive load when dealing with vast amounts of data - reducing burnout
  • Releasing humans and teams from toil, improving productivity, innovation and the flow and delivery of value
  • Supporting multifunctional, autonomous teams and the “we build it, we own it” DevOps mantra
  • Completing the value stream cycle by providing insights around value outcomes that can be fed back into the innovation phase

What to Do Next

Moogsoft is the AI-driven observability leader that provides intelligent monitoring solutions for smart DevOps. Moogsoft delivers the most advanced cloud-native, self-service platform for software engineers, developers and operators to instantly see everything, know what’s wrong and fix things faster.

About the author


Helen Beal

Helen Beal is a DevOps and Ways of Working coach, Chief Ambassador at DevOps Institute and an Ambassador for the Continuous Delivery Foundation. She provides strategic advisory services to DevOps industry leaders and is an analyst at Accelerated Strategies Group. She hosts the Day-to-Day DevOps webinar series for BrightTalk, speaks regularly on DevOps topics, is a DevOps editor for InfoQ and also writes for a number of other online platforms. Outside of DevOps she is an ecologist and novelist.
