In the last article in this series on Site Reliability Engineering (SRE) and observability, Jason Bloomberg deconstructed the fundamentals of the SRE approach and why they are essential to the process of developing modern software. In it, he hinted at the indispensability of observability for quantifying the reliability vs. innovation trade-off.
That trade-off takes the form of tension: the tension between the need to rapidly deliver a continuous stream of software releases and updates, and the impact that process can have on the reliability, availability, and performance of those applications.
As Bloomberg explained, managing this balance demands that organizations establish service level objectives (SLOs) that define acceptable service performance criteria and that they create an error budget that represents the difference between these SLOs and 100% performance.
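To make that arithmetic concrete, here is a minimal sketch in Python, using illustrative numbers, of how an error budget falls out of an availability SLO:

```python
# Illustrative numbers: a 99.9% availability SLO over a 30-day window.
SLO_TARGET = 0.999
WINDOW_DAYS = 30

window_minutes = WINDOW_DAYS * 24 * 60
error_budget_minutes = (1 - SLO_TARGET) * window_minutes

print(f"Error budget: {error_budget_minutes:.1f} minutes of unavailability "
      f"per {WINDOW_DAYS}-day window")
# -> Error budget: 43.2 minutes of unavailability per 30-day window
```

Every minute of downtime draws against those 43.2 minutes; once the budget is exhausted, the release pipeline is expected to slow down in favor of reliability work.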
As Bloomberg alluded to, the challenge is that for this process to work, SREs must be able to accurately and continuously quantify their performance and the expenditure of their error budget. But as many SREs are finding, the devil is in the details of that quantification, and it can often leave them exposed.
Bloomberg explained that observability might be the answer to this challenge. But why is that, and what do SREs need to understand to put it to work for them?
Observability vs. Monitoring — and Why It Matters to SREs
Before we can dive into the ins and outs of observability, we need to address the monitoring elephant in the room. The first questions that tend to surface when it comes to observability in the SRE context are, “Isn’t monitoring sufficient?” or “Isn’t observability just a fancy new term for monitoring?”
The answer to both of these questions is a resounding “No!” But it’s clear that many SREs aren’t getting the message. In fact, according to some reports, as few as 50% of SREs are currently leveraging observability in their practice.
Here’s the difference. Monitoring is based on the idea that you can pre-determine the potential areas of concern within a technology stack and then instrument those areas so that you can monitor them. It sounds great. And in the traditional, mostly static environments of the past, it worked fine.
The problem is that in today’s cloud-native, microservices-fueled, and constantly changing environments, it’s almost impossible to predict what data you’re going to need or what areas may be your source of issues in the future. The complexity and ephemerality of today’s environments make monitoring an imperfect approach to collecting data — and one that requires far too much overhead.
This very realization led to the development of the concept of observability in the first place. At its core, observability flips the monitoring approach on its head.
Rather than trying to figure out what you’ll need to know in advance, observability is all about collecting the external outputs of a system — its events, logs, metrics, and traces — inferring the system’s internal state from those outputs, and then using that understanding to manage the environment.
Fundamentally, it’s about creating a continuous and sustainable data feed from your systems that will allow you to deal with the unknown unknowns. It’s an inherently different approach that helps SREs close their two most significant gaps.
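As an illustration of that output-first approach, here is a minimal sketch using the OpenTelemetry Python SDK; the service name and span attributes are illustrative choices, not anything the standard prescribes:

```python
# A minimal sketch of emitting telemetry with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for this demo; a real deployment would send
# them to a collector or an observability backend instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # illustrative service name

def handle_checkout(order_id: str) -> None:
    # Each request emits a span: a structured, timestamped record of what
    # happened that a backend can later correlate with logs and metrics.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)  # illustrative attribute
        # ... business logic goes here ...

handle_checkout("A-1001")
```

The point is that the application simply emits rich outputs; deciding which questions to ask of them happens later, in the observability backend.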
How Observability Closes Both the Data and Experiential Gaps
The greatest challenge for SREs — particularly in the enterprise context — is that they must grapple with two significant gaps that hinder their ability to fulfill their mission. The first of these is the data gap.
Even in so-called greenfield development efforts, many of the systems an application depends on were deployed long before the industry even conceptualized things like cloud-native, DevOps, or CI/CD. As a result, those systems were never architected to deliver the telemetry necessary for SREs to do their job.
Ensuring reliability, availability, and performance for a complex and continually changing technology stack demands a steady stream of data — something that traditional approaches to monitoring just cannot deliver.
But SREs need the ability to close this data gap to effectively measure performance against their SLOs and track their error budget. Observability is an important mechanism for closing it: by establishing a steady flow of telemetry, SREs get the actionable data they need to do their job.
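As a simple illustration, here is a minimal sketch in Python, with hypothetical request counts, of how that telemetry feeds an SLI measurement and an error-budget burn calculation:

```python
# Hypothetical request counts observed over a compliance window.
TOTAL_REQUESTS = 1_000_000
FAILED_REQUESTS = 700
SLO_TARGET = 0.999                 # 99.9% of requests must succeed

sli = 1 - FAILED_REQUESTS / TOTAL_REQUESTS     # measured success rate
error_budget = 1 - SLO_TARGET                  # allowed failure rate
budget_consumed = (FAILED_REQUESTS / TOTAL_REQUESTS) / error_budget

print(f"SLI: {sli:.4%}")                                # -> SLI: 99.9300%
print(f"Error budget consumed: {budget_consumed:.0%}")  # -> 70%
```

In this sample, the service is still within its SLO, but it has already burned 70% of its budget, which is exactly the kind of early-warning signal the data gap would otherwise hide.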
However, it’s more than just the data gap that needs closing. It’s also the experiential gap.
Traditional monitoring approaches are, by nature, systems-centric. They can trace their roots to a period in which systems were managed in isolation with little regard to their interactions with other systems, and definitely without a clear focus on the total experience of the ultimate customer or user.
A major thrust behind the collective movement toward cloud-native, DevOps, and the entire concept of the SRE is the need to focus on and manage against the totality of the user experience that the collection of applications and their supporting technologies deliver. “Your SLOs will be target values for corresponding service level indicators (SLIs), which are the measurements of the critical parts of the end-user experience,” explained Google (which created the concept of the SRE).
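For instance, an experience-centric latency SLI of the kind Google describes might look like this minimal sketch, in which the threshold, target, and measurements are all illustrative:

```python
# "95% of requests should complete within 300 ms" -- illustrative values.
LATENCY_THRESHOLD_MS = 300
SLO_TARGET = 0.95

observed_latencies_ms = [120, 95, 310, 240, 180, 450, 88, 205, 290, 150]

fast_enough = sum(1 for ms in observed_latencies_ms if ms <= LATENCY_THRESHOLD_MS)
sli = fast_enough / len(observed_latencies_ms)

print(f"Latency SLI: {sli:.0%} (target: {SLO_TARGET:.0%})")
# -> Latency SLI: 80% (target: 95%) -- this sample would breach the SLO
```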
Observability came into existence as part of this movement and is inherently experience-centric as a result. This centricity comes in two forms. First, focusing on telemetry rather than attempting to pre-determine the source of problems provides a way to manage the unknown unknowns.
Second, and perhaps most importantly from an experiential perspective, it enables SREs to go beyond merely identifying a problem: they can use this same telemetry to determine why it happened and address the underlying issue.
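As a hypothetical sketch of that “why” step, consider walking the spans of a single failed request to find where the failure originated; the field names here are illustrative, not any particular backend’s query API:

```python
# Spans from one trace of a failed request, flattened for illustration.
spans = [
    {"service": "frontend", "depth": 0, "status": "OK"},
    {"service": "checkout", "depth": 1, "status": "ERROR"},
    {"service": "payments", "depth": 2, "status": "ERROR"},
]

# Errors typically propagate up the call chain, so the deepest failing span
# points at the service where the problem actually started.
failing = [s for s in spans if s["status"] == "ERROR"]
root_cause = max(failing, key=lambda s: s["depth"])

print(f"Probable root cause: {root_cause['service']}")
# -> Probable root cause: payments
```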
Ultimately, observability becomes the essential tool that enables SREs to manage the delivery of the intended experience.
The Intellyx Take: The Interdependency Between SREs and Observability
To be clear, simply purchasing an observability platform and turning it on will not make systems more reliable. In the end, observability platforms merely provide data. Hopefully, it is highly correlated, contextualized, and actionable data — but it’s just data nonetheless.
It is the job of the SRE to do something with that data. That process must begin with the effective establishment of SLIs, SLOs, and the error budget. If those elements are not established appropriately, no amount of data will improve anything.
But the corollary is also true. Merely establishing rock-solid performance parameters will be meaningless if an SRE cannot effectively manage against them because they lack the necessary data to do so.
This situation is where many SREs — particularly the half that are not presently using any form of observability — find themselves today.
And even among SREs that have begun deploying observability platforms, many are not focusing enough on the correlation and contextualization of the data they are producing.
This fact is why deploying an observability platform is the first step, but not the end game. The ultimate goal is to improve uptime, effectively manage the consumption of the error budget, and reduce the toil on SREs, and that requires a focus on automation. The next article in this series will dig into how SREs can take observability data to the next level by leveraging automation.
Copyright © Intellyx LLC. Moogsoft is an Intellyx client. None of the other companies mentioned in this article are Intellyx clients. Intellyx retains full editorial control over the content of this paper.