This blog post defines SRE by explaining SLOs and error budgets, highlighting the innovation vs. reliability tradeoff.
The most striking difference between modern enterprise software development and the practices of the past is the growing focus on deployment velocity.
Where a monthly (or slower) release cadence was once considered routine, today’s enterprises find that the fast pace of innovation is pushing a growing share of their software initiatives toward daily or even hourly releases.
Such velocity requires a rethink of every aspect of the software development lifecycle – in particular, operations. Operators must maintain operational priorities centering on availability and reliability in the context of rapid deployment cadences across the IT landscape.
Operators must therefore weigh the tradeoffs between reliability and availability on the one hand and deployment velocity on the other – while simultaneously managing costs.
Measurable best practices for managing such tradeoffs are at the heart of site reliability engineering (SRE) and of its focus on error budgets, which quantify those tradeoffs.
Defining Service-Level Objectives (SLOs)
Increasing deployment velocity requires that IT leadership shift its attention toward a business expectation model that enables the business to define targets that match IT delivery models to business outcomes.
To put this expectation model into practice, organizations need a formal way to measure how well a service meets users’ expectations along any particular dimension of reliability: what we call the service-level indicator, or SLI. The SLI is the proportion of valid events that were good, expressed as a percentage.
In this context, ‘good’ can refer to availability, latency, the freshness of the information provided to users at the user interface, or other key performance metrics that are important to the business. For example, an SLI might state that 99.9% of valid requests for the page index.html were successful (returned a 200 ‘OK’ HTTP code).
Each SLI tracks one dimension of reliability that an organization wants to observe and measure for a given user journey. Once the ops team has specified the SLIs important to the business, it must make the appropriate decisions about measurement and validity, essentially classifying which events are valid and which of those are ‘good.’
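The definition above, an SLI as the proportion of valid events that were good, can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the `Request` record, the `is_valid` and `is_good` rules, and the `/index.html` path format are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one HTTP request (not from any real library)."""
    path: str
    status: int


def is_valid(req: Request) -> bool:
    # Assumed validity rule: only requests for index.html count toward this SLI.
    return req.path == "/index.html"


def is_good(req: Request) -> bool:
    # Assumed success criterion: the request returned HTTP 200 'OK'.
    return req.status == 200


def sli_percent(requests: Iterable[Request]) -> float:
    """Return the share of valid requests that were good, as a percentage."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        raise ValueError("no valid events in the measurement window")
    good = sum(1 for r in valid if is_good(r))
    return 100.0 * good / len(valid)


if __name__ == "__main__":
    sample = [
        Request("/index.html", 200),
        Request("/index.html", 200),
        Request("/index.html", 200),
        Request("/index.html", 500),  # valid but not good
        Request("/about.html", 500),  # not valid for this SLI, so ignored
    ]
    print(f"SLI: {sli_percent(sample):.1f}%")
```

Keeping the validity rule separate from the success criterion makes both classification decisions explicit, so each can be reviewed with the business on its own.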
The service-level objective (SLO) for a system, in turn, is a precise numerical target for any such dimension, such as the availability of the system. To define your SLO, start with your SLIs. Make sure each one has a defined event and success criterion. The SLO then specifies the target proportion of SLI events that must be good over a given time window.
For example, your SLO might state that 99.9% of the valid requests for the page index.html over the last 30 days returned a 200 ‘OK’ code in 150 milliseconds or less.
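As a rough sketch, the example SLO above could be checked as follows. The `Slo` class, its field names, and the `(status, latency_ms)` event shape are hypothetical. The code also assumes the caller has already filtered the events down to valid index.html requests from the last 30 days. `Fraction` is used so that a result of exactly 99.9% compares cleanly against the target.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO definition mirroring the index.html example."""
    target_percent: Fraction  # e.g. Fraction("99.9")
    max_latency_ms: float     # e.g. 150
    window_days: int          # e.g. 30; documents the window the events come from


def is_good(status: int, latency_ms: float, slo: Slo) -> bool:
    # Good means both a 200 'OK' response and a latency within the threshold.
    return status == 200 and latency_ms <= slo.max_latency_ms


def slo_met(events: Iterable[Tuple[int, float]], slo: Slo) -> bool:
    """Return True if the share of good events reaches the SLO target.

    `events` holds (status, latency_ms) pairs for valid requests only,
    already restricted to the SLO window by the caller (an assumption).
    """
    events = list(events)
    if not events:
        raise ValueError("no valid events in the SLO window")
    good = sum(1 for status, latency in events if is_good(status, latency, slo))
    achieved_percent = Fraction(good * 100, len(events))
    return achieved_percent >= slo.target_percent


if __name__ == "__main__":
    slo = Slo(target_percent=Fraction("99.9"), max_latency_ms=150, window_days=30)
    # 999 fast successes and one slow success out of 1,000 valid requests.
    events = [(200, 100.0)] * 999 + [(200, 151.0)]
    print("SLO met:", slo_met(events, slo))
```

Note that the slow request counts against the SLO even though it returned a 200 code, because the success criterion combines status and latency.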
What to Do with your SLOs
Once the SLOs are defined, the SRE team should frame all discussions about whether a particular system is running reliably enough in terms of whether the system is meeting its SLOs.
First, it’s always important to frame this discussion in terms of cost. After all, the more reliable the service, the more expensive it is to operate. Weighing reliability against that cost, define the lowest level of reliability for each service that the business can accept, and set the service’s SLO at that level.
Second, real-time measurement of SLOs, including all the parameters that they are based on, is critically important to a successful SRE practice. As the adage says, you can’t manage what you can’t measure, and given the velocity of modern software development, such measurement must be both comprehensive and real-time.
Based upon the SLO and observability-based measurements, therefore, the ops team and its stakeholders can make fact-based judgments about whether to increase a service’s reliability (and hence, its cost), or lower its reliability and cost in order to increase the speed of development of the applications leveraging the service.
While site reliability is a good thing, this focus on SLOs helps keep your team from making services overly reliable, a critical mistake that can increase costs and impede development.
Focus on the Error Budget
Once the team accepts that perfect reliability is both unattainable and undesirable, the question becomes how far short of perfect reliability it should aim. We call this quantity the error budget.
The error budget represents the number of allowable errors in a given time window that results from an SLO target of less than 100%. In other words, this budget represents the total number of errors a particular service can accumulate over time before users become dissatisfied with the service.
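To make the arithmetic concrete, here is a small sketch of how an error budget follows from an SLO target. The function names are illustrative, and the time-based version assumes an availability SLO measured in minutes of uptime, which the text above does not specify.

```python
import math
from fractions import Fraction


def budget_fraction(slo_percent: str) -> Fraction:
    """Share of events allowed to fail, e.g. "99.9" gives 1/1000."""
    return 1 - Fraction(slo_percent) / 100


def error_budget_events(slo_percent: str, valid_events: int) -> int:
    """Allowable bad events in the window, rounded down to a whole event."""
    return math.floor(budget_fraction(slo_percent) * valid_events)


def error_budget_minutes(slo_percent: str, window_days: int) -> float:
    """Allowable minutes of unavailability (assumes a time-based SLI)."""
    window_minutes = window_days * 24 * 60
    return float(budget_fraction(slo_percent) * window_minutes)


if __name__ == "__main__":
    # A 99.9% SLO over 1,000,000 valid requests allows 1,000 bad requests.
    print(error_budget_events("99.9", 1_000_000))
    # The same SLO over a 30-day window allows 43.2 minutes of downtime.
    print(error_budget_minutes("99.9", 30))
```

The SLO is passed as a string so that `Fraction` can represent 99.9 exactly instead of inheriting floating-point rounding.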
No matter how reliable the infrastructure, there is always a risk that something will go wrong. Error budgets are an effective approach to quantifying and managing this risk.
The reason the error budget is so important is that it represents the tension between the fast pace of development (and hence innovation) vs. the need for service reliability. In essence, the organization can ‘spend’ the error budget, but should never exceed it.
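One way to picture ‘spending’ the budget is a simple release gate. This sketch is an assumption, not a policy the article prescribes: `remaining_budget`, `can_release`, and the `reserve` parameter are all hypothetical names, and the rule of shipping only while some budget remains is just one possible choice.

```python
from fractions import Fraction


def remaining_budget(slo_percent: str, valid_events: int, bad_events: int) -> Fraction:
    """Fraction of the error budget still unspent (negative if exceeded)."""
    allowed = (1 - Fraction(slo_percent) / 100) * valid_events
    if allowed == 0:
        raise ValueError("no error budget: SLO is 100% or there are no valid events")
    return 1 - Fraction(bad_events) / allowed


def can_release(
    slo_percent: str,
    valid_events: int,
    bad_events: int,
    reserve: Fraction = Fraction(0),
) -> bool:
    # Hypothetical policy: allow a risky release only while more than
    # `reserve` of the budget is left, so the budget is never exceeded.
    return remaining_budget(slo_percent, valid_events, bad_events) > reserve


if __name__ == "__main__":
    # 99.9% SLO, 1,000,000 valid requests, 400 failures: 60% of budget left.
    print(float(remaining_budget("99.9", 1_000_000, 400)))
    print(can_release("99.9", 1_000_000, 400))
    # 1,200 failures overspend the 1,000-error budget, so releases pause.
    print(can_release("99.9", 1_000_000, 1200))
```

A team that wants a safety margin could pass a nonzero `reserve`, for example `Fraction(1, 10)`, to stop releases once less than 10% of the budget is left.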
The Intellyx Take
Enterprise technology has always been a matter of tradeoffs. As organizations move toward cloud-native architectures and DevOps-driven development cultures, these tradeoffs become an important part of day-to-day operations.
SLOs and their corresponding error budgets give ops teams a quantifiable approach for balancing the reliability vs. innovation tradeoff.
This quantifiability, in turn, depends upon observability. Stay tuned for the next installment of this blog series for a deeper explanation of how observability helps the SRE team.
Copyright © Intellyx LLC. Moogsoft is an Intellyx customer. Intellyx retains final editorial control of this article.
About the author
Jason Bloomberg is a leading IT industry analyst, author, keynote speaker, and globally recognized expert on multiple disruptive trends in enterprise technology and digital transformation. He is founder and president of Digital Transformation analyst firm Intellyx. He is ranked among the top nine low-code analysts on the Influencer50 Low-Code50 Study for 2019, #5 on Onalytica’s list of top Digital Transformation influencers for 2018, and #15 on Jax’s list of top DevOps influencers for 2017. Mr. Bloomberg is the author or coauthor of five books, including Low-Code for Dummies, published in October 2019.