In our inaugural State of Availability Report, we discovered that not only do metrics matter but the way we use them also does. Our research found that teams with fewer KPIs were more likely to meet their Service Level Agreements (SLAs) and provide their customers with higher levels of availability.
The problem with having too many KPIs is that they cause information overload and noise. That means the teams accountable for fixing problems and incidents are overwhelmed with data and then suffer from decision fatigue. As Kai Wang, Divisional CIO at Silicon Valley Bank, puts it:
“It’s not the tools that help drive culture, it’s the data. Data don’t lie. But if you’re not careful, you can be drowning in data and won’t be able to see the needle in the haystack.”
Decision fatigue can result in poor choices as individuals make mental shortcuts in their decision analysis.
The right metrics are the ones that show teams how they are performing and where they can reduce unplanned work. The purpose of lowering unplanned work is twofold: to increase the capacity to invest in the platforms, making them more stable, sustainable, and scalable, and to free up time for teams to create more value for their customers and improve customer experience.
SLAs are a given—and Service Level Objectives and Service Level Indicators along with error budgets help manage teams’ performance within the parameters agreed with partners and customers. And teams with the highest levels of SLAs (five nines) have the fewest KPIs—and are most likely to be meeting their SLAs. These teams also rely on error budgets much more than teams working with lower SLAs.
SLAs have been well established since the beginning of information technology, but the popularity today of SLOs and SLIs is associated with the market adoption of Site Reliability Engineering (SRE), a set of practices initially developed by Google. In the context of availability, a certain amount of downtime is typically permitted across a defined period, and that downtime usually accumulates as the sum of several problems or incidents. These four KPIs work together as follows:
- SLA: A formal agreement that describes how much downtime the customer will tolerate before there are—usually financial—repercussions
- SLO: A stricter internal target (less permitted downtime than the SLA) that the team chooses to create a buffer so they don’t breach the SLA
- SLI: A continually tracked metric that indicates whether the SLO is at risk of being broken
- Error budget: The amount of unavailability the SLO allows. Suppose a payment service has an SLA of 98%; the SLO must then be higher. With an SLO of 99% availability, the error budget is 1%. In a 28-day window, that 1% is 0.28 days, or about 6.7 hours, of downtime (3.65 days would be 1% of a full year). If, after 15 days, the SLI is 99.5%, then you’re meeting your SLO and are within your error budget (EB). If the SLI dips below 99%, then you’ve used up all of your EB and are no longer meeting the SLO.
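The error budget arithmetic can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation: the function names (`error_budget`, `availability`, `budget_remaining`) are hypothetical, and it assumes availability is measured purely as uptime divided by elapsed time.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Downtime permitted by an availability SLO (e.g. 0.99) over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


def availability(elapsed: timedelta, downtime: timedelta) -> float:
    """The SLI: fraction of the elapsed time the service was available."""
    return 1.0 - downtime / elapsed


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Error budget left in the window; a negative value means the SLO is breached."""
    return error_budget(slo, window) - downtime


# The payment-service example: 99% SLO over a 28-day window.
window = timedelta(days=28)
budget = error_budget(0.99, window)            # roughly 6.72 hours

# After 15 days with an SLI of 99.5%, downtime so far is 0.5% of 15 days.
downtime_so_far = timedelta(days=15) * 0.005   # 1.8 hours
sli = availability(timedelta(days=15), downtime_so_far)
left = budget_remaining(0.99, window, downtime_so_far)
```

Tracking the remaining budget as a duration, rather than only the SLI percentage, makes it easier to see how much more downtime the team can absorb before the SLO is at risk.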
Error budgets are undoubtedly useful for ensuring that services are up and running as people expect and for prioritizing problems and incidents against other work. They tell us something about a team’s performance, but very little about where improvements can be made.
When there is a problem or an incident, activity falls into broadly one of two types:
- Discovering the incident and its cause
- Resolving the issue and repairing and recovering the system
Enter the MTTXs: a slew of metrics relating to the Mean Time to do something that’ll get the service back up and running. The full report has a more detailed analysis, but our research has shown that there are two mean-time metrics that really matter:
- Mean Time to Detect (MTTD)
- Mean Time to Recovery (MTTR)
We found that very few teams measure MTTD today, and yet detection accounts for 66% of their MTTR. It is also a metric that can be reduced quite easily with AIOps, which takes the data streaming from the monitoring tools and reduces the noise using correlation and pattern-detection techniques to quickly pinpoint the root cause of an issue.
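For teams that are not yet measuring MTTD, both metrics can be derived from three timestamps per incident. The sketch below is illustrative only: the `Incident` record and its field names are hypothetical, the sample values are invented, and it assumes MTTR is measured from the start of the incident to recovery, so that detection time is a component of it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    started_at: datetime    # when the fault actually began
    detected_at: datetime   # when the team became aware of it
    resolved_at: datetime   # when service was restored


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    if not deltas:
        raise ValueError("need at least one incident")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean Time to Detect: average of (detected - started)."""
    return mean_delta([i.detected_at - i.started_at for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time to Recovery: average of (resolved - started)."""
    return mean_delta([i.resolved_at - i.started_at for i in incidents])


def detection_share(incidents: list[Incident]) -> float:
    """Fraction of MTTR that is spent detecting the incident."""
    return mttd(incidents) / mttr(incidents)


# Invented sample data, for illustration only.
t0 = datetime(2024, 1, 1, 9, 0)
SAMPLE = [
    Incident(t0, t0 + timedelta(minutes=40), t0 + timedelta(minutes=60)),
    Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=30)),
]
share = detection_share(SAMPLE)  # MTTD of 30 min over MTTR of 45 min
```

Reporting the detection share alongside MTTR shows how much of the recovery time could be reclaimed by faster detection alone.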
This is a significant chunk of time that teams can claim back from unplanned work. That’s time that can be used to pay down technical debt, automate toil, and experiment with chaos engineering—things that directly improve the underlying stability of the system and improve future performance so availability doesn’t exist on such a knife-edge. It’s time that teams can invest in innovative platform improvements or new features that enhance customer experience. Available systems with rich customer experience—with the right business models—lead to high-performing organizations. So, when you’re choosing your availability metrics:
- Only pick a few—and make sure they include error budgets (to keep you on track) and MTTD (to help you claim time back)
- Make sure your teams aren’t drowning in data. We also found that teams have too many monitoring tools and spend too much time monitoring, so use AIOps to reduce their cognitive load
- Measure work distribution—not just unplanned work but also what’s invested in paying down technical debt and automating toil and what the consequences of this investment are