Demystifying Availability KPIs — and What Most Companies Miss
Richard Whitehead | November 16, 2022

Most engineering teams are no strangers to key performance indicators (KPIs), those metrics tracking progress toward critical goals and targets. Ideally, tech leaders design KPIs to focus teams on what matters and prove their contribution to the company’s overall performance. Of course, KPI data should also uncover critical information that guides informed decision-making.

For engineering teams tasked with managing the customer experience, KPIs often track availability. But which metrics do teams use to measure their availability? Do these KPIs actually help performance? And, critically, what do companies miss?

To demystify availability KPIs, the Moogsoft team launched its inaugural State of Availability Report. Here are some of its findings, a few of them surprising, about modern-day availability KPIs:

Teams spend a lot on availability — with little result

Teams spend most of their time on monitoring, and organizations invest in an average of 16 monitoring tools (and up to 40). Still, KPIs show that availability outcomes are not where they should be. In fact, 45% of customers notify teams about issues before their tools do. Why aren’t teams and tools stepping in faster to preserve the customer experience? Engineers are likely monitoring too many tools and piecing together insights from mountains of siloed data.

Solution: Companies should assess their monitoring tools to determine what exactly these tools are covering. IT leaders should make sure their monitoring tools provide a complete picture of system health, looking for overlaps, gaps and future optimizations. Additionally, leaders should measure how often customers catch incidents before tools do — and work to reduce that number.
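To make the last recommendation concrete, here is a minimal sketch of how a team might track the share of incidents that customers report before monitoring does. The `Incident` record and the `"tool"` / `"customer"` labels are hypothetical; they stand in for whatever fields your incident tracker actually exposes.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    incident_id: str
    first_reported_by: str  # hypothetical labels: "tool" or "customer"


def customer_first_rate(incidents):
    """Return the fraction of incidents first reported by a customer."""
    if not incidents:
        return 0.0  # no incidents in the period, so nothing was missed
    customer_first = sum(
        1 for incident in incidents if incident.first_reported_by == "customer"
    )
    return customer_first / len(incidents)


# Illustrative data only, not taken from the report.
sample = [
    Incident("INC-1", "tool"),
    Incident("INC-2", "customer"),
    Incident("INC-3", "tool"),
    Incident("INC-4", "customer"),
]
print(f"Customer-first rate: {customer_first_rate(sample):.0%}")
```

Tracked per month or per quarter, this single number shows whether changes to monitoring coverage are actually moving detection ahead of the customer.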

Most teams breach their SLAs

Despite a significant investment in availability, 25% of companies miss their service level agreements (SLAs). Interestingly, teams with higher SLAs, which tend to be teams at larger companies, meet them more often than teams with lower SLAs. One likely reason is that bigger companies tend to employ dedicated IT Operations teams and use platforms and services purpose-built for incident management.

Solution: Because breached SLAs lead to negative customer experiences, poor organizational performance and unsatisfied employees, tech leaders must take immediate, proactive measures to help teams prevent incidents, fix them faster and meet their SLAs. Artificial intelligence for IT Operations (AIOps) solutions can catch incidents before they impact the end user and automate the incident lifecycle for rapid mitigation.

Error budgets are the most popular availability KPI

Error budgets, the time a system can fail without counting against the SLA, are the most tracked availability KPI among small- and medium-sized companies and those enterprises with more aggressive SLAs. While error budgets are somewhat helpful in showing that teams are missing targets, they fail to explain why teams missed them.
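As a rough illustration of that definition, an error budget can be derived from the availability target and the length of the reporting period. The sketch below assumes a simple time-based budget over a 30-day window; the function names and the 30-day default are illustrative choices, not something the report prescribes.

```python
def error_budget_minutes(sla_percent, period_days=30):
    """Minutes of downtime allowed in the period without breaching the SLA."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def remaining_budget_minutes(sla_percent, downtime_minutes, period_days=30):
    """Budget left after observed downtime; a negative value means a breach."""
    return error_budget_minutes(sla_percent, period_days) - downtime_minutes


budget = error_budget_minutes(99.9)
left = remaining_budget_minutes(99.9, downtime_minutes=30)
print(f"Budget: {budget:.1f} min, remaining: {left:.1f} min")
```

A 99.9% target over 30 days allows roughly 43.2 minutes of downtime. The number says how much room is left, but nothing about whether the minutes already spent went to detection or to resolution.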

Solution: Tech leaders should focus availability KPIs on mean time to recover (MTTR) and mean time to discovery (MTTD), which explain the specifics behind missed targets. Then, leadership can set objectives for reducing both metrics.

Fewer KPIs and higher SLAs produce the best outcomes

Clearly, higher SLAs are tougher to meet. But teams with tougher standards meet them more regularly. It’s likely that teams with fewer, more meaningful metrics can focus their time on attaining clear goals and avoid decision fatigue caused by information overload. From a leadership perspective, more precise information can be more easily incorporated into decision-making.

Solution: Tech leaders should narrow the focus of their KPIs, raising overall standards and eliminating less significant metrics.

Teams do not measure 66% of incident downtime

While most teams focus their availability KPIs on MTTR, fewer than 15% measure MTTD. That’s a significant problem. On average, detection takes about an hour, twice the time needed for incident resolution. In other words, most teams simply do not measure about two-thirds (66%) of their incident downtime, which leaves them with inaccurate data about the average incident lifecycle. That inaccurate data can in turn hinder necessary investments in teams and tools, slow long-term availability improvements and hide unplanned work.
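The arithmetic behind that figure is short enough to check by hand. The sketch below assumes the averages quoted above: about 60 minutes to detect and half that, 30 minutes, to resolve. The function name is illustrative.

```python
def unmeasured_share(detection_minutes, resolution_minutes):
    """Share of total incident time spent before detection."""
    total = detection_minutes + resolution_minutes
    if total <= 0:
        raise ValueError("total incident time must be positive")
    return detection_minutes / total


# Detection takes twice as long as resolution, so it is two-thirds of the total.
share = unmeasured_share(detection_minutes=60, resolution_minutes=30)
print(f"Unmeasured share of downtime: {share:.1%}")
```

Any team that reports only the resolution portion is describing a 30-minute incident when the customer experienced one lasting 90 minutes.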

Solution: Tech leaders must reevaluate KPIs, measuring the end-to-end incident lifecycle from detection through resolution. Focusing on MTTD and MTTR will help IT teams get an accurate picture of the incident lifecycle so that they can ultimately improve their availability.
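One way to measure the full lifecycle is to record three timestamps per incident and derive both metrics from them. The sketch below is an assumption about how that could look: the `IncidentTimeline` fields are hypothetical, and MTTR is taken here to mean time from detection to resolution so that the two metrics add up to the end-to-end duration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class IncidentTimeline:
    started: datetime   # when the service impact began
    detected: datetime  # when a tool or a person first noticed it
    resolved: datetime  # when service was restored


def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def lifecycle_kpis(timelines):
    """Compute MTTD, MTTR and end-to-end time; needs at least one incident."""
    mttd = mean_minutes([t.detected - t.started for t in timelines])
    mttr = mean_minutes([t.resolved - t.detected for t in timelines])
    return {
        "mttd_minutes": mttd,
        "mttr_minutes": mttr,
        "end_to_end_minutes": mttd + mttr,
    }


# Illustrative timelines only.
timelines = [
    IncidentTimeline(
        started=datetime(2022, 11, 1, 10, 0),
        detected=datetime(2022, 11, 1, 11, 0),
        resolved=datetime(2022, 11, 1, 11, 30),
    ),
    IncidentTimeline(
        started=datetime(2022, 11, 8, 14, 0),
        detected=datetime(2022, 11, 8, 14, 50),
        resolved=datetime(2022, 11, 8, 15, 30),
    ),
]
print(lifecycle_kpis(timelines))
```

The hard part in practice is the `started` timestamp, which usually has to be reconstructed after the fact from logs or customer reports. Without it, MTTD cannot be measured at all.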

Based on the State of Availability Report, organizations and teams have room to optimize their KPIs to improve availability and, ultimately, the customer experience. After leaders evaluate their current KPIs, they must also evaluate their tools. Are existing tools helping teams meet their KPIs? An AIOps solution can address many of the issues identified by providing early incident detection, automating collaboration for quick incident response and remediation, and preventing destructive patterns from becoming service-impacting incidents.

Interested in digging deeper into availability KPIs? Watch the recently released “Engineering KPIs: How to Align Executive Strategy with Team Flow” with DevOps industry experts who discuss the benefit of fewer metrics, what metrics matter most and how to align goals.

Moogsoft is the AI-driven observability leader that provides intelligent monitoring solutions for smart DevOps. Moogsoft delivers the most advanced cloud-native, self-service platform for software engineers, developers and operators to instantly see everything, know what’s wrong and fix things faster.

About the author


Richard Whitehead

As Moogsoft's Chief Evangelist, Richard brings a keen sense of what is required to build transformational solutions. A former CTO and Technology VP, Richard brought new technologies to market, and was responsible for strategy, partnerships and product research. Richard served on Splunk’s Technology Advisory Board through their Series A, providing product and market guidance. He served on the Advisory Boards of RedSeal and Meriton Networks, was a charter member of the TMF NGOSS architecture committee, chaired a DMTF Working Group, and recently co-chaired the ONUG Monitoring & Observability Working Group. Richard holds three patents, and is considered dangerous with JavaScript.
