In our research for the inaugural State of Availability Report, we asked 1,900 engineers about mean time to detect (MTTD) and mean time to recovery (MTTR) as two leading incident management Key Performance Indicators (KPIs) strongly associated with availability.
We learned that fewer than 15% of respondents are tracking their MTTD, and that it takes twice as long to discover an issue as it does to resolve it. Furthermore, 80% of respondents said that they aren’t tracking their MTTR (the average incident lifecycle is ninety minutes and most respondents are missing their SLAs). That adds up to an hour lost on detection alone every time there’s an incident. Gartner estimates the cost of unplanned downtime to be around $5,600 per minute, which is over $300K per hour!
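The arithmetic behind these figures can be checked with a quick sketch. It assumes the ninety-minute lifecycle consists only of detection and resolution, that detection takes exactly twice as long as resolution, and that the whole incident counts as downtime; the variable names are illustrative and not taken from the report.

```python
# Assumptions: the 90-minute lifecycle is detection + resolution only,
# and detection takes twice as long as resolution (figures quoted above).
LIFECYCLE_MINUTES = 90
DETECT_TO_RESOLVE_RATIO = 2
COST_PER_MINUTE_USD = 5_600  # Gartner estimate quoted above

# Split the lifecycle into one part resolution and two parts detection.
resolve_minutes = LIFECYCLE_MINUTES / (DETECT_TO_RESOLVE_RATIO + 1)
detect_minutes = LIFECYCLE_MINUTES - resolve_minutes

# Convert the per-minute estimate to an hourly figure.
cost_per_hour_usd = COST_PER_MINUTE_USD * 60

print(f"Detect: {detect_minutes:.0f} min, resolve: {resolve_minutes:.0f} min")
print(f"Downtime cost per hour: ${cost_per_hour_usd:,}")
```

Under these assumptions, detection accounts for 60 of the 90 minutes, and the hourly cost works out to $336,000.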
That’s a lot of unplanned work that’s not visible. Peter Drucker reputedly said, “If you can’t measure it, you can’t improve it.”
Our data suggest that tracking a smaller number of KPIs correlates with higher availability and better Mean Time to Resolve (MTTR). Adding Mean Time to Detect (MTTD), however, lets teams see the end-to-end incident lifecycle and prioritize ways to shorten it.
There are complex ways to calculate MTTx, but we know from this research that limiting the number of KPIs has a direct relationship with achieving higher levels of availability. We therefore recommend that teams focus on MTTD and MTTR, the most meaningful KPIs and the easiest to measure. Reducing the time spent dealing with an incident frees up time to improve platforms and services and to reduce the volume of incidents going forward.
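As a minimal sketch of how simple these two KPIs can be to calculate, the example below averages detection and resolution times over a list of incident records. The `Incident` class, its field names, and the sample timestamps are hypothetical, and MTTR is measured here from detection to resolution so that MTTD plus MTTR covers the whole lifecycle; your team may define the start and end points differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    started_at: datetime   # when the problem began
    detected_at: datetime  # when the team discovered it
    resolved_at: datetime  # when the issue was resolved


def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mttd_minutes(incidents):
    """Mean time from the start of an incident to its detection."""
    return mean_minutes([i.detected_at - i.started_at for i in incidents])


def mttr_minutes(incidents):
    """Mean time from detection to resolution (an assumed definition)."""
    return mean_minutes([i.resolved_at - i.detected_at for i in incidents])


# Made-up sample data for illustration only.
t0 = datetime(2021, 1, 1, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=60), t0 + timedelta(minutes=90)),
    Incident(t0, t0 + timedelta(minutes=40), t0 + timedelta(minutes=70)),
]

print(mttd_minutes(incidents))  # 50.0
print(mttr_minutes(incidents))  # 30.0
```

In practice the timestamps would come from your incident management or monitoring tooling rather than being entered by hand, which is what keeps the KPIs available without manual calculation.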
You can find a table of other “mean time” incident metrics in the glossary below. Here are our top six tips for improving your MTTx KPIs:
- Make sure you understand the definitions of all the KPIs available to you—and that understanding is shared across your team and organization
- Understand how the available KPIs align with business goals (short and long term)
- Pick a small number of KPIs and focus hard on them. Ensure they are instrumented so teams don’t spend time looking for them, calculating them, and reporting on them; they need to be available at least daily
- Use KPIs actively to identify and measure improvement opportunities that result in more time being made available for teams to invest long-term in customer experience
- Look for instrumentation and tools that do more than just monitor and alert—look for tools that provide insights that are hard for a human to find on their own
- Accept that tools need constant maintenance: they need to be correctly configured, and tweaked as conditions around them change. There is an overhead with most tools (and/or find a tool that monitors the monitoring, i.e., AIOps)
| Acronym | Short for | Definition |
| --- | --- | --- |
| MTBF | Mean Time Between Failures | Measures the ability of a system or component to perform its required functions under stated conditions for a set amount of time; the elapsed time between system failures during everyday operations. |
| MTTA | Mean Time To Acknowledge | The average time it takes from when an alert is triggered to when work begins on the issue. |
| MTTD | Mean Time To Detect (Discover) | The time between the onset of an incident and its discovery. Or, the time spent discovering the cause of an incident, prior to starting to implement the repair. |
| MTTF | Mean Time To Failure | The average amount of time a defective system can continue running before it fails. Time starts when a serious defect in a system occurs, and it ends when the system completely fails. MTTF is used to monitor the status of non-repairable system components and analyze how long a component will perform in the field before it fails. |
| MTTR | Mean Time To Recover (Restore) | The time spent getting an application back into production following a performance issue or downtime incident. This includes the full time of the outage, from the time the system or product fails to the time that it becomes fully operational again. |
| MTTR | Mean Time To Repair | The average time it takes to repair a system including both the repair time and any testing time. |
| MTTR | Mean Time To Resolve (Resolution) | Mean time to resolution addresses the time required to fix a problem and to implement subsequent “cleanups” or proactive steps designed to keep the problem from recurring. Teams should address both of these tasks before they can declare an issue resolved. |
| MTTR | Mean Time To Respond | The average time it takes to recover from a product or system failure from the time of the first alert. This doesn’t include any lag time in the alert system. |
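
Because four different metrics share the MTTR acronym, it can help to spell out which two events each one spans. The sketch below does this for a single, made-up incident timeline; the event names and the mapping from each metric to a start and end event are one possible reading of the definitions above, not a standard, and averaging each interval across many incidents gives the corresponding "mean time".

```python
# Hypothetical timeline for one incident, in minutes after the failure began.
timeline = {
    "failed": 0,         # the system or product fails
    "alerted": 20,       # first alert fires (treated here as detection)
    "acknowledged": 25,  # work begins on the issue
    "repaired": 50,      # fix applied and tested
    "restored": 55,      # service fully operational again
    "resolved": 120,     # cleanup and preventive steps finished
}

# Each metric is the gap between two timeline events (assumed mapping).
intervals = {
    "time to detect": ("failed", "alerted"),
    "time to acknowledge": ("alerted", "acknowledged"),
    "time to respond": ("alerted", "restored"),
    "time to repair": ("acknowledged", "repaired"),
    "time to recover": ("failed", "restored"),
    "time to resolve": ("failed", "resolved"),
}

durations = {
    name: timeline[end] - timeline[start]
    for name, (start, end) in intervals.items()
}

for name, minutes in durations.items():
    print(f"{name}: {minutes} min")
```

With this timeline, the same incident yields an "MTTR" contribution of 25, 35, 55, or 120 minutes depending on which definition is in use, which is why agreeing on definitions across the team comes first.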
For more data, perspectives from industry leaders, and actionable insights, check out the full State of Availability Report!
Happy reading!
Moogsoft is the AI-driven observability leader that provides intelligent monitoring solutions for smart DevOps. Moogsoft delivers the most advanced cloud-native, self-service platform for software engineers, developers and operators to instantly see everything, know what’s wrong and fix things faster.