The ultimate goals of IT Ops and DevOps are to enable rapid delivery of new products and features while maintaining high service quality and availability. In order to achieve these goals, IT Ops and DevOps teams are aggressively measured against some core KPIs. Team structure and processes vary across organizations, yet the KPIs tend to be ubiquitous.
To measure the success of deployments, some common KPIs include:
- Deployment Frequency
- Incident Frequency
- Customer Ticket Volume
And when a failure does occur, common KPIs include:
- Mean-Time-To-Detect (MTTD) or Mean-Time-To-Identify (MTTI)
- Mean-Time-To-Bridge (MTTB)
- Mean-Time-To-Resolve or Restore (MTTR)
In my interactions with IT Ops and DevOps teams at large enterprises over the last several months, I’ve noticed the adoption of a new KPI that sits between MTTD and MTTR, and typically overlaps with MTTB. It’s called MTTC.
Mean-Time-To-Convince (MTTC): A Measure of Finger Pointing
MTTC is a measurement of how long it takes to convince a particular team or teams that they are responsible for an IT incident. While allocating incidents to the appropriate teams sounds trivial, the complexity of modern infrastructures combined with legacy tools and processes actually makes it quite challenging, which explains why IT Ops and DevOps professionals are now being evaluated against this metric.
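To make the definition concrete, here is a minimal sketch of how MTTC could be computed alongside MTTR from incident records. The field names (`detected_at`, `ownership_accepted_at`, `resolved_at`) and the sample timestamps are hypothetical, and measuring both intervals from the moment of detection is an assumption; your ticketing system may define the start and end points differently.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; field names and timestamps are illustrative only.
incidents = [
    {
        "detected_at": datetime(2016, 3, 1, 9, 0),
        "ownership_accepted_at": datetime(2016, 3, 1, 9, 45),
        "resolved_at": datetime(2016, 3, 1, 10, 30),
    },
    {
        "detected_at": datetime(2016, 3, 2, 14, 0),
        "ownership_accepted_at": datetime(2016, 3, 2, 14, 15),
        "resolved_at": datetime(2016, 3, 2, 15, 0),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average interval, in minutes, between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# Assumption: MTTC runs from detection until a team accepts ownership,
# and MTTR runs from detection until the incident is resolved.
mttc = mean_minutes(incidents, "detected_at", "ownership_accepted_at")
mttr = mean_minutes(incidents, "detected_at", "resolved_at")

print(f"MTTC: {mttc:.0f} min, MTTR: {mttr:.0f} min")
```

Capturing the moment a team accepts ownership as its own timestamp is the key design choice here: without it, the time spent convincing is hidden inside the overall resolution time.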
Detecting incidents in a timely fashion is a major challenge. A heavily quoted Forrester statistic reports that 74% of incidents are detected by customers before ops, and from my experience speaking with large organizations, that statistic is fairly accurate. But once an incident is detected, what’s next? Create a ticket that gets passed around from team to team, each responding with “that’s not us”? Get all teams on a bridge call and start off by blaming network operations? Surely these sound familiar if you have a background in IT ops at a large organization.
In his recap blog of VoiceCon Orlando 2009, Terry Slattery of NetCraftsmen wrote that he discovered the term MTTC and heard it described as “the time that it takes the network team to convince the server, apps, or security team that the network is not at fault for some problem that the other team sees.” Slattery also reported that up to 60% of MTTR is due to MTTC.
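To put the 60% figure in perspective, the short sketch below works through the arithmetic. The function name and the two-hour MTTR are hypothetical values chosen for illustration, not figures from Slattery's report.

```python
def mttc_share(mttc_minutes, mttr_minutes):
    """Fraction of the total resolution time that was spent convincing."""
    if mttr_minutes <= 0:
        raise ValueError("MTTR must be positive")
    return mttc_minutes / mttr_minutes


# Hypothetical example: a two-hour MTTR at the reported upper bound of 60%.
mttr_minutes = 120
mttc_minutes = 0.60 * mttr_minutes           # minutes spent convincing
fixing_minutes = mttr_minutes - mttc_minutes  # minutes left for everything else

print(f"{mttc_minutes:.0f} min convincing, {fixing_minutes:.0f} min for everything else")
print(f"MTTC share of MTTR: {mttc_share(mttc_minutes, mttr_minutes):.0%}")
```

At that upper bound, more of the outage is spent deciding who owns the problem than diagnosing and fixing it.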
In the years that have passed since then, a lot has changed, and this issue appears to have intensified. MTTC is now a formally measured KPI at many organizations that have FY16-17 initiatives to reduce it. On an exploratory call just last week with a frustrated IBM Netcool customer, I heard MTTC described as “the amount of time it takes the Network Operations team to manually pull together all the Alerts from Netcool to Convince the applications team that they own/need to fix the Situation at hand.”
Let Correlated Alerts do the Convincing for You
If you are approaching modern volumes of IT telemetry with rules and filters to reduce noise and correlate events/alerts, you are likely missing key pieces of information and are unaware of complex relationships between events/alerts from disparate systems. You likely have a very high MTTC.
By leveraging machine learning, you can let data-driven models do the heavy lifting of reducing alert noise and discovering anomalous features and complex relationships in real time. With correlated groups of alerts that narrate the life cycle of an incident, the nature of any particular incident is immediately visible, along with the teams that need to be involved. In other words, the alerts presented in this fashion will do the convincing for you.
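As a rough illustration of what a correlated group of alerts can show, here is a minimal sketch that clusters alerts purely by time proximity and then lists the teams whose systems appear in each cluster. This is a deliberately simple heuristic, not machine learning and not how any particular product works; the `Alert` fields, the five-minute gap, and the sample alerts are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    # Hypothetical alert shape; real alerts carry many more attributes.
    timestamp: datetime
    source: str
    team: str
    message: str


def group_by_time_gap(alerts, max_gap=timedelta(minutes=5)):
    """Put alerts in the same group when each follows the previous within max_gap."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= max_gap:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


def teams_involved(group):
    """Teams appearing in a group, in order of first appearance."""
    seen = []
    for alert in group:
        if alert.team not in seen:
            seen.append(alert.team)
    return seen


alerts = [
    Alert(datetime(2016, 3, 1, 9, 0), "core-switch-1", "network", "link flapping"),
    Alert(datetime(2016, 3, 1, 9, 2), "app-server-7", "apps", "connection timeouts"),
    Alert(datetime(2016, 3, 1, 9, 4), "app-server-7", "apps", "error rate high"),
    Alert(datetime(2016, 3, 1, 13, 30), "db-2", "database", "disk nearly full"),
]

groups = group_by_time_gap(alerts)
for number, group in enumerate(groups, start=1):
    print(f"Group {number}: {len(group)} alerts, teams: {teams_involved(group)}")
```

Because each group is ordered by time, the first alert hints at where the incident started, and the list of teams shows who needs to be on the call. A production approach would correlate on far more than timestamps.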
Will clusters of alerts be 100% accurate? No… Machine learning is not an exact science. However, with tools like Moogsoft AIOps that apply a collection of machine learning algorithms to understand features within the attributes of alerts, the accuracy is quite impressive and the productivity enhancements are astounding.
If you are struggling to detect incidents, convince the appropriate teams to take ownership, and finally fix those incidents, you need to investigate a modern, algorithmic approach to managing your IT telemetry.
About the author Sahil Khanna
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.