In this post, I’ll share their story in the hope that it provides direction and clarity for those of you facing similar challenges.
“At Least 70% of Incidents were Detected by our Customers”
For this organization, as for most modern organizations, service quality and customer experience are top priorities. To gain visibility across their IT production stack and address incidents, they were using tools like SCOM, Splunk, Solarwinds, Cacti and various homegrown solutions that generated email notifications for their operations teams to review. With roughly 15 people across the NOC, Systems Operations, Infrastructure, and Applications teams, managing incidents proactively was a serious challenge.
On a typical day, their operations teams looked at thousands of email alerts, which represented only about 40% of the total alert volume being generated. The rest was turned off to avoid further alert storms. From these email alerts, 300 to 400 tickets were created each week for the NOC team to manage, and 70% of those tickets were closed without any action taken. Furthermore, whenever a P1/P2 incident occurred, all-hands bridge calls were held. (Sound familiar?)
Because of the sheer volume of noise and the lack of context, it was taking the operations teams roughly two hours to detect an incident and two hours to resolve it. They were operating reactively: over 70% of incidents were detected by customers first.
According to the NOC Manager, “Because there was such a high volume of alerts, we could only look at critical alerts when things were breaking. The ‘Lows’ and ‘Mediums’ that could be leading to problems would always be missed. It was like firefighting.”
“Because of SCOM’s server-level focus,” he continued, “it was very difficult to determine whether a larger part of the environment was being affected as a whole, since we were just concentrating on alerts coming in from one server.”
After years of “firefighting,” they decided to evaluate Moogsoft.
Modernizing IT Operations with Moogsoft
“We put Incident.MOOG into the wild and treated it as if it was in production, alongside SCOM,” said the Director of Technology, speaking about their Moogsoft evaluation. They sent all event sources into Incident.MOOG, and the solution was calibrated for optimal event correlation.
At first, the operations teams were looking at both SCOM and Incident.MOOG. Today, they rely 100% on Incident.MOOG. Everything from SCOM, along with the rest of their toolset, feeds into Incident.MOOG. Furthermore, Incident.MOOG now interfaces directly with their ticketing system.
According to the NOC Manager, “Today, we are using the same tools, but the way in which we are using them has completely changed. We have turned on all alerts and are sending everything to Incident.MOOG for full visibility.”
Each day, Incident.MOOG ingests thousands of alerts and reduces and correlates them into ~80 actionable Situations (over 90% reduction!).
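To make "reduces and correlates" a little more concrete, here is a minimal sketch of alert grouping. It is not Moogsoft's algorithm, which this post does not describe. It is a deliberately naive stand-in that groups alerts sharing a hypothetical `service` field when they arrive within a hypothetical five-minute window of each other. The `Alert` class, field names, window size and sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point (hypothetical field)
    service: str      # affected service name (hypothetical field)
    message: str


def correlate(alerts, window_seconds=300.0):
    """Naive grouping: alerts for the same service join the current group
    if they arrive within window_seconds of that group's latest alert."""
    groups = []
    latest_group = {}  # service -> most recently opened group for it
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = latest_group.get(alert.service)
        if group is not None and alert.timestamp - group[-1].timestamp <= window_seconds:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            latest_group[alert.service] = group
    return groups


def reduction_percent(alert_count, group_count):
    """Percentage by which grouping shrinks the number of items to review."""
    return 100.0 * (1 - group_count / alert_count)


# Made-up sample data, purely for illustration.
sample = [
    Alert(0, "checkout", "db latency high"),
    Alert(30, "checkout", "web 5xx spike"),
    Alert(45, "search", "disk nearly full"),
    Alert(90, "checkout", "queue backlog"),
    Alert(1200, "checkout", "db latency high"),
]
groups = correlate(sample)
print(len(sample), "alerts grouped into", len(groups), "groups")
print(f"reduction: {reduction_percent(len(sample), len(groups)):.0f}%")
```

By the same arithmetic, ending up with about 80 Situations at a reduction of more than 90% implies more than 800 incoming alerts per day, which is consistent with the "thousands" quoted above.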
On top of that, the Director of Technology shared that, with Incident.MOOG, they have experienced:
- 30% reduction in customer-identified incidents
- 75% reduction in Mean-Time-To-Detect incidents
- 25% reduction in Mean-Time-To-Resolve incidents
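As a back-of-the-envelope check, the percentages above can be applied to the roughly two-hour detection and resolution times quoted earlier. The post does not report absolute post-deployment figures, so the values computed below are implied estimates under that assumption, not measured results. The function name is hypothetical.

```python
def after_reduction(baseline, reduction_percent):
    """Value remaining after cutting baseline by reduction_percent."""
    return baseline * (1 - reduction_percent / 100)


# Assumption: "roughly two hours" is treated as exactly 120 minutes.
baseline_minutes = 120

implied_mttd = after_reduction(baseline_minutes, 75)  # time to detect
implied_mttr = after_reduction(baseline_minutes, 25)  # time to resolve

print(f"implied time to detect:  about {implied_mttd:.0f} minutes")
print(f"implied time to resolve: about {implied_mttr:.0f} minutes")
```

On those assumptions, detection would drop from about two hours to about 30 minutes, and resolution from about two hours to about 90 minutes.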
“I Bought Moogsoft so that I can Sleep Better at Night”
Moogsoft has allowed this customer to move from reactive to proactive incident management. All of their alerting is now centralized and automatically reduced and correlated, making operators more productive and more effective at ensuring high service quality. When asked why he purchased Moogsoft, the Director of Technology explained, “I bought Moogsoft so that I can sleep better at night.”
About the author Sahil Khanna
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.