Moogsoft helped this organization reduce alert storming by 99%, directly increasing L1 operator productivity by 10x.
This week I had the opportunity to speak with a Systems Manager from one of our customers, a leading global insurance provider. This company had an interesting story to tell about their incident management journey and how they got to where they are today.
It was shocking to hear that just 3 years ago, this organization didn’t really have a monitoring platform or any tools in place. This was because much of their environment had been outsourced as a cost-saving measure. The turning point occurred when new executives with service management backgrounds came on board and, in response to a few major incidents, allocated budget towards monitoring.
This organization had two major business units that eventually went off in different directions. Each unit operated separately and had defined its own tooling standards (problem #1). They began investigating tools for end-to-end monitoring, from the customer journey all the way to the application and infrastructure state. After evaluating many solutions, each unit chose its own collection of tools, including AppDynamics, Splunk, Pingdom, and BMC EUM, which were selectively deployed across the two business units.
IT Monitoring is Easier Said Than Done
What’s really interesting is that most of the alerts fired off by these tools had been deliberately disabled because operators were overwhelmed by the sheer volume (problem #2). Alerts were killing level 1 operator productivity. Furthermore, level 1 operators didn’t have any access to network or infrastructure alerts because both of these stacks were managed externally by service providers (problem #3).
While their monitoring approach was obviously flawed, it took a single event for them to realize that something needed to change. Two operators were sitting together in the same room. One was focused on a problem within AppDynamics and the application, saying that they had lost connectivity to a service running in one of their data centers. The other was working on what seemed to be a separate network issue: they had lost one of the firewalls in the data center. It took over 30 minutes for them to realize that the two issues were in fact related. It turned out that the firewall outage had blocked application server connectivity in the data center.
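The link those two operators missed can be illustrated with a short sketch. This is only a toy example, not a description of how Moogsoft or any other product correlates events: it assumes every alert carries a location tag and simply groups alerts from the same location that arrive within a few minutes of each other. The `Alert` fields, the `correlate` function, the five-minute window, and the sample alerts (including their timestamps and the "network monitor" source) are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    source: str          # monitoring tool that raised the alert
    location: str        # hypothetical tag: the data center the alert refers to
    message: str
    timestamp: datetime


def correlate(alerts, window=timedelta(minutes=5)):
    """Group alerts that share a location and arrive within `window`
    of the previous alert for that location."""
    groups = []
    latest_group = {}  # location -> most recent group for that location
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = latest_group.get(alert.location)
        if group and alert.timestamp - group[-1].timestamp <= window:
            group.append(alert)      # same place, close in time: same incident
        else:
            group = [alert]          # otherwise start a new candidate incident
            groups.append(group)
            latest_group[alert.location] = group
    return groups


# Invented sample data loosely modeled on the story above.
alerts = [
    Alert("network monitor", "dc-1", "firewall down", datetime(2024, 1, 1, 9, 0)),
    Alert("AppDynamics", "dc-1", "lost connectivity to service", datetime(2024, 1, 1, 9, 2)),
    Alert("Pingdom", "dc-2", "slow response", datetime(2024, 1, 1, 9, 3)),
]

incidents = correlate(alerts)
for incident in incidents:
    print([f"{a.source}: {a.message}" for a in incident])
```

In this sketch the firewall alert and the AppDynamics alert land in the same group because they share a location and occur two minutes apart, while the unrelated alert from the other data center stays separate. Real correlation engines rely on far richer signals than a single location tag, so treat this purely as an illustration of the idea.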
The Need for a Single Pane of Glass
It became clear that there was an urgent need for an additional tool that could work across these separate domains and give operators a holistic view of their applications, network and infrastructure.
Already being a BMC customer, this organization naturally planned to implement BPPM as the integration layer for all of their monitoring tools. After seeing a massive implementation bill from BMC, however, along with incredibly complex maintenance requirements, they decided to look elsewhere.
They knew that they needed a solution that provided immediate value. For example, the Systems Manager I spoke with felt that AppDynamics delivered 75% of its full value out of the box. After evaluating several tools, this organization chose Moogsoft because of a few core capabilities:
- Event reduction
- Anomaly detection
- Event correlation
- Collaborative remediation
And of course, Moogsoft’s minimal configuration and maintenance requirements mattered too, since heavy requirements in those areas were what had put them off BMC.
With Moogsoft in place, this organization was able to turn on ALL event streams and allow level 1 operators to scale without increasing the number of operators to manually analyze alerts. On top of providing a bigger picture by correlating all event streams, Moogsoft was able to reduce alert storming by 99%, which directly increased level 1 operator productivity by 10x! And instead of using email to notify each other of incidents, they were now able to leverage Moogsoft to automatically notify the right people at the right time to get involved in each incident.
It’s great to see how the right selection of tools can enable an organization to transform its incident management approach. Conversely, it’s shocking to see how much the wrong selection can inhibit operations productivity, and the impact that can have on the business.
I hope that this story resonates with those organizations that have been through this journey themselves, while providing insight to those who are in the process.
About the author
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.