When a massive connectivity issue struck, Moogsoft facilitated recovery by contextualizing alerts and focusing on the real issue.
Moogsoft was recently deployed by one of the USA’s largest healthcare systems and the results were so compelling that I had to write about it.
This customer is well instrumented in terms of production IT monitoring and was already receiving a high volume of events and alerts in response to incidents. During the initial validation period, MOOG was configured to receive all of their event streams, namely from Microsoft SCOM, SolarWinds, VMware, Citrix XenApp and Nexthink.
The incident that could have been avoided
Within the first days of deploying Moogsoft, this customer experienced an IT incident in which 96 end-user compute platforms (desktops/laptops) suffered severe intermittent network and web connectivity issues for over 2 hours.
Incident.MOOG, which was monitoring behind the scenes, notified this customer that it found a single Situation (a cluster of correlated alerts) containing exactly 150 alerts. The alerts consisted primarily of SolarWinds ‘critical’ alerts as well as Nexthink ‘major’ alerts.
You can see a range of alerts clustered in this Situation indicating slow network response times, excessively high network device temperatures, failure to read hardware sensors, poor web connection issues, and a variety of other related alerts. Just looking at this single list of clustered alerts makes the incident quite clear to understand.
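To make the idea of a Situation a little more concrete, here is a minimal sketch in Python of one very simple way to cluster alerts: grouping those that arrive close together in time. This is only an illustration and is not Moogsoft's algorithm, which this post does not describe. The `Alert` fields, the `cluster_by_time` function, the 5-minute gap and the sample alerts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str       # monitoring tool that raised the alert (hypothetical field)
    severity: str     # e.g. "critical" or "major"
    description: str
    timestamp: float  # seconds since an arbitrary starting point


def cluster_by_time(alerts, max_gap_seconds=300.0):
    """Group alerts so that consecutive alerts in a cluster are at most
    max_gap_seconds apart. A toy stand-in for real correlation logic."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        # Extend the current cluster if this alert follows the previous one closely.
        if clusters and alert.timestamp - clusters[-1][-1].timestamp <= max_gap_seconds:
            clusters[-1].append(alert)
        else:
            clusters.append([alert])
    return clusters


# Made-up sample data, loosely modeled on the alert types mentioned above.
sample_alerts = [
    Alert("SolarWinds", "critical", "network device temperature too high", 0.0),
    Alert("SolarWinds", "critical", "slow network response time", 60.0),
    Alert("Nexthink", "major", "web connection failures on laptop", 120.0),
    Alert("SolarWinds", "critical", "unrelated alert two hours later", 7200.0),
]

situations = cluster_by_time(sample_alerts)
for number, situation in enumerate(situations, start=1):
    sources = sorted({a.source for a in situation})
    print(f"Situation {number}: {len(situation)} alert(s) from {', '.join(sources)}")
```

Even this naive grouping puts the three closely spaced alerts from two different tools into one cluster and leaves the much later alert on its own. A production system would presumably weigh far more than timestamps when deciding what belongs together.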
IT Events can tell a story
Below is Moogsoft’s timeline view of the Situation that was detected. This is an incredibly powerful tool for visualizing how an incident unfolds in real time, showing the sequence and distribution of the related alerts. The x-axis represents time and the y-axis represents the unique alerts that were fired from SolarWinds and Nexthink.
On the right-hand side, you can see the latest occurrence of these network devices overheating during core business hours at this particular operational location (2:49-4:49 PM CST). The extended yellow line going downward represents the impacted end-user compute devices.
Moogsoft allows you to drill down into the individual occurrences of the alerts and their underlying events. In the screenshot below, you can see the red dots depicting the network devices going critical for 30 minutes, clearing briefly (green dots), and then going critical again.
As you scroll down through the timeline view, you begin to see the individual desktop/laptop devices that were directly impacted by these overheating network bridges, experiencing high latency and web connection failure rates as high as 75%.
So what does this all mean?
Although this customer had a comprehensive set of monitoring tools in place to raise network and infrastructure alerts whenever thresholds were exceeded or anomalies were observed, they were unable to put those alerts in context and identify the root cause quickly, resulting in more than 2 hours of negative end-user impact.
With Moogsoft, the finger-pointing was eliminated, as the network team was able to immediately see that a group of old devices had reached end of life and were now running hot, beyond allowable environmental conditions.
Multiple IT Ops teams were able to see all of the alert clustering immediately, which had been done automatically across multiple technology silos, quickly revealing the source of this detrimental incident. If this customer’s Ops teams had been using Moogsoft in production on this day, they could have avoided the more than 2 hours of end-user impact. Now that’s a prescription for preventive healthcare!
Such stories demonstrating Moogsoft’s immediate value to IT Ops and DevOps teams are quite common, so stay tuned as we write about some more.
About the author
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.