Contextualizing Network Outages and the End-User Impact
Sahil Khanna | March 1, 2016

When a massive connectivity issue struck, Moogsoft facilitated recovery by contextualizing alerts and focusing on the real issue.

Moogsoft was recently deployed by one of the USA’s largest healthcare systems and the results were so compelling that I had to write about it.

This customer’s production IT environment is well instrumented, and its monitoring tools generate a high volume of events and alerts in response to incidents. During the initial validation period, Moogsoft was configured to receive all of the customer’s event streams: Microsoft SCOM, SolarWinds, VMware, Citrix XenApp and Nexthink.

The incident that could have been avoided

Within the first days of deploying Moogsoft, this customer experienced an IT incident in which 96 end-user compute platforms (desktops/laptops) suffered severe intermittent network and web connectivity issues for over 2 hours.

Incident.MOOG, which was monitoring behind the scenes, notified this customer that it found a single Situation (a cluster of correlated alerts) containing exactly 150 alerts. The alerts consisted primarily of SolarWinds ‘critical’ alerts as well as Nexthink ‘major’ alerts.

You can see a range of alerts clustered in this Situation, indicating slow network response times, excessively high network device temperatures, failures to read hardware sensors, poor web connectivity, and a variety of other related problems. Just looking at this single list of clustered alerts makes the incident easy to understand.
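
As a rough illustration of what clustering correlated alerts can look like, here is a minimal Python sketch that groups alerts from different tools by how close together they fire. It is not Moogsoft's algorithm, which this post does not describe. The `Alert` type, the `cluster_by_time` function, the five-minute gap and the sample data are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    source: str          # monitoring tool that raised the alert
    severity: str        # e.g. "critical" or "major"
    description: str
    timestamp: datetime


def cluster_by_time(alerts, max_gap=timedelta(minutes=5)):
    """Group alerts so that neighbours in a cluster are at most max_gap apart.

    Hypothetical helper: a simple time-proximity heuristic, not a real
    correlation engine.
    """
    clusters = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if clusters and alert.timestamp - clusters[-1][-1].timestamp <= max_gap:
            clusters[-1].append(alert)   # close enough to the previous alert
        else:
            clusters.append([alert])     # start a new cluster
    return clusters


# Made-up sample data; the date and descriptions are illustrative only.
start = datetime(2016, 1, 1, 14, 49)
sample = [
    Alert("SolarWinds", "critical", "device temperature high", start),
    Alert("Nexthink", "major", "web connection failures", start + timedelta(minutes=3)),
    Alert("SolarWinds", "critical", "slow network response", start + timedelta(minutes=6)),
    Alert("Nexthink", "major", "unrelated later alert", start + timedelta(hours=3)),
]

clusters = cluster_by_time(sample)
for cluster in clusters:
    sources = sorted({a.source for a in cluster})
    print(len(cluster), "alert(s) from", sources)
```

Even this naive grouping puts the SolarWinds and Nexthink alerts that fire minutes apart into one cluster, which is the cross-tool view the individual monitoring consoles do not provide on their own.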

IT Events can tell a story

Below is Moogsoft’s timeline view of the Situation that was detected. This is an incredibly powerful tool for visualizing how an incident unfolds in real time, showing the sequence and distribution of the related alerts. The x-axis represents time and the y-axis represents the unique alerts that were fired from SolarWinds and Nexthink.

On the right-hand side, you can see the latest occurrence of these network devices overheating during core business hours at this particular operational location (2:49-4:49 PM CST). The extended yellow line going downward represents the impacted end-user compute devices.

Moogsoft allows you to drill down into the individual occurrences of the alerts and their underlying events. In the screenshot below, you can see the red dots depicting the network devices going critical for 30 minutes, clearing briefly (green dots), and then going critical again.
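
The critical, clear, critical pattern described above can be picked out of a sequence of alert states with a few lines of Python. This is only a sketch under stated assumptions: the state names `"critical"` and `"clear"` and the `count_recurrences` function are hypothetical and do not come from Moogsoft.

```python
def count_recurrences(states):
    """Count how often a device returns to 'critical' after having cleared.

    Hypothetical helper: `states` is a time-ordered list of strings,
    each either "critical" or "clear".
    """
    recurrences = 0
    seen_critical = False
    previous = None
    for state in states:
        # A recurrence is a clear -> critical transition that follows
        # at least one earlier critical state.
        if state == "critical" and previous == "clear" and seen_critical:
            recurrences += 1
        if state == "critical":
            seen_critical = True
        previous = state
    return recurrences


# Made-up sequence mirroring the pattern in the timeline:
# critical for a while, a brief clear, then critical again.
timeline = ["critical", "critical", "critical", "clear", "critical", "critical"]
print(count_recurrences(timeline))
```

A device that keeps returning to critical after clearing points to a persistent underlying cause rather than a one-off spike.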

As you scroll down through the timeline view, you begin to see the individual desktop/laptop devices that were directly impacted by these overheating network bridges, experiencing high latency and web connection failure rates as high as 75%.

So what does this all mean?

Although this customer had a comprehensive set of monitoring tools in place to alert on network and infrastructure events when thresholds were exceeded or anomalies were observed, the teams failed to contextualize it all and identify the root cause quickly, resulting in more than 2 hours of negative end-user impact.

With Moogsoft, the finger-pointing was eliminated, as the network team was able to immediately see that a group of old devices had reached end of life and were now running hot, beyond allowable environmental conditions.

Multiple IT Ops teams were able to see all of the alert clustering immediately, which had been done automatically across multiple technology silos, quickly revealing the source of this detrimental incident. If this customer’s Ops teams had been using Moogsoft in production on this day, they could have avoided the more than 2 hours of end-user impact. Now that’s a prescription for preventive healthcare!

Such stories demonstrating Moogsoft’s immediate value to IT Ops and DevOps teams are quite common, so stay tuned as we write about some more.

Moogsoft is a pioneer and leading provider of AIOps solutions that help IT teams work faster and smarter. With patented AI analyzing billions of events daily across the world’s most complex IT environments, the Moogsoft AIOps Platform helps the world’s top enterprises avoid outages, automate service assurance, and accelerate digital transformation initiatives.

About the author

Sahil Khanna

Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.
