Major Incidents may appear to happen instantly, but the vast majority have a beginning, a middle and an end. During a P-1 Incident, most operators face an alert storm containing hundreds or even thousands of events. This is why your alert inbox currently has 345,343 unread emails. Yep, you know what I’m talking about.
Understanding the full narrative behind a Major Incident requires two important things:
- Complete event coverage of all your applications, services and infrastructure
- Event Correlation across all of these event sources
Without these two things, you’ll be putting together a rather large jigsaw puzzle that has missing pieces and broken edges. The bad news is that you don’t have hours to put the puzzle together; you have seconds.
Moogsoft is here to help.
Incident Lifecycle Visualized through the Situation Timeline
In order to make incident troubleshooting as easy as possible, Moogsoft now offers a new way of looking at events and alerts. Moogsoft has developed a timeline visualization to show exactly how an incident has unfolded; it’s an easy format that everyone can understand.
Instead of looking at a tabular log of a thousand alerts, we show you exactly what happened first and the cascade effect that followed.
In the timeline, the x-axis indicates time and the y-axis indicates unique alerts.
For those unfamiliar with the concept of a Situation (unique to Moogsoft): Moogsoft uses machine learning to identify relationships across alerts from across your IT production stack and then clusters them into a group that captures the full narrative of an incident, beginning to end. These clusters are called Situations.
Each timeline represents an individual Situation and the alerts that comprise it.
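To make the axes concrete, here is a minimal Python sketch of how raw events might be collapsed into the rows of such a timeline: one row per unique alert (the y-axis), each spanning its first and last occurrence (the x-axis). The `Event` and `TimelineRow` types, their field names, and the choice to treat a (source, alert type) pair as a "unique alert" are all assumptions made for illustration; this is not Moogsoft's data model or implementation.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    """One raw event. Field names are hypothetical."""
    timestamp: datetime
    source: str       # e.g. the host or service that emitted the event
    alert_type: str   # e.g. "queue_depth_high"


@dataclass
class TimelineRow:
    """One unique alert: a y-axis row spanning first_seen..last_seen."""
    source: str
    alert_type: str
    first_seen: datetime
    last_seen: datetime
    count: int


def build_timeline(events):
    """Collapse raw events into unique alerts, ordered by first occurrence."""
    rows = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        # Assumption: (source, alert_type) identifies a unique alert.
        key = (ev.source, ev.alert_type)
        row = rows.get(key)
        if row is None:
            rows[key] = TimelineRow(
                ev.source, ev.alert_type, ev.timestamp, ev.timestamp, 1
            )
        else:
            row.last_seen = ev.timestamp
            row.count += 1
    # Dicts preserve insertion order, so rows are already sorted by first_seen.
    return list(rows.values())
```

Because the rows come back ordered by first occurrence, the top row is whatever happened first, and everything below it is the cascade that followed.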
Let’s look at a real example from one of our customers, represented by the timeline above and broken down below. The Incident involved an application failure caused by a Message Queue filling up. This Situation occurred over a period of one hour and consisted of 187 unique alerts.
The evolution breaks roughly into four major stages, with a fifth representing hypothetical operator activity.
- Stage 1 is the Message Queue filling up. These alerts were ignored in the customer’s Netcool console because they were considered insignificant and not actionable. Additionally, for a significant amount of time, these alerts disappeared from the Netcool view because they toggled up and down across the alert threshold.
- Stage 2 is likely where a Moogsoft user would address the Situation, as the scope increased from just one Alert type to multiple. Moogsoft works in real time, so the appropriate stakeholders would be notified immediately. Without Moogsoft, you would be looking at these as separate alerts and wouldn’t have the proper context to identify the Incident.
- In case you missed stage 2, stage 3 is another great opportunity to avert disaster, because the timeline shows a further increase in scope from collateral alerts. Again, without clustering you have no chance to avert disaster, and you’d be searching for the “root-cause” or the impending service disruption.
- Stage 4 is the point where things hit the fan; the Situation rapidly explodes to 187 alerts. The customer’s old Netcool console turned into a sea of red (i.e. useless chaos), while Moogsoft saw the alerts as all related and presented them as just ONE Situation.
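The widening scope described in the stages above, from a single alert type to several, can be illustrated with a short Python sketch that records each moment a new alert type joins an already-clustered group of alerts. The input shape (a list of `(timestamp, alert_type)` tuples), the function names, and the `min_types` threshold are assumptions for illustration only; the source does not describe how Moogsoft detects or notifies on scope changes.

```python
from datetime import datetime


def scope_changes(alerts):
    """Return (timestamp, distinct_type_count) each time a new alert type appears.

    `alerts` is an iterable of (timestamp, alert_type) tuples (assumed shape).
    """
    seen = set()
    changes = []
    for ts, alert_type in sorted(alerts, key=lambda a: a[0]):
        if alert_type not in seen:
            seen.add(alert_type)
            changes.append((ts, len(seen)))
    return changes


def first_escalation(alerts, min_types=2):
    """Timestamp at which the group first spans `min_types` alert types, else None."""
    for ts, count in scope_changes(alerts):
        if count >= min_types:
            return ts
    return None
```

With the default `min_types=2`, `first_escalation` returns the moment a second alert type joins the first, which corresponds to the kind of scope increase described for stage 2.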
If you’re looking at this as a collection of alerts spread all across your Event List – perhaps ordered by severity or event type – you wouldn’t have noticed this Incident early on. You would likely have waited until after stage 4 when there is an explosion of alerts, at which point your customers have already been impacted for a significant amount of time.
With Moogsoft’s Situation Timeline, it’s now easy for anyone to understand how an Incident has unfolded, what the root-cause was and what the symptoms were. The timeline visualizes full situational awareness and gives you the ability to avert disaster at the first possible chance.
See the new Situation Timeline in Incident.MOOG, software release 4.1.15 or later. For more information, give us a holler at firstname.lastname@example.org, and we’ll make sure your stories around IT Operational Analytics all have happy endings.
To see our Situation Timeline in action, check out this short video demo…
About the author Sahil Khanna
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.