A Closer Look at Root Cause Analysis
moogsoft | July 28, 2014

Operations is a multi-user, multi-system, multi-technology environment. Naturally, single root causes are a thing of the past.

Historically, event management tools focus on determining the root cause of an incident. That’s a problem…

Traditional root cause analysis is becoming less and less reliable. At least three things undermine the idea of a single root cause in today’s environment: mobility, software-defined networks, and the “single-user” mindset.

First, consider mobility. When things break, virtual machines can vMotion to other servers, other networks, and other topologies. Even network changes occur transparently.

For example, an MPLS fast reroute typically completes within about 50 milliseconds. When it does, the characteristics of the network may change. Latency may change. Routing may change. And the behavior of the applications communicating over the network may change too, for instance through a different TCP window size.

Next, software-defined networks (SDNs) and network functions virtualization (NFV) add to the mix. They make it possible to virtualize networks and stand them up dynamically, on demand, to support business objectives.

Because of this, event correlation has become nebulous and ineffective. Many correlation techniques apply only to static infrastructures. They assume that the infrastructure is set in stone, and it is not. What happens when portions of your applications are migrated to the cloud?
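To make the static-infrastructure problem concrete, here is a minimal sketch of a correlation rule that relies on a topology map recorded once and never refreshed. The host and VM names, the `STATIC_TOPOLOGY` map, and the `static_root_cause` function are all hypothetical and exist only for illustration; they do not describe any particular product.

```python
# Hypothetical static topology: where each VM lived when the rule was written.
STATIC_TOPOLOGY = {"vm-app-1": "host-a", "vm-db-1": "host-a"}


def static_root_cause(alert_vm, failed_hosts, topology=STATIC_TOPOLOGY):
    """Blame a failed host only if the topology map places the VM on it."""
    host = topology.get(alert_vm)
    return host if host in failed_hosts else None


# The VM has since migrated to host-b, but the static map was never updated.
live_placement = {"vm-app-1": "host-b", "vm-db-1": "host-a"}

# host-b fails and vm-app-1 raises an alert.
stale = static_root_cause("vm-app-1", {"host-b"})
fresh = static_root_cause("vm-app-1", {"host-b"}, topology=live_placement)

print(stale)  # the stale map finds no cause
print(fresh)  # the live placement points at host-b
```

The rule itself is unchanged between the two calls. Only the assumption about where the workload lives differs, and that assumption is exactly what mobility invalidates.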

And there’s a third factor. The real misconception is that, in the legacy management system, the root cause is still handed off to a single user.

The reality is this: Operations is a multi-user, multi-system, multi-technology environment. One root cause may be pertinent to more than one user, more than one system, and more than one application. Nothing happens in a vacuum. There is cause and effect in every nook and cranny.

With the adaptive nature of today’s infrastructures, you need management technology that is adaptive as well. Incident.MOOG uses machine learning and clustering techniques to apply adaptive intelligence to your incident processes, and it directs each situation to the people affected by the alerts within it.

Incident.MOOG doesn’t identify “a” root cause. Rather, it presents the Situation in context, creates a Situation Room (i.e., a virtual war room), and invites the appropriate stakeholders to the room. Those stakeholders can then quickly work out whether they are the causal party or an impacted or collateral party to the Situation, and react accordingly. The result is earlier and more efficient problem resolution.
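The idea of grouping related alerts into a situation and inviting everyone it touches can be sketched as follows. This is not Incident.MOOG's algorithm: a simple time-gap grouping stands in for the machine learning and clustering it uses, and the `Alert` fields, the `max_gap` threshold, and the team names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    source: str       # element that raised the alert
    team: str         # team that owns the element (hypothetical field)


def group_into_situations(alerts, max_gap=60.0):
    """Group alerts whose timestamps fall within max_gap of the previous one."""
    situations = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if situations and alert.timestamp - situations[-1][-1].timestamp <= max_gap:
            situations[-1].append(alert)
        else:
            situations.append([alert])
    return situations


def stakeholders(situation):
    """Every team with at least one alert in the situation gets invited."""
    return sorted({alert.team for alert in situation})


alerts = [
    Alert(0.0, "router-1", "network"),
    Alert(20.0, "vm-app-1", "application"),
    Alert(45.0, "db-1", "database"),
    Alert(600.0, "router-7", "network"),
]

situations = group_into_situations(alerts)
invites = [stakeholders(s) for s in situations]
print(invites)
```

Note that the sketch never names a single culprit. The first situation invites the network, application, and database teams together, and they decide among themselves who is causal and who is collateral.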

Moogsoft is a pioneer and leading provider of AIOps solutions that help IT teams work faster and smarter. With patented AI analyzing billions of events daily across the world’s most complex IT environments, the Moogsoft AIOps Platform helps the world’s top enterprises avoid outages, automate service assurance, and accelerate digital transformation initiatives.