Over the last several decades, an abundance of approaches to root cause analysis and event correlation have been developed and built into products. Unfortunately for modern infrastructures, these approaches and techniques have more or less failed. Why? Largely because most of the methods come from an era when gas prices averaged $1.85 a gallon.
The last widely accepted publication on the principles of fault localization is a 2004 survey by Steinder and Sethi, which laid out a model of three main, overlapping approaches to root cause analysis: Artificial Intelligence, Model Traversing, and Fault Propagation techniques.

M.L. Steinder and A.S. Sethi. A survey of fault localization techniques in computer networks. Science of Computer Programming, 53(2):165-194, 2004.

The 3 Main Approaches to Root Cause Analysis

Or, more importantly, why each of these techniques is inadequate in our new, virtualized, container-driven, open-source world:

AI Techniques

Set aside the fact that the AI techniques Steinder and Sethi describe are really rules-based systems in disguise (expert systems), and that the survey doesn’t touch on alternative methods available to us today, like machine learning. Most of these techniques are linear models, unable to deal with unseen symptoms or inaccurate information. This means that they deteriorate rapidly as configurations change.

So in a large scale environment, using this method to detect faults is like self-diagnosing using WebMD. If you’re tired but show no other detectable symptoms, you must be healthy. Heaven forbid you have a cough, as that is obviously terminal.
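To make the failure mode concrete, here is a minimal sketch of a rules-based diagnoser. The rule table, symptom names, and the diagnose function are all made up for illustration; they are not taken from Steinder and Sethi or from any product. The point is only that a symptom nobody wrote a rule for produces no answer at all.

```python
# Hypothetical rule table: each rule maps a set of required symptoms
# to a single diagnosis. None of these names come from a real system.
RULES = [
    ({"link_down", "packet_loss"}, "failed switch port"),
    ({"high_latency", "cpu_saturated"}, "overloaded host"),
]


def diagnose(symptoms):
    """Return the diagnosis of the first rule whose symptoms are all present."""
    observed = set(symptoms)
    for required, diagnosis in RULES:
        if required <= observed:  # every required symptom was observed
            return diagnosis
    return None  # no rule matched, so the system has nothing to say


known = diagnose(["link_down", "packet_loss"])
unseen = diagnose(["container_oom_killed"])  # a symptom no rule anticipates
partial = diagnose(["link_down"])            # incomplete information
```

Here known is "failed switch port", while unseen and partial both come back as None. Every new technology in the stack means someone has to write and maintain more rules just to stay level.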

Model Traversing Techniques

Any approach that uses relationships between systems to model RCA is doomed to depend on those predefined models. Not only do they run into huge problems when your systems change, but they also can’t spot recurring events. Since systems never change and history never repeats itself, you have nothing to worry about, right?
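As a rough illustration, model traversing can be sketched as a walk over a dependency map. The DEPENDS_ON map, the service names, and the root_causes function below are hypothetical, and the sketch assumes the map has no cycles. It follows failing dependencies downward until it reaches a failing node with no failing dependency of its own.

```python
# Hypothetical, hand-maintained dependency model (assumed to be acyclic).
DEPENDS_ON = {
    "web": ["api"],
    "api": ["db", "cache"],
    "db": [],
    "cache": [],
}


def root_causes(node, failing, model=DEPENDS_ON):
    """Walk dependency edges from a failing node down to the failing
    nodes that have no failing dependency of their own."""
    failing_deps = [dep for dep in model.get(node, []) if dep in failing]
    if not failing_deps:
        return {node}
    causes = set()
    for dep in failing_deps:
        causes |= root_causes(dep, failing, model)
    return causes


# The model is up to date: the walk lands on the database.
accurate = root_causes("web", {"web", "api", "db"})

# The api service now also depends on a queue, but nobody updated the model.
stale = root_causes("web", {"web", "api", "queue"})
```

With an accurate model the walk returns {"db"}. Once the real system drifts from the model, the same walk stops at {"api"} and never considers the queue, because as far as the model knows the queue does not exist.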

Fault Propagation Models

This avenue forces you to start from what you already know and move backwards. For example, if one thing is failing, there is a 40% chance that its cause was some pre-existing alert. Yes, failure probabilities are incredibly useful for root cause analysis, but forgive me for being a skeptic when your insights into possible failures depend on something already being on fire.
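The backwards reasoning here is essentially Bayes' rule. The sketch below is a simplified stand-in for a fault propagation model, not the formulation from the survey: the candidate causes, priors, and likelihoods are invented numbers, chosen so that one candidate lands at roughly the 40% used in the example above.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule: P(cause | symptom) is proportional to
    P(cause) * P(symptom | cause), normalized over all candidate causes."""
    joint = {cause: priors[cause] * likelihoods[cause] for cause in priors}
    total = sum(joint.values())
    return {cause: value / total for cause, value in joint.items()}


# Hypothetical prior probability of each candidate cause.
priors = {"disk_full": 0.2, "bad_deploy": 0.3, "network_partition": 0.5}

# Hypothetical probability of seeing the current symptom given each cause.
likelihoods = {"disk_full": 0.5, "bad_deploy": 0.3, "network_partition": 0.12}

ranked = posterior(priors, likelihoods)
most_likely = max(ranked, key=ranked.get)
```

Note what has to exist before this runs: an observed symptom, a fixed list of candidate causes, and probabilities someone estimated from past incidents. A cause that is missing from the list gets a probability of zero, no matter what is actually burning.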

While there are benefits to each one of these techniques – and Steinder and Sethi even suggest using a mix of these for improved accuracy – none are adequate for root-cause-analysis in the modern world. Anything rules-based, which all the above techniques ultimately are, becomes a bottleneck as systems change – which they do constantly.

Sole reliance on rules should be thrown out and forgotten, simply because it limits systems to solving problems they have already seen. Although 35% of incidents are repeat occurrences, a massive 65% are brand new. Solving for novel faults with systems that rely on models of past experience is like taking a detour to work for the rest of your life, just because your usual road is under construction this week.

Machine Learning and a New Model of Uncovering Root Cause Analysis

This is exactly what Phil Tee, CEO of Moogsoft, and Adam Frank, Alarm & Event Manager at Royal Bank of Canada, presented at the Incident Resolution Summit in Austin, Texas, this last Tuesday. Recording found here.

They introduced a revolutionary model consisting of supervised and unsupervised machine learning to detect patterns in event storms from across the entire production stack. The grouped and correlated alerts, coined as Situations, become the backbone of contextual awareness for support teams as they battle event overload and system outages.
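To give a feel for what grouping alerts means in practice, here is a deliberately naive sketch. It is not Moogsoft's algorithm and involves no machine learning: it simply groups alerts that arrive close together in time and share most of their words. The group_alerts function, the 60-second window, the 0.5 similarity threshold, and the sample alerts are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Share of tokens two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_alerts(alerts, window=60, min_similarity=0.5):
    """Greedy single-pass grouping of (timestamp_seconds, message) alerts.

    An alert joins the first group that was active within `window` seconds
    and whose vocabulary overlaps by at least `min_similarity`.
    """
    groups = []
    for ts, msg in sorted(alerts):
        tokens = set(msg.lower().split())
        for group in groups:
            recent = ts - group["last_ts"] <= window
            if recent and jaccard(tokens, group["tokens"]) >= min_similarity:
                group["alerts"].append((ts, msg))
                group["tokens"] |= tokens
                group["last_ts"] = ts
                break
        else:
            # No existing group fits, so this alert starts a new one.
            groups.append({"last_ts": ts, "tokens": tokens, "alerts": [(ts, msg)]})
    return [group["alerts"] for group in groups]


# Hypothetical alert stream.
alerts = [
    (0, "db01 disk latency high"),
    (10, "db01 disk latency critical"),
    (20, "web03 http 500 errors"),
    (500, "db01 disk latency high"),
]
situations = group_alerts(alerts)
```

The four alerts collapse into three groups: the two db01 alerts ten seconds apart land together, the web03 alert stands alone, and the late db01 alert starts a new group because it falls outside the time window. Nothing in the sketch depends on a predefined rule or topology model, which is the property that matters for novel faults.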

Supported by a virtual war room and a knowledge base where resolution steps are clearly marked, the Situational Awareness Approach addresses all of the previously mentioned limitations of root cause analysis. For Royal Bank of Canada, the number of actionable events dropped from 3,500 to 1,500 within 2 months of deploying this new system in place of their legacy manager.

Furthermore, the system was able to cluster the 1,500 alerts into 150 Situations – dramatically reducing operational workload so teams can resolve incidents before they reach end-users.

Ready to trial this new technology for yourself? Start now.

Look out for Phil Tee’s new academic paper explaining the state of the art in root cause analysis algorithms, which is set to be published early next year. Of course, we will update this post with more content as it becomes available.