Traditional approaches to root cause analysis are inadequate in the modern, virtualized, container-driven, open-source world.
Over the last several decades, an abundance of approaches to root-cause analysis and event correlation has been developed and built into products. Unfortunately for modern infrastructures, these approaches and techniques have more or less failed. Why? It largely has to do with the fact that most of the methods come from an era when gas prices averaged $1.85/gallon.
The last widely accepted publication on principles in fault localization techniques was circulated in 2004 and introduced a model by Steinder and Sethi, including 3 main overlapping approaches to root-cause-analysis: Artificial Intelligence, Model Traversing, and Fault Propagation techniques.
M.L Steinder and A.S Sethi, A survey of fault localization techniques in computer networks. Science of Computer Programming, 53(2):165-194, 2004.
The 3 Main Approaches to Root Cause Analysis
Or more importantly, why each of these techniques is inadequate in our new, virtualized, container-driven, open-source world:
Artificial Intelligence Techniques
Set aside the fact that the AI techniques Steinder and Sethi describe are really rules-based systems in disguise (expert systems), and that the survey doesn’t touch on alternatives available to us today, like machine learning. Most are still linear models, unable to deal with unseen symptoms or inaccurate information. This means that they deteriorate rapidly as configurations change.
So in a large-scale environment, using this method to detect faults is like self-diagnosing using WebMD. If you’re tired but show no other detectable symptoms, you must be healthy. Heaven forbid you have a cough, as that is obviously terminal.
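To make the unseen-symptom problem concrete, here is a minimal sketch of a rules-based diagnoser. The rule table, symptom names, and the diagnose function are all hypothetical, invented for illustration and not taken from Steinder and Sethi or any product.

```python
# Hypothetical rule table: each rule maps a set of required symptoms
# to a diagnosis. Names are made up for illustration.
RULES = [
    ({"high_latency", "packet_loss"}, "network congestion"),
    ({"disk_full"}, "log volume exhausted"),
]

def diagnose(symptoms):
    """Return the diagnosis of every rule whose symptoms are all present."""
    observed = set(symptoms)
    return [cause for required, cause in RULES if required <= observed]

# A known pattern matches a rule.
known = diagnose(["high_latency", "packet_loss"])

# A symptom nobody wrote a rule for yields nothing at all,
# which is easy to misread as "healthy".
unseen = diagnose(["container_oom_killed"])
```

The second call comes back empty: the system has no opinion about anything outside its rule table, and every configuration change widens that blind spot.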
Model Traversing Techniques
Any approach that uses relationships between systems to model RCA is doomed to depend on those predefined models. Not only do they run into huge problems when your systems change, but they also can’t spot recurring events. Since systems never change and history never repeats itself, you have nothing to worry about, right?
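A rough sketch of what model traversal looks like in practice, assuming a hand-maintained dependency map. The service names, the DEPENDS_ON map, and the root_causes function are hypothetical; real tools differ in the details.

```python
# Hypothetical, hand-maintained dependency model: service -> what it depends on.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["db", "cache"],
    "db": [],
    "cache": [],
}

def root_causes(symptom, failing, depends_on=DEPENDS_ON):
    """Walk the model from a failing node and return the failing nodes
    that have no failing dependency of their own."""
    roots, stack, seen = [], [symptom], set()
    while stack:
        node = stack.pop()
        if node in seen or node not in failing:
            continue
        seen.add(node)
        failing_deps = [d for d in depends_on.get(node, []) if d in failing]
        if failing_deps:
            stack.extend(failing_deps)
        else:
            roots.append(node)
    return roots

# The model is accurate here, so the traversal lands on the database.
modeled = root_causes("web", {"web", "api", "db"})

# "queue" was added last week and is not in the model, so the traversal
# stops at "api" and blames the wrong component.
stale = root_causes("web", {"web", "api", "queue"})
```

The traversal can only be as good as the map it walks. Once the real system drifts from the model, the answer is confidently wrong.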
Fault Propagation Models
This avenue forces you to start from what you already know and move backwards. That is, if one thing is failing, there is a 40% chance that its cause was some pre-existing alert. Yes, failure probabilities are incredibly useful for root-cause analysis, but forgive me for being a skeptic when your insights into possible failures depend on something already being on fire.
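The backwards reasoning described above is, at its simplest, Bayes' rule applied to an observed symptom. The sketch below assumes hypothetical candidate causes with made-up prior and likelihood values; it illustrates the idea and is not how any particular product computes it.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: P(cause | symptom) for each candidate cause,
    assuming exactly one of the listed causes is responsible."""
    joint = {cause: priors[cause] * likelihoods[cause] for cause in priors}
    total = sum(joint.values())
    return {cause: p / total for cause, p in joint.items()}

# Hypothetical numbers: how often each cause occurs (prior) and how
# likely it is to produce the observed alert (likelihood).
priors = {"db_down": 0.1, "net_partition": 0.3}
likelihoods = {"db_down": 0.8, "net_partition": 0.4}

ranked = posterior(priors, likelihoods)
# With these inputs, db_down comes out at roughly 0.4 and
# net_partition at roughly 0.6.
```

Note what the function needs before it can say anything: a symptom that has already fired, and a list of causes someone has already enumerated.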
While there are benefits to each one of these techniques – and Steinder and Sethi even suggest using a mix of these for improved accuracy – none are adequate for root-cause-analysis in the modern world. Anything rules-based, which all the above techniques ultimately are, becomes a bottleneck as systems change – which they do constantly.
Sole reliance on rules should be thrown out and forgotten simply because it limits systems to solving problems they have already seen. Although 35% of incidents are repeat occurrences, a massive 65% are brand new. Solving for novel faults with systems reliant on models of past experience is like taking a detour to work for the rest of your life, just because your usual road is under construction this week.