One of the most prominent terms in the vocabulary of anyone who works in IT is ‘Root-Cause.’ Highly skilled teams across IT organizations dedicate their careers to investigating the root-cause of service-impacting incidents, and they use tools that are supposed to help them identify those root-causes, typically through the use of historical models.
However, the only definitive way for root-cause analysis to be 100% accurate is to model every potential outcome of your IT environment. In today’s virtualized and highly redundant IT environments, this is clearly impossible. The outcomes and features of an enterprise-level IT environment are unpredictable at any given moment in time.
“At Moogsoft, we embrace unpredictability.”
Richard Whitehead, Chief Evangelist, Moogsoft
Incident.MOOG applies machine-learning to massive volumes of IT telemetry in real-time to identify truly anomalous features that get clustered into groups of related alerts — we call them ‘Situations.’ This takes immense heavy-lifting away from humans.
But once you have a Situation, how does the operator quickly identify what caused it?
This can often be successfully accomplished by looking at the Situation Timeline or at the Knowledge Tab, where Incident.MOOG presents similar Situations from the past along with the remediating steps that were taken. However, to increase the degree of certainty, Moogsoft has taken a huge leap forward.
In the v5.1.7 release of Incident.MOOG, Moogsoft announced the introduction of Probable Root Cause (PRC) to some customers as an Alpha test.
What is Probable Root Cause?
Probable Root Cause (PRC) is a supervised machine-learning process that interprets patterns in user-supplied feedback to identify which alerts in a Situation are ‘root-causes.’
Once the system’s neural net is adequately trained, PRC provides insight into where to begin troubleshooting and diagnosis, reducing the burden on operators and dramatically speeding up incident resolution.
How Does It Work?
When an operator identifies the root-cause(s) of a Situation, they’ll be able to label Alerts within Situations as Causal and Non-Causal with a single click.
User-Defined Root Cause
PRC will enable Moogsoft to learn each time this is done. When new Situations are generated, Incident.MOOG will assign a ‘Root Cause Estimate’ to each Alert. The Root Cause Estimate ranges from 0% to 100% and represents the estimated probability that the Alert is causal; the estimate should improve as the volume of labeled feedback grows. Each ‘bar’ shown for an Alert represents 10 percentage points of probability that the Alert is the Root Cause for the Situation being viewed.
Root Cause Estimate
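To make the bar display concrete, here is a minimal sketch of how a Root Cause Estimate could map to bars. The function names, the ten-bar total, and the choice to round partial bars down are assumptions made for illustration; the article does not say how Incident.MOOG handles values between two bars.

```python
def estimate_to_bars(estimate_percent: float, bar_width: float = 10.0) -> int:
    """Convert a Root Cause Estimate (0 to 100) into a count of filled bars."""
    if not 0.0 <= estimate_percent <= 100.0:
        raise ValueError("estimate must be between 0 and 100")
    # Assumption: a partially filled bar is rounded down.
    return int(estimate_percent // bar_width)


def render_bars(estimate_percent: float, total_bars: int = 10) -> str:
    """Draw the estimate as text, '#' for a filled bar and '.' for an empty one."""
    filled = estimate_to_bars(estimate_percent)
    return "#" * filled + "." * (total_bars - filled)
```

Under these assumptions, an Alert with a 73% estimate would show seven filled bars out of ten.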
Each Situation will also display a ‘Max Root Cause,’ which is the highest Root Cause Estimate among the Alerts it contains. A value of 3%, for example, means that no Alert has more than a 3% probability of being the Root Cause. A value of 98% means that at least one Alert has a 98% probability of being the Root Cause.
Max Root Cause for Situations
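Read this way, Max Root Cause is simply the maximum over the per-Alert estimates in a Situation. The sketch below assumes a Situation can be represented as a mapping from alert ID to Root Cause Estimate; the IDs and the function name are hypothetical.

```python
from typing import Dict


def max_root_cause(alert_estimates: Dict[str, float]) -> float:
    """Return the highest per-Alert Root Cause Estimate (in percent)."""
    # Assumption: a Situation with no scored Alerts reports 0%.
    if not alert_estimates:
        return 0.0
    return max(alert_estimates.values())


# Hypothetical Situation: one Alert stands out as the likely cause.
situation = {"alert-101": 3.0, "alert-102": 98.0, "alert-103": 12.5}
situation_max = max_root_cause(situation)
```

A low value tells the operator that no Alert in the Situation looks causal yet, while a high value points to at least one strong candidate.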
How Does Incident.MOOG Learn from Probable Root Cause?
Moogsoft’s PRC feature will apply machine-learning techniques that leverage features like Severity, Host, Description, and Class, and will use a Neural Network to estimate the root cause probability for all alerts within a newly created Situation. PRC will work even if the Situation has never been seen before.
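As a rough illustration of learning from labeled Alerts, the sketch below trains a single sigmoid unit (logistic regression, the simplest form of neural network) on the four features named above. It is not Moogsoft's model: the featurization, training loop, hyperparameters, and sample alerts are all assumptions chosen to keep the example small.

```python
import math
from typing import Dict, List, Tuple

Alert = Dict[str, str]


def featurize(alert: Alert) -> List[str]:
    """Turn an alert into sparse binary features such as 'severity=critical'."""
    feats = [f"{key}={alert[key].lower()}" for key in ("severity", "host", "class")]
    # Assumption: the description is treated as a simple bag of words.
    feats += [f"word={w}" for w in alert["description"].lower().split()]
    return feats


def predict(weights: Dict[str, float], bias: float, alert: Alert) -> float:
    """Estimate the probability (0 to 1) that the alert is causal."""
    z = bias + sum(weights.get(f, 0.0) for f in featurize(alert))
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(labeled: List[Tuple[Alert, int]], epochs: int = 200,
          lr: float = 0.5) -> Tuple[Dict[str, float], float]:
    """Fit the weights with stochastic gradient descent on log loss."""
    weights: Dict[str, float] = {}
    bias = 0.0
    for _ in range(epochs):
        for alert, label in labeled:
            error = predict(weights, bias, alert) - label
            for f in featurize(alert):
                weights[f] = weights.get(f, 0.0) - lr * error
            bias -= lr * error
    return weights, bias


# Hypothetical operator feedback: 1 = Causal, 0 = Non-Causal.
feedback = [
    ({"severity": "critical", "host": "db01", "class": "database",
      "description": "disk full on data volume"}, 1),
    ({"severity": "major", "host": "web01", "class": "application",
      "description": "connection timeout to database"}, 0),
    ({"severity": "critical", "host": "db02", "class": "database",
      "description": "disk full on log volume"}, 1),
    ({"severity": "minor", "host": "web02", "class": "application",
      "description": "slow response from api"}, 0),
]
weights, bias = train(feedback)

# Alerts from hosts that never appeared in the feedback.
new_cause = {"severity": "critical", "host": "db03", "class": "database",
             "description": "disk full on backup volume"}
new_symptom = {"severity": "minor", "host": "web03", "class": "application",
               "description": "slow response timeout"}
cause_score = predict(weights, bias, new_cause)
symptom_score = predict(weights, bias, new_symptom)
```

Because the model scores features rather than whole Situations, it can rank alerts from hosts it has never seen, which is consistent with the claim that PRC works on Situations that have never occurred before. With this toy data, the disk-full alert should score well above the slow-response alert.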
With the coming integration of PRC into AIOps, Moogsoft will allow ITOps and DevOps teams to leverage machine-learning technology to learn from their everyday actions, and streamline future troubleshooting and diagnosis. Instead of applying rules and models to unpredictable environments, Moogsoft will allow you to loosen your constraints and embrace unpredictability by leveraging data-driven models.
About the author Sahil Khanna
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.