One of the most prominent terms in the vocabulary of anyone who works in IT is ‘Root-Cause.’ Highly skilled teams across IT organizations dedicate their careers to investigating the root-cause of service-impacting incidents, using tools that are supposed to help them identify those root-causes, typically by relying on historical models.

However, the only definitive way for root-cause analysis to be 100% accurate is to model every potential outcome of your IT environment. In today’s virtualized and highly redundant IT environments, this is clearly impossible. The outcomes and features of an enterprise-level IT environment are unpredictable at any given moment in time.

“At Moogsoft, we embrace unpredictability.”

– Richard Whitehead, Chief Evangelist, Moogsoft

Incident.MOOG applies machine-learning to massive volumes of IT telemetry in real-time to identify truly anomalous features that get clustered into groups of related alerts — we call them ‘Situations.’ This takes immense heavy-lifting away from humans.

But once you have a Situation, how does the operator quickly identify what caused it?

This can often be successfully accomplished by looking at the Situation Timeline or at the Knowledge Tab, where Incident.MOOG presents similar Situations from the past along with the remediating steps that were taken. However, to increase the degree of certainty, Moogsoft has taken a huge leap forward.

In the latest release of Incident.MOOG (5.1.7), Moogsoft announces the introduction of Probable Root Cause (PRC).

What is Probable Root Cause?

Probable Root Cause (PRC) is a supervised machine-learning process that interprets patterns in user-supplied feedback to identify which alerts in a Situation are ‘root-causes.’

Once the system’s neural net is adequately trained, PRC provides insight into where to begin troubleshooting and diagnosis, reducing the burden on operators and dramatically speeding up incident resolution.

How Does It Work?

When an operator identifies the root-cause(s) of a Situation, they can now label Alerts within Situations as Causal and Non-Causal with a single click.

User-Defined Root Cause

Incident.MOOG learns each time this is done. When new Situations are generated, Incident.MOOG assigns each Alert a ‘Root Cause Estimate.’ The Root Cause Estimate ranges from 0 to 100% and expresses how likely the Alert is to be the Root Cause of the Situation being viewed; the estimate improves as more labeled feedback accumulates. Each ‘bar’ shown for an Alert represents 10 percentage points of that probability.

Root Cause Estimate
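
The relationship between a Root Cause Estimate and its bars can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_to_bars` is hypothetical, and rounding down to whole bars is an assumption, since the article does not say how partial bars are displayed.

```python
def estimate_to_bars(estimate_pct, pct_per_bar=10):
    """Return the number of filled bars for a Root Cause Estimate (0-100%).

    Assumption: partial bars are rounded down; the article only states
    that each bar stands for 10%.
    """
    if not 0 <= estimate_pct <= 100:
        raise ValueError("Root Cause Estimate must be between 0 and 100")
    return int(estimate_pct // pct_per_bar)


print(estimate_to_bars(87))   # 8
print(estimate_to_bars(100))  # 10
print(estimate_to_bars(3))    # 0
```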

Each Situation also shows a ‘Max Root Cause,’ which is the highest Root Cause Estimate among the Alerts it contains. A value of 3%, for example, means that no Alert has more than a 3% probability of being the Root Cause. A value of 98% means that at least one Alert has a 98% probability of being the Root Cause.

Max Root Cause for Situations
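
Read this way, Max Root Cause is simply the maximum over the per-Alert estimates. A minimal sketch, using hypothetical alert identifiers and the two example values from the text (the function name `max_root_cause` is illustrative, not part of Incident.MOOG):

```python
def max_root_cause(alert_estimates):
    """Highest Root Cause Estimate (0-100) among a Situation's Alerts.

    `alert_estimates` maps a hypothetical alert id to its estimate.
    An empty Situation yields 0.
    """
    return max(alert_estimates.values(), default=0)


quiet_situation = {"alert-1": 3, "alert-2": 1, "alert-3": 2}
noisy_situation = {"alert-4": 12, "alert-5": 98, "alert-6": 40}

print(max_root_cause(quiet_situation))  # 3: no Alert exceeds 3%
print(max_root_cause(noisy_situation))  # 98: at least one Alert reaches 98%
```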

How Does Incident.MOOG Learn from Probable Root Cause?

Incident.MOOG applies machine-learning techniques that draw on alert features such as Severity, Host, Description, and Class, and uses a neural network to estimate the Root Cause probability for every Alert within a newly created Situation. Because the estimate is based on these features rather than on a match with a past Situation, PRC works even if the Situation has never been seen before.
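
To make the idea concrete, here is a rough sketch of supervised learning on those four features. It is not Moogsoft's implementation: the article does not describe the network's architecture, so a single logistic neuron stands in for the neural network, and the alerts, labels, and function names are all hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical operator feedback: alert features plus a label
# (1 = marked Causal, 0 = marked Non-Causal).
TRAINING = [
    ({"severity": "critical", "host": "db01",
      "description": "disk full", "class": "storage"}, 1),
    ({"severity": "minor", "host": "web01",
      "description": "slow response", "class": "application"}, 0),
    ({"severity": "critical", "host": "db02",
      "description": "disk full", "class": "storage"}, 1),
    ({"severity": "warning", "host": "web02",
      "description": "connection timeout", "class": "application"}, 0),
]


def featurize(alert):
    """Turn an alert into tokens such as 'severity=critical' or 'word=disk'."""
    tokens = [f"{key}={alert[key]}" for key in ("severity", "host", "class")]
    tokens += [f"word={w}" for w in alert["description"].lower().split()]
    return tokens


def predict(weights, bias, alert):
    """Probability (0-1) that the alert is causal, via a logistic neuron."""
    z = bias + sum(weights[t] for t in featurize(alert))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, lr=0.5):
    """Fit token weights with plain stochastic gradient descent."""
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        for alert, label in examples:
            error = predict(weights, bias, alert) - label
            bias -= lr * error
            for token in featurize(alert):
                weights[token] -= lr * error
    return weights, bias


weights, bias = train(TRAINING)

# An alert from a host that never appeared in the feedback.
new_alert = {"severity": "critical", "host": "db07",
             "description": "disk full on /var", "class": "storage"}
estimate = predict(weights, bias, new_alert)
print(f"Root Cause Estimate: {estimate:.0%}")
```

The new alert comes from an unseen host, yet it still scores as likely causal because it shares severity, class, and description words with alerts that were labeled Causal. That is the sense in which a feature-based model can handle Situations it has not encountered before.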

With the introduction of PRC, Incident.MOOG allows IT Ops and DevOps teams to leverage machine learning technology to learn from their everyday actions, and streamline future troubleshooting and diagnosis. Instead of applying rules and models to unpredictable environments, Incident.MOOG allows you to loosen your constraints and embrace unpredictability by leveraging data-driven models.

For any questions on Incident.MOOG’s PRC, reach out to

Get started today with a free trial of Incident.MOOG, a next-generation approach to IT Operations and Event Management. Driven by real-time data science, Incident.MOOG helps IT Operations and Development teams detect anomalies across their production stack of applications, infrastructure, and monitoring tools, all under a single pane of glass.