Machine Learning

Our Secret Sauce.

Moogsoft uses machine learning for real-time detection of IT incidents as they unfold across your production applications, infrastructure and monitoring tools.

Why Machine Learning for IT Operations?

Having invented and commercialized IBM Tivoli Netcool in the 90s, our founders sat back and observed the advent of web-scale, virtualization, software-defined X, BYOD, mobility, cloud, big data, and lately IoT. They came to realize that the sheer volume, velocity, and variety of events and alerts emitted by modern, dynamic IT infrastructure simply couldn’t be sustained by the rule-based architecture of legacy event managers. In large enterprise and service provider environments, these rule sets have grown into the thousands and now require heavy computing power.

Machine Learning is best suited for environments and use cases where data sets are large, dynamic and variable.

Kinda like the event streams from your applications, infrastructure and monitoring tools.

It Turns This

Into This (Actionable Insight)

Without This (Rules & Config)

Why is Moogsoft Machine Learning Unique?

Moogsoft’s machine learning is most advanced in three distinct areas:


Architected from the ground up for unsupervised machine learning, with no reliance on a CMDB, models or rules.


Focused on real-time clustering of related events into real, actionable incidents – we call them “situations”. This is not retrospective, or forensic analysis.

Multiple Algorithms

Built with multiple unsupervised and supervised algorithms to cluster related events into situations. We use multiple algorithms to inspect each event in many ways.

How Does Moogsoft Machine Learning Work?

During the initial cleaning process, our machine learning engine removes noise, blacklists unwanted events, and de-duplicates many others. This typically reduces hundreds of millions of raw events per day down to 1 million alerts. The engine then contextualizes this still-large volume of alerts into far fewer incidents, called “situations”. It looks at multiple variables to assess how “surprising or abnormal” an event is and how it relates to other events. These variables are:

  • Time


    The engine uses unsupervised learning to identify clusters of alerts that are temporally correlated, identifying underlying service outages or situations. The engine spots unusual patterns in the timestamps of events that may indicate that these events are related.

  • Linguistic


    The engine uses unsupervised learning to detect linguistic relationships in events. It groups alerts according to the similarity of linguistic attributes.

  • Topology


    The engine uses unsupervised machine learning to cluster events based on their network proximity, treating events from nearby locations as potentially correlated.

  • Ops-Team-Defined-Template


    IT Ops teams can create a template from a discovered situation and compare future situations against it. If there is a close match, IT Ops can use the template to reject the future situation as noise, kick off a specific remediation script or process, or do something in between.

  • Moogsoft Machine-Learned-Feedback

    Our engine can automatically learn from what the IT Ops team did with a previous situation and re-apply those actions, for example ignoring the situation or executing a set of remediation scripts.

  • Deterministic Cookbook-based (optional)

    The Cookbook gives you complete control over which alerts get clustered into situations. It lets you create situations according to a pre-defined Recipe (streaming SQL filters trigger the application of selected algorithms to events). You can create situations in a fully deterministic fashion while retaining the power of the machine learning algorithms.
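To make the flow above concrete, here is a minimal Python sketch of the general idea: de-duplicate raw events into alerts, then group alerts that are close in time and similar in wording. It is an illustration only, not Moogsoft's actual algorithms. The `Event` fields, the function names, the 60-second window, and the 0.5 similarity threshold are all assumptions chosen for the example, and topology is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds; hypothetical field
    host: str         # hypothetical field
    message: str      # hypothetical field


def deduplicate(events):
    """Collapse events sharing host and message into one alert (earliest kept)."""
    seen = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        seen.setdefault((event.host, event.message), event)
    return list(seen.values())  # insertion order is already time order


def jaccard(a, b):
    """Token-set similarity of two messages, between 0.0 and 1.0."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def cluster(alerts, max_gap=60.0, min_similarity=0.5):
    """Greedy grouping: an alert joins the first situation whose most recent
    alert is within max_gap seconds and has a similar enough message."""
    situations = []
    for alert in alerts:  # assumed sorted by timestamp
        for situation in situations:
            last = situation[-1]
            close_in_time = alert.timestamp - last.timestamp <= max_gap
            similar = jaccard(alert.message, last.message) >= min_similarity
            if close_in_time and similar:
                situation.append(alert)
                break
        else:
            situations.append([alert])
    return situations


# Made-up sample data for illustration.
events = [
    Event(0, "db1", "disk latency high on volume A"),
    Event(5, "db1", "disk latency high on volume A"),  # duplicate
    Event(20, "db2", "disk latency high on volume B"),
    Event(30, "web1", "login service timeout"),
    Event(4000, "db3", "disk latency high on volume C"),
]
alerts = deduplicate(events)
situations = cluster(alerts)
print(len(alerts), [len(s) for s in situations])  # 4 [2, 1, 1]
```

In this toy run the two disk-latency alerts that arrive within a minute of each other land in one situation, while the unrelated login timeout and the much later disk alert each form their own.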

What Are The Benefits?


Works in real time as incidents unfold, leading to faster Mean-Time-to-Detect and Mean-Time-to-Restore.

Enables your support teams to detect incidents through push notifications before end users detect them for you.


Proven to scale to 115 million raw events a day from the most diverse event feeds across the entire IT infrastructure.

Capable of processing up to 80,000 events/second.

Monitoring over 2 million hosts and counting.


Our engine reduces “noise” significantly, has less data to handle, and is therefore less error-prone. It ignores alert priority set by vendors and can detect many early stage issues or transient issues that may develop into incidents. The result: We don’t create false negatives in our customer production environments.

Finally, our multiple clustering algorithms give us 90% accuracy in identifying real situations, i.e., 10% false positives.
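Read literally, this figure describes the share of raised situations that turn out to be real, which is usually called precision. The short sketch below shows that arithmetic. The counts are hypothetical, chosen only so the numbers match the 90% and 10% quoted above, and the function name is an assumption for the example.

```python
def precision(true_positives, false_positives):
    """Fraction of raised situations that were real incidents."""
    raised = true_positives + false_positives
    if raised == 0:
        raise ValueError("no situations were raised")
    return true_positives / raised


# Hypothetical counts: 1,000 situations raised, 900 of them real.
real, spurious = 900, 100
print(precision(real, spurious))     # 0.9
print(spurious / (real + spurious))  # 0.1
```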

Moogsoft founders, technologists and engineers devised an elegant algebraic algorithm that makes traditionally offline techniques computable in real time. Rather than analyzing every possible combination of events into situations, this approach narrows down the combinations dramatically, minimizing computational complexity. This is the source of our advantages in processing speed, scale, and accuracy.