Machine learning has been around for over 50 years, but it is only now attracting mainstream interest from businesses across verticals. The use cases for ML in data-rich organizations are essentially infinite, but the role that ML will play in the digital era we are living through is best described by Jeff Bezos:
“Over the past decades, computers have broadly automated tasks that programmers could describe with clear rules and algorithms. Modern machine-learning techniques now allow us to do the same for tasks where describing the precise rules is much harder.”
In the world of IT monitoring, the introduction of machine learning couldn’t be more urgent, considering that all monitoring and event management tools (past and present) still require manually built rules to correlate events. This approach worked just fine in the 1990s, when environments were relatively simple and static, but today’s enterprise-scale production environments change faster than humans can react. Even with the best discovery and configuration/change management technology on the market, today’s IT environments cannot be modeled, and therefore cannot be managed by rules.
Funny enough, the IT monitoring and management landscape today is saturated with new vendors that are still addressing event correlation the old way.
In this blog, I will explain 1) the limitations of rules; 2) how other vendors are addressing event correlation; 3) how Moogsoft is addressing event correlation; and 4) why one approach is better than the rest.
There’s Nothing Algorithmic About Rules
So what do we mean when we talk about rules?
Rules are user-defined logic and variables. They require specific and well-defined inputs (IF “ ” THEN “ ”), and need to find exact matches to provide outputs.
The challenge is that, when you’re being this specific, you have to map out every scenario that could occur in your environment. That means building thousands, or even hundreds of thousands, of rules that need to be constantly maintained and updated. Guess what happens every time a change occurs? New rules need to be written. Rules-based solutions typically require a minimum of two to three dedicated resources.
Furthermore, this means that you can only detect what you’ve seen before. This is a problem, considering that EMA reported that 27% of IT incidents are repeated, meaning that 73% have never been seen before.
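To make the exact-match limitation concrete, here is a minimal Python sketch of an IF/THEN rule engine. It is a hypothetical illustration, not any vendor's implementation: the `Rule` class, the `evaluate` function, and the field names `host` and `check` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """IF every condition matches exactly THEN return the action."""
    conditions: dict  # field name -> exact value required (hypothetical fields)
    action: str


def evaluate(rules, event):
    """Return the actions of all rules whose conditions match the event exactly."""
    return [
        rule.action
        for rule in rules
        if all(event.get(field) == value for field, value in rule.conditions.items())
    ]


# One hand-written rule covering one known scenario.
RULES = [Rule({"host": "db01", "check": "disk_full"}, "page the database team")]

seen_before = {"host": "db01", "check": "disk_full"}
never_seen = {"host": "db02", "check": "disk_full"}  # new host, no rule written yet

print(evaluate(RULES, seen_before))
print(evaluate(RULES, never_seen))
```

The second event describes the same kind of failure on a host that was added after the rule was written, so nothing fires until someone writes another rule.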
Here is a beautiful example of a single rule from BMC BPPM:
[Image: a single rule from BMC BPPM]
The Event Correlation Landscape
Let’s look at a few examples of technologies that claim clustering/correlation abilities, and see what’s really going on behind the scenes.
IBM Netcool Operations Insight (NOI):
In the case of IBM, the technology actually isn’t new. NOI is the rebranded and bundled version of the 1990s Netcool technology. They claim ‘Advanced Event Analytics,’ yet NOI correlation requires an administrator to manually build correlation rules that get triggered when those exact conditions are met.
If you know someone using NOI in large-scale environments, they will tell you that they have built thousands of correlation rules, and continue to do so at a rapid pace.
But hey — no one gets fired for buying IBM…right?
ServiceNow:
ServiceNow is undoubtedly the leading ITSM solution on the market. Interestingly, they have also started to position event correlation with their event management and ITOM offering, led by their acquisition of the ServiceWatch product.
This works by having administrators manually classify alerts into primary and secondary categories and establish a relationship between them. Administrators must then manually build alert correlation rules to group related alerts. This information is stored in the ServiceNow event tables and then applied to data that is aggregated once a day, or more frequently with a scheduled job. In other words, it’s not done in real time.
With ServiceNow, correlation is limited to CI attributes and Time values.
PagerDuty:
PagerDuty is an SMB-focused notification tool. They aggregate events across many sources and, aside from notification and escalation, they can suppress alerts via manual rules.
Through the Infrastructure Health Application of the Operations Command Console, users can manually build rules to roll multiple event sources into a single user-defined ‘service.’ From there, users can apply filters across various attributes of alerts to look for potential relationships.
BigPanda:
Unlike all previously mentioned products, BigPanda is actually a 21st century tool, purpose-built for IT incident management. But can you guess what’s under the hood? Spoiler alert: it’s more rules.
Their Alert Correlation Engine (suspiciously renamed as such the week after Moogsoft’s Algorithmic Clustering Engine was announced) can group alerts from various sources through pre-built rules, and can be enhanced by custom-built rules. BigPanda can group alerts by finding matches across just three fields: Topology, Time, and Context.
Topology = host name or agent
Time = 2-hour or 30-minute window
Context = Keyword matches
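The three-field matching listed above can be sketched in a few lines of Python. This is a hedged illustration only and not BigPanda's code: the `related` function, the `KEYWORDS` set, the alert field names, and the choice to require all three fields to match at once are assumptions made for the example.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)           # one of the two windows listed above
KEYWORDS = {"disk", "cpu", "latency"}    # hypothetical keyword list


def keywords(text):
    """Return the configured keywords that appear in the alert text."""
    return {word for word in text.lower().split() if word in KEYWORDS}


def related(a, b, window=WINDOW):
    """Treat two alerts as related only if topology, time, and context all match."""
    same_host = a["host"] == b["host"]                      # Topology
    close_in_time = abs(a["time"] - b["time"]) <= window    # Time
    shared_keyword = bool(keywords(a["text"]) & keywords(b["text"]))  # Context
    return same_host and close_in_time and shared_keyword


noon = datetime(2017, 1, 1, 12, 0)
a = {"host": "web01", "time": noon, "text": "Disk usage high"}
b = {"host": "web01", "time": noon + timedelta(minutes=10), "text": "disk latency rising"}
c = {"host": "web01", "time": noon + timedelta(minutes=60), "text": "disk full"}
d = {"host": "web02", "time": noon + timedelta(minutes=5), "text": "Disk usage high"}

print(related(a, b), related(a, c), related(a, d))
```

Alerts `c` and `d` describe the same disk problem, yet one falls outside the window and the other sits on a different host, so a matcher of this kind leaves them ungrouped.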
As an example, based on the short window of time between the announcement of Moogsoft’s Algorithmic Clustering Engine (ACE) and BigPanda’s Alert Correlation Engine, combined with the textual similarity between the two, BigPanda should be able to detect that one has influenced the other and that they are, in fact, related.
With that, let’s talk about Moogsoft AIOps, and how the Algorithmic Clustering Engine (ACE) is liberating ITOps and DevOps from the limitations of rules.
What is Moogsoft’s Algorithmic Clustering Engine (ACE)?
Moogsoft’s founders are pioneers in IT Event Management, having invented Netcool/OMNIbus, now owned by IBM. When they decided to readdress the challenge of event correlation for digital enterprises suffering under rules-based approaches, they started by applying machine learning algorithms in their pure form to large volumes of IT telemetry. They quickly found that these algorithms didn’t work out of the box, and that this was a serious challenge.
Even when the algorithms were trained on massive amounts of historical data, results were weak, because today’s production environments are completely unpredictable. Over the past four years, the Moogsoft team has partnered with academic researchers and leading enterprise IT organizations to iteratively refine a comprehensive series of machine-learning algorithms, making them enterprise-ready, able to work in real time, and purpose-built for ITOps and Incident Management use cases. 14+ patents later, these algorithms are now deployed in concert, within a single product, across some of the largest IT environments in the world. And best of all, it works.
Moogsoft’s Algorithmic Clustering Engine (ACE) is a new category of machine learning that has been optimized and fine-tuned for the use cases of ITOps and DevOps. ACE is a data-driven approach to IT event correlation that dynamically clusters related alerts into Actionable Situations in real time. ACE combines the power of supervised and unsupervised algorithms to deliver actionable insight without dependency on static rules.
How does it work?
ACE infers relationships by applying a variety of techniques, including blacklisting, whitelisting, entropy, soft fuzzy matching, textual similarity, time occurrence, network proximity, graphing, neural feedback, and more. This unique approach to IT Event correlation provides your team with an instant, proactive view into service-impacting issues before your customers detect them. Best of all, these algorithms learn on the fly, and do not require historical training.
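As a rough illustration of how an entropy-style measure can separate routine noise from informative alerts, here is a small Python sketch. It is not Moogsoft's algorithm: it simply scores each alert by the average self-information of its tokens (the per-token quantity that Shannon entropy averages), and the function names and sample alerts are assumptions made for the example.

```python
import math
from collections import Counter


def token_surprisal(alerts):
    """Map each token to -log2(p), where p is its frequency in the alert stream."""
    counts = Counter(token for text in alerts for token in text.lower().split())
    total = sum(counts.values())
    return {token: -math.log2(n / total) for token, n in counts.items()}


def alert_score(text, surprisal):
    """Average surprisal of an alert's tokens; the text must come from the scored stream."""
    tokens = text.lower().split()
    return sum(surprisal[token] for token in tokens) / len(tokens)


# Hypothetical stream: eight routine heartbeats and one unusual alert.
stream = ["heartbeat ok"] * 8 + ["disk failure on db01"]
surprisal = token_surprisal(stream)

routine = alert_score("heartbeat ok", surprisal)
unusual = alert_score("disk failure on db01", surprisal)
print(routine < unusual)
```

Tokens that show up constantly carry little information, so the heartbeat scores low and the rare disk failure scores high, without anyone having written a rule for either.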
Moogsoft AIOps plugs into a variety of event sources (e.g., AppDynamics, Splunk, SolarWinds, Nagios) and ingests massive volumes of events in real time.
Lightweight ACE definitions tell Moogsoft AIOps how to route data from these sources to the appropriate algorithms. A definition takes minutes to create and requires zero maintenance. Our customers tell us that one definition is worth hundreds of rules.
As an example, GoDaddy currently manages 480 million events/week with just a few ACE Definitions and one part-time dedicated resource.
Unlike rules-based approaches, ACE doesn’t require well-defined inputs and exact matches. Instead, ACE focuses on key attributes of interest, and automatically applies combinations of unsupervised machine intelligence and supervised neural feedback to discover relationships.
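To show what grouping without exact matches can look like, here is a minimal Python sketch that clusters alerts by textual similarity and time proximity. It is an illustration under stated assumptions, not ACE itself: it uses the standard library's `difflib.SequenceMatcher` as the similarity measure, and the `cluster` function, its thresholds, and the alert fields `ts` and `text` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Fuzzy text similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(alerts, min_similarity=0.6, max_gap_seconds=300):
    """Greedy single pass: an alert joins the first cluster whose most recent
    alert is both close in time and textually similar; otherwise it starts a new one."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        for group in clusters:
            last = group[-1]
            if (alert["ts"] - last["ts"] <= max_gap_seconds
                    and similarity(alert["text"], last["text"]) >= min_similarity):
                group.append(alert)
                break
        else:
            clusters.append([alert])
    return clusters


alerts = [
    {"ts": 0, "text": "Connection timeout to db01 on port 5432"},
    {"ts": 20, "text": "Connection timeout to db02 on port 5432"},
    {"ts": 45, "text": "Fan speed warning on rack 7"},
    {"ts": 4000, "text": "Connection timeout to db03 on port 5432"},
]

for group in cluster(alerts):
    print([a["text"] for a in group])
```

The two timeouts that arrive seconds apart land in one group even though their host names differ, while the unrelated fan warning and the much later timeout each start their own group. No rule names `db01` or `db02` anywhere.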
Furthermore, ACE is agile and adaptive. Its algorithms adapt to application and infrastructure changes, which makes it easy to deploy and maintain.
About the author Sahil Khanna
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.