I had some rather large ‘WTF moments’ last week after speaking with three enterprise monitoring teams. The biggest was a service provider who was generating 600,000 events an hour across 40,000 servers… and, wait for it… 47,000 help desk tickets a month with 2,000+ level 2 escalations. That works out at 66 escalations a day, but that isn’t the bad news. The bad news is that those 47,000 help desk tickets had to be manually analyzed, prioritized and triaged by hundreds of people.

Event Management Today

Event management today is a manual, labor-intensive (and expensive) activity for IT operations to scale. Enterprises are literally spending $10m+ a year managing event storms in hopes that they can detect anomalies and incidents before their business is impacted in production.

In the past, when event volumes were relatively small and static, enterprises typically managed events using a legacy Manager of Managers (MoM) like IBM Netcool or CA Spectrum. IT operations would write and maintain basic rules and filters to suppress and correlate events (e.g. if event A and event B occur at the same time, merge them into event C). This form of event management worked when IT operations had hundreds of well-known events. Today, however, IT operations teams are faced with millions of events, and no single person can write rules or filters fast enough to keep up. This is why machine learning and data science for IT operations is all the rage right now, and why I thought it would be good to compare two approaches to solving this event management problem.
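To make the legacy approach concrete, here is a minimal sketch of what such a hand-written merge rule might look like in Python. The event fields, rule conditions and the merged event type are illustrative assumptions on my part, not actual Netcool or Spectrum rule syntax:

```python
# Sketch of a hand-written MoM-style merge rule: if a "link down" and a
# "ping fail" event fire on the same host within 5 minutes, merge them
# into a single synthetic "network outage" event. Field names and the
# 5-minute window are illustrative assumptions.
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def apply_merge_rule(events):
    merged = []
    for a in events:
        for b in events:
            if (a["type"] == "link down" and b["type"] == "ping fail"
                    and a["host"] == b["host"]
                    and abs(a["time"] - b["time"]) <= WINDOW):
                merged.append({"type": "network outage",
                               "host": a["host"],
                               "time": min(a["time"], b["time"]),
                               "sources": [a, b]})
    return merged

events = [
    {"type": "link down", "host": "web01", "time": datetime(2016, 5, 2, 9, 0)},
    {"type": "ping fail", "host": "web01", "time": datetime(2016, 5, 2, 9, 2)},
]
print(apply_merge_rule(events))  # one merged "network outage" event
```

The problem is obvious: every new event type or topology change means another rule like this, written and maintained by hand.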

The Perfect Event Storm

Below is a scaled-down (and rather simplistic) diagram that illustrates what a typical level 1 or enterprise monitoring team might experience. This diagram shows multiple event sources/tools, along with the number of events fired from each over the period of a day.

[Diagram: multiple event sources/tools and the number of events each fires per day]

Most enterprises would use level 1 operators to manually analyze each of the above 93 events and create tickets for the ones deemed important or anomalous.

Challenges with this approach:

  • Takes operators 30+ mins to detect anomalies
  • Operators lack situational awareness and cross-source correlation, as different operators might be analyzing different event sources
  • High event to ticket ratio
  • High frequency of duplicate tickets
  • High productivity burn for teams investigating tickets

Event Aggregation

One way to solve the problems listed above is to automate some of this event analysis for level 1 operators. Vendors such as BigPanda use event aggregation to achieve this. This is done by aggregating events/alerts by their event source (e.g. Nagios) and using attributes like host ID and time to reduce, group and represent multiple events as a single incident.
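As a rough sketch of how this grouping works (my own illustration of the general technique, not BigPanda’s actual implementation), the code below buckets alerts by event source, host and a one-hour time window, so that each bucket becomes one incident:

```python
# Event aggregation sketch: group alerts by (source, host, hour bucket).
# Each group of raw events is surfaced to operators as one incident.
from collections import defaultdict
from datetime import datetime

def aggregate(events, bucket_seconds=3600):
    incidents = defaultdict(list)
    for e in events:
        bucket = int(e["time"].timestamp()) // bucket_seconds
        key = (e["source"], e["host"], bucket)
        incidents[key].append(e)
    return incidents

events = [
    {"source": "Nagios", "host": "db01", "time": datetime(2016, 5, 2, 9, 5),  "msg": "CPU high"},
    {"source": "Nagios", "host": "db01", "time": datetime(2016, 5, 2, 9, 40), "msg": "CPU high"},
    {"source": "Splunk", "host": "db01", "time": datetime(2016, 5, 2, 9, 10), "msg": "error rate spike"},
]
for key, group in aggregate(events).items():
    print(key, "->", len(group), "event(s)")  # 2 incidents from 3 events
```

Note that the two Nagios alerts collapse into one incident, but the Splunk alert on the same host stays separate: aggregation groups within a source, not across sources.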

For instance, using our event storm example we can see that event aggregation could be applied to reduce and group these 93 disparate events into 15 individual incidents:

[Diagram: the 93 events aggregated by source into 15 incidents]

Now, instead of level 1 operators analyzing 93 disparate alerts, they would simply analyze 15 incidents, an 84% reduction in workload.

Benefits of Event Aggregation:

  • Works extremely well in small application environments where you have a few noisy event sources (e.g. < 5 event sources and < 100 hosts)
  • Reduces operators’ time-to-detect
  • Reduces the event-to-ticket ratio

Challenges of Event Aggregation:

  • Operators still lack situational awareness across event sources, so duplicate tickets and redundant troubleshooting can still occur
  • The time window used for event aggregation (e.g. one hour) might be too coarse for highly dynamic environments where change is constant
  • Productivity burn still exists in large application environments where you have lots of noisy event sources (e.g. > 5 event sources and > 100 hosts)

Event Correlation

Another way to manage events is to reduce and correlate events across different event sources using machine learning algorithms, as Moogsoft does. This is done by tokenizing and analyzing the natural language in each event, and looking for related attributes, patterns and anomalies that can be inferred. For example, you can apply topological algorithms to events to validate their network proximity, in addition to time-based or linguistic algorithms that measure language similarity.
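To illustrate just one ingredient of this approach, the toy sketch below clusters events by linguistic similarity: messages are tokenized, and events whose vocabularies overlap enough (by Jaccard similarity) are grouped into one incident. This is my own simplified illustration of the general idea, not Moogsoft’s actual algorithms, which also weigh topology and time:

```python
# Toy linguistic correlation: tokenize each event message and cluster
# messages whose token sets have Jaccard similarity above a threshold.
import re

def tokens(msg):
    return set(re.findall(r"[a-z0-9]+", msg.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster(messages, threshold=0.5):
    incidents = []  # each incident is a list of (message, token set) pairs
    for msg in messages:
        t = tokens(msg)
        for inc in incidents:
            if any(jaccard(t, u) >= threshold for _, u in inc):
                inc.append((msg, t))
                break
        else:
            incidents.append([(msg, t)])
    return [[m for m, _ in inc] for inc in incidents]

msgs = [
    "db01 connection pool exhausted",
    "connection pool exhausted on db02",
    "disk /var 95% full on web03",
]
print(cluster(msgs))  # two incidents: the two pool alerts group together
```

Because the similarity is computed on the event text itself, this kind of grouping works across event sources and hosts, which is exactly what source-by-source aggregation cannot do.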

Using the same event storm example, we can see that event correlation could be applied to reduce and group these 93 disparate events into 2 individual incidents:

[Diagram: the 93 events correlated across sources into 2 incidents]

Now, instead of level 1 operators analyzing 93 disparate alerts, they would simply analyze 2 incidents, a 98% reduction in workload.

Benefits of Event Correlation:

  • Works extremely well in large application environments where you have lots of noisy event sources (e.g. > 5 event sources and > 100 hosts)
  • Operators have complete situational awareness across event sources, so duplicate tickets are minimal
  • Reduces operators’ time-to-detect
  • Reduces the event-to-ticket ratio
  • Reduces the frequency of duplicate tickets
  • Reduces the productivity burn of teams investigating tickets

Challenges of Event Correlation:

  • Typically overkill for small application environments where you have few noisy event sources (e.g. < 5 event sources and < 100 hosts)
  • Integrating multiple event sources requires upfront effort
  • Machine learning algorithms may need tuning/optimizing depending on environment

So there you have it: event management today is still a massive problem for enterprises and service providers, literally costing them millions of dollars a year in labor. Software does exist to aggregate and correlate these events for IT operations, but each approach has its own benefits and challenges.

Which approach makes the most sense for you?

For more information on Moogsoft, or to request a 30-day trial, please contact info@moogsoft.com