The Limits of Machine Learning for IT Operations Management

Wednesday July 27 2016 | Jason Bloomberg

The best way to get value from machine-learning-driven root-cause analysis is to pair automated analytics & human curation.

The primary challenge with big data? Figuring out what to do with so much data, of course. Collecting them, storing them, and even moving them around — no problem. Squeezing value out of them is another story.

IT operations analytics (ITOA) data is a case in point. We’ve now figured out how to generate and collect vast quantities of data, from log files to infrastructure metrics to bug-tracking data to real-time business data. Somewhere in all that dross are the gems of insight — the golden nuggets of information that can help us crack the toughest ITOA nuts, from identifying root causes of complex issues to predicting what’s going to happen next to actually preventing such problems in the first place.

Over the last few years, a number of big data analytics innovations have entered the marketplace, and perhaps the most promising of all is machine learning. The idea behind machine learning is software that can learn from experience: software that gets better with practice at whatever task we set for it, as though it were learning to play the piano.

Machine learning has an unmistakable appeal, as it gives techies a break. They no longer have to know how to solve particularly difficult problems directly. They simply have to be able to teach a piece of software how to learn how to solve the problems. Free the software to crunch your data, and lo and behold, the solution soon appears as though from thin air.

Machine Learning & ITOA

Today, machine learning-based analytics approaches are useful for particular types of problems. In the ITOA arena, the approach is particularly helpful for identifying certain root causes of issues that IT operators are responsible for fixing.

There are several reasons why cause identification is a difficult problem. Typical enterprise operational environments contain many disparate and dynamic elements, from networks to physical servers to virtual machines to databases to middleware to containers — and the list keeps growing. When something goes wrong with one component, the underlying cause may be a problem with an entirely different component.

The basic approach for identifying the root cause is to start with the adverse event and go back in time through the operational data looking for “what changed.” We assume everything was working properly until something changed, and that event led to the adverse event we’re investigating, often through a series of intermediate causes and effects.

Machine learning is particularly useful for this type of “what changed” analysis, as machine learning algorithms can easily crunch existing data to understand the “before” picture, thus making what changed easier to recognize.
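As a rough illustration of the “what changed” idea, the sketch below builds a baseline from the “before” data and flags metrics whose recent values stray far from it. It is a simple statistical stand-in, not a description of any vendor's algorithm; the function name `what_changed`, the metric names, the sample values, and the threshold of three standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev


def what_changed(before, after, threshold=3.0):
    """Return (metric, score) pairs whose 'after' mean strays from the 'before' baseline.

    before/after: dict mapping a metric name to a list of numeric samples.
    threshold: how many baseline standard deviations count as a change (assumed value).
    """
    changed = []
    for name, baseline in before.items():
        recent = after.get(name)
        if not recent or len(baseline) < 2:
            continue  # not enough data to compare
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            # Flat baseline: any difference at all counts as a change.
            score = float("inf") if mean(recent) != mu else 0.0
        else:
            score = abs(mean(recent) - mu) / sigma
        if score >= threshold:
            changed.append((name, score))
    # Largest deviation first.
    return sorted(changed, key=lambda item: item[1], reverse=True)


# Hypothetical samples taken before and after the adverse event.
before = {
    "db_latency_ms": [12, 11, 13, 12, 12, 11],
    "cpu_percent": [40, 42, 39, 41, 40, 43],
}
after = {
    "db_latency_ms": [55, 60, 58],
    "cpu_percent": [41, 40, 42],
}

suspects = what_changed(before, after)
for name, score in suspects:
    print(f"{name}: {score:.1f} standard deviations from baseline")
```

With these made-up numbers only the database latency is flagged, because CPU usage stays within its normal range. A real tool would have to cope with seasonality, noisy baselines, and far more metrics than this sketch does.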

Once the machine learning algorithm identifies a cause, it learns from the experience — and is thus able to recognize similar situations when they come along again. Machine learning-based predictive analytics comes from this ability to recognize patterns in the data that led to issues in the past, as such patterns indicate higher odds of such issues happening again.

The Limitations of Today’s Machine Learning

There’s no question that machine learning has come a long way over the last few years. For example, applying machine learning techniques to real-time alerts was out of reach until recently. Moogsoft has now implemented real-time machine learning that can quickly build incidents for the ops team.

However, in spite of such innovations, machine learning still has a way to go. For example, the more unusual an event becomes, the less able machine learning is to predict it.

In particular, what we might call “zero-day” events, meaning causes of issues that haven’t occurred before, are entirely beyond machine learning’s ability to predict or analyze. Even rare (but not unique) events represent insufficient data for machine learning to work with.

While predicting zero-day events may be beyond the power of machine learning today, identifying them once they occur is something that ITOA may be able to handle. The key, however, is identifying such events quickly, ideally in real time. Moogsoft AIOps is one of the few ITOA tools on the market today that offers the real-time machine learning necessary for prompt detection of such zero-day events.

Machine learning’s predictive sweet spot, therefore, centers on unusual events that are not particularly rare. That is an important subset of the issues an ops team will want to track down, but it is only a subset.

Machine learning is also quite poor at dealing with incorrect information. For example, let’s say the clock on one system is two minutes off. As a result, the timestamps in all the log files from that system are also off, throwing a wrench into the temporal analysis that is at the core of root cause analysis.

A human might mull over this problem for a while, and would likely figure out at some point that the timestamps were off. Machine learning, however, is unlikely to come to such a conclusion.
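To make the timestamp problem concrete, here is a minimal sketch of how a two-minute clock error can reverse the apparent order of cause and effect in a merged timeline. The host names, log messages, and the `merged_timeline` helper are hypothetical, and the correction step assumes someone has already worked out which clock is wrong and by how much.

```python
from datetime import datetime, timedelta


def merged_timeline(events, clock_offsets=None):
    """Order (host, timestamp, message) events by time.

    clock_offsets: optional dict mapping a host to how far its clock runs
    ahead of true time; that amount is subtracted before sorting.
    """
    clock_offsets = clock_offsets or {}
    corrected = [
        (ts - clock_offsets.get(host, timedelta(0)), host, msg)
        for host, ts, msg in events
    ]
    return [(host, msg) for _, host, msg in sorted(corrected)]


t0 = datetime(2016, 7, 27, 10, 0, 0)

# True order: the database disk fills at 10:00:00, then the app fails at 10:00:30.
# The database clock runs two minutes fast, so its log records 10:02:00.
events = [
    ("app", t0 + timedelta(seconds=30), "request errors"),
    ("db", t0 + timedelta(minutes=2), "disk full"),
]

naive = merged_timeline(events)
fixed = merged_timeline(events, {"db": timedelta(minutes=2)})

print(naive)  # the effect appears to come before the cause
print(fixed)  # the cause comes first once the offset is corrected
```

In the uncorrected timeline the application errors appear to precede the full disk, so a temporal analysis would look for the cause in the wrong place.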

A third weakness for machine learning: missing information. Let’s say five systems are involved in a cascading failure, with the root cause affecting the first system, which affects the next system in turn, and so on.

If we had log and monitoring data for all the systems, then machine learning would work fine. But let’s assume one of the systems in the middle of the sequence wasn’t generating any data. In this situation, machine learning is unlikely to be smart enough to fill in the blank — and may miss the causal connections altogether as a result.
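The five-system scenario can be sketched as a walk back through a dependency chain. This is an illustrative toy, not how any particular product works: the system names A through E, the `trace_root_cause` helper, and the idea of a simple set of "systems with a recorded anomaly" are assumptions, and the dependency map is assumed to contain no cycles.

```python
def trace_root_cause(failed, depends_on, anomalies):
    """Walk upstream from the failed system while each upstream system shows an anomaly.

    depends_on: dict mapping a system to the upstream system it depends on (or None).
    anomalies: set of systems whose monitoring data shows something abnormal.
    Returns the chain from the failed system to the furthest upstream system reached.
    """
    chain = [failed]
    current = depends_on.get(failed)
    while current is not None and current in anomalies:
        chain.append(current)
        current = depends_on.get(current)
    return chain


# A is the root cause; the failure cascades A -> B -> C -> D -> E.
depends_on = {"E": "D", "D": "C", "C": "B", "B": "A", "A": None}

# Every system reports data: the walk reaches the real root cause, A.
full = trace_root_cause("E", depends_on, {"A", "B", "C", "D", "E"})

# System C generates no data: the walk stops at D and never reaches A.
gap = trace_root_cause("E", depends_on, {"A", "B", "D", "E"})

print(full)
print(gap)
```

With the gap, the trace ends at D, which would then be mistaken for the root cause. A human operator who knows the architecture might guess that C sits between B and D and keep looking.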

The final weakness, which challenges humans and machine learning-based software alike, arises when a particular problem has two or more independent causes. A straightforward “what changed” analysis is unlikely to uncover any of the causes.

The Intellyx Take

It’s important to note that machine learning is undergoing a period of intense and rapid innovation, so everything in this article is a moving target. Vendors may be making improvements on any or all of the weaknesses I’ve called out as we speak.

That being said, it’s still important to recognize that the hype surrounding machine learning has sprinted ahead of actual capabilities by at least a few years — and that some of the products on the market may not do everything their companies say they will.

In the meantime, the best way to get value out of machine learning-driven root-cause analysis is to pair automated analytics with human curation of ops knowledge. When humans are better at a task, then the technology should support them, rather than the other way around.

This principle of machine learning combined with human curation should drive your selection of monitoring tools as well. For example, Moogsoft hides the complexity of the underlying automated data analysis, while presenting the ability for humans to collect, curate, and reuse the institutional knowledge that is still such an important part of the ops world today.

Copyright © Intellyx LLC. Intellyx publishes the Agile Digital Transformation Roadmap poster, advises companies on their digital transformation initiatives, and helps vendors communicate their agility stories. As of the time of writing, Moogsoft is an Intellyx customer. None of the other organizations mentioned are Intellyx customers. Intellyx retains final editorial control of this article.

