While machine learning has found a place on the “Buzzword Bingo” board (along with Big Data, Data Science, DevOps, etc.), this advancing technology has proven to be more than just marketing jargon at Moogsoft. Last week, we hosted a webinar with our friends at DevOps.com to clarify what machine learning really is and isn’t, and how IT Operations teams can use this technology to reduce production incidents and outages. During the webinar, Moogsoft’s VP of Product Marketing, Steve Burton, and Evangelist in Chief, Richard Whitehead, discussed use cases and techniques for applying machine learning to automate incident management for service assurance.
View the entire recording of our webinar, “How to Reduce Production Incidents and Outages with Machine Learning”:
Go through our SlideShare presentation for a more detailed recap of the webinar slides:
Want to get started with machine learning in your IT production environment? Now you can give Moogsoft a try. Get a 30-day trial by filling out this form:
[Q&A] How to Reduce Production Incidents and Outages with Machine Learning
Q: To what extent do Moogsoft’s machine learning algorithms work out of the box, versus needing configuration and collaboration to work in my specific IT environment?
Steve Burton: It depends on the environment. Every environment is different, and there are multiple algorithms applied to the processing. Some work really well out of the box and some need tweaking; it really does depend on the environment. We typically see 65-75% working out of the box. The final tweaking comes from giving the algorithms feedback (telling them whether they got it right or wrong), but again, it depends on the environment.
Richard Whitehead: That’s a really good point, Steve, and we have a phase we call calibration. It’s an interactive phase that we go through with the operators, because even the unsupervised algorithms are tunable. The goal is to provide a series of situations that are meaningful to the individuals. The folks looking at the single pane of glass are the best people to say what’s meaningful and authentic. In some cases, operations’ requirements differ from the definitions prescribed by system architects: what constitutes a valuable situation in an operations context might be slightly different. It may be correct to say a problem has occurred, but it will be grouped differently to match the way they’ll actually go and reboot servers. So, yes, the algorithms do work out of the box, but do expect something of a calibration process to make things work well.
Q: Is there a threshold in terms of events per second in scale in order for machine learning to be effective?
Steve Burton: The more events, the better. The larger the data set, the better the results. If your environment is small, fairly predictable, and fairly static, then machine learning isn’t the best fit. The algorithms need data points, and the more data points you have, the better they can understand relationships. For me, anything more than 10,000 events a day is typically applicable for machine learning. Basically, anything humans can’t cope with is past the threshold.
Richard Whitehead: Obviously, the biggest benefit you’ll see in these environments is massive volume reduction, which means you’re doing significant amounts of clustering. But I’ve seen environments where the customer said, “Hey Richard, there is no event volume use case here.” In theory, to create a cluster you can argue you need at least two events to group together; otherwise there is no clustering going on. In that case, if you clustered every pair of events, that’s still a 50% event volume reduction. So there isn’t really a minimum threshold.
Q: You mentioned clustering a few times in terms of what you use machine learning for. Can you elaborate on the clustering concept a bit more?
Richard Whitehead: Clustering, in our context, is the technique we use to generate situations. If you were to look at a graph of the alerts as they come in, with the Y-axis representing time, a cluster sits along the horizontal axis: depending on the technique that’s used, it’s a group of related pieces of information, related by time, by natural language content, or by an external model in the form of a topology graph. Put those two things together and you get a situation. So, from our standpoint, clustering is the technique we use to group related alerts together to generate a situation.
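To make the idea concrete, here is a minimal, illustrative sketch of time-based alert clustering in Python. This is not Moogsoft’s actual algorithm; the alert data and the 60-second window are assumptions for the example, and real systems would layer in language and topology signals as Richard describes.

```python
# Hypothetical alert stream: (timestamp_seconds, message)
alerts = [
    (0, "db01 disk full"),
    (5, "db01 replication lag"),
    (300, "web03 high latency"),
    (304, "web03 connection timeout"),
]

def cluster_by_time(alerts, window=60):
    """Group alerts into clusters ("situations"): a new cluster starts
    whenever the gap to the previous alert exceeds `window` seconds."""
    clusters, current, last_ts = [], [], None
    for ts, msg in sorted(alerts):
        if last_ts is not None and ts - last_ts > window:
            clusters.append(current)
            current = []
        current.append((ts, msg))
        last_ts = ts
    if current:
        clusters.append(current)
    return clusters

situations = cluster_by_time(alerts)
# Four raw alerts collapse into two situations: a db01 pair and a web03 pair.
```

Here the four alerts become two situations, which is the volume reduction Richard mentions: operators triage two things instead of four.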
Q: How does machine learning deal with normal behavior that is not causing issues?
Steve Burton: In terms of normal behavior, if you look at environments and the events and exceptions that are thrown, normal behavior typically wouldn’t include words like “critical,” “severe,” “fatal,” or other words associated with abnormal behavior. Conversely, if everything seems fine but you do have events marked critical or fatal that are thrown every day and week of the year, the algorithms can learn this and say, “I see this every day, so this isn’t abnormal.” A lot of it comes down to linguistic content and the words associated with what the algorithms are observing. But again, if the machine learning algorithms get it wrong and detect a situation (which is an incident), and the operators say it isn’t a real situation, that feedback loops back to the algorithms, and things get adjusted on the fly.
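One simple way to capture Steve’s point that a scary-sounding event seen every day is probably routine is frequency counting. This sketch is an assumption-laden illustration, not Moogsoft’s implementation; the message history and the threshold of 10 are invented for the example.

```python
from collections import Counter

# Hypothetical history: one benign message fires daily for a month,
# one genuinely rare message appears once.
history = ["backup job finished with FATAL warning"] * 30
history += ["kernel panic on host app07"]

counts = Counter(history)

def is_probably_normal(message, threshold=10):
    """Treat an event as routine if it has been seen at least
    `threshold` times before, regardless of alarming words in its text."""
    return counts[message] >= threshold
```

Under this scheme the daily “FATAL” backup warning is classified as normal, while the one-off kernel panic is not; operator feedback would then correct any misclassification.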
Q: How are you able to correlate between the virtual software and the physical infrastructure layers below if you don’t have access to asset configuration database, or do you require that access?
Richard Whitehead: That’s basically where this notion of relaxing constraints comes in. In an ideal world, you’d understand the relationships because the CMDB is being driven on demand by the provisioning solution. The reality, however, is that’s not true. We have at least one customer who told us that anybody whose CMDB is more than 35% accurate deserves a medal. That said, how else could you make those relationships? That’s where being data driven comes in. You can help yourself by providing significant information in the data stream. We see that a lot in new environments, such as OpenStack, where they use ‘tagging’ to put labels in the raw data for correlation upstream. But as information becomes scarcer, you relax constraints and start to leverage the more unsupervised techniques. Ultimately, the catch-all is time: if two things complain at the same time, then they may be related. This goes back to the layered approach. Reliance on any one technique alone may yield unanticipated results, but apply multiple techniques and you can home in on the issue.
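The “relaxing constraints” idea can be sketched as a fallback chain: use an explicit shared tag when the data provides one, and only fall back to temporal proximity when it doesn’t. This is an illustrative sketch; the event records, tag names, and 30-second window are assumptions, not Moogsoft’s actual logic.

```python
def correlate(a, b, time_window=30):
    """Layered correlation sketch: a shared tag is strong evidence of a
    relationship; failing that, relax the constraint and fall back to
    co-occurrence in time (weaker evidence)."""
    if set(a["tags"]) & set(b["tags"]):
        return "related (shared tag)"
    if abs(a["ts"] - b["ts"]) <= time_window:
        return "possibly related (co-occurred in time)"
    return "unrelated"

# Hypothetical events: e1 and e2 share an OpenStack-style tenant tag;
# e3 carries no tags but fires close to e1 in time.
e1 = {"ts": 100, "tags": ["tenant-42"], "msg": "vm unreachable"}
e2 = {"ts": 900, "tags": ["tenant-42"], "msg": "hypervisor NIC down"}
e3 = {"ts": 110, "tags": [], "msg": "storage latency spike"}
```

The tag match links e1 and e2 even though they are 800 seconds apart, while e3, which has no tags, can only be linked to e1 by the weaker time heuristic.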
Q: On one of the slides in the middle of the webinar, you list different tasks that machine learning can help IT teams with. Are the different machine learning algorithms more suited for one or more of these tasks than others?
Richard Whitehead: Obviously, if your interest is in geo-location, then you’re going to use a semi-supervised technique, where you apply your clustering algorithms to very specific labels and areas within the data. For instance, you might be looking for three-letter city codes (SFO, NYC, etc.), and that might be everything you need to generate meaningful data. If you’re looking for that elusive needle in a haystack, however, and you have a vast amount of data coming from an application log and you just want to find the anomaly, then you would probably use a technique like significance detection, where you combine natural language processing with anomaly detection. So yes, which specific technique you apply depends on what you’re looking for.
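As a small illustration of the semi-supervised labeling Richard describes, the sketch below pulls known three-letter city codes out of free-form alert text so clustering could later be scoped by location. The code list and function are hypothetical, invented for this example.

```python
import re

# Hypothetical whitelist of location labels we care about.
CITY_CODES = {"SFO", "NYC", "LON", "CHI"}

def extract_location(alert_text):
    """Return the first known three-letter city code found in the
    alert text, or None if the alert carries no location label."""
    for token in re.findall(r"\b[A-Z]{3}\b", alert_text):
        if token in CITY_CODES:
            return token
    return None

extract_location("router core-1 in SFO dropping packets")  # -> "SFO"
```

A label extracted this way becomes the “very specific area within the data” that the clustering algorithms are then applied to.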
Q: What sources of alerts can be integrated with Moogsoft? For instance, can EMC Smarts, MS SCOM, Syslog, SNMP traps, etc. be integrated?
Steve Burton: The short answer is anything, because we use natural language processing; or rather, most of our machine learning algorithms rely on natural language processing, so it doesn’t matter what structure, syntax, or words are used. As you saw, Richard actually ingested Tweets. Moogsoft has over 100 integrations, with New Relic, AppDynamics, Splunk, Syslog, VMware, and all of the common event sources that a typical customer might have. Where there isn’t an integration, we have REST APIs, so you can just pump the events and alerts in through a REST API, and the machine learning algorithms will tokenize the text and analyze its structure.
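Tokenizing, in the sense Steve uses it, just means breaking arbitrary event text into word-level pieces the algorithms can compare, whatever the source’s format. A minimal sketch (an illustration, not Moogsoft’s tokenizer):

```python
import re

def tokenize(event_text):
    """Split free-form event text into lowercase word tokens,
    independent of the source's structure or syntax."""
    return re.findall(r"[a-z0-9]+", event_text.lower())

tokenize("CRITICAL: disk /dev/sda1 98% full on web-07")
# -> ['critical', 'disk', 'dev', 'sda1', '98', 'full', 'on', 'web', '07']
```

Because the same function handles a Syslog line, an SNMP trap description, or a Tweet, downstream algorithms can compare events from any source on equal footing.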