3 Things to Know About AI/ML in the DevOps Toolchain
Richard Whitehead | November 29, 2021

At DevWeek Austin, we discussed how AI and ML have come to the DevOps toolchain and are a great fit! Here are the 3 main takeaways.

1. The Past, Present and Future

AI and ML have been appearing in the toolchain for the past few years: first in attempts to automate Service Desk workflows, and then in remediation. The modern interpretation of the latter is automated remediation, or self-healing. While fully automated remediation is still a way off, AI can already be used to learn the most likely runbooks from previous operator choices and interactions.

Right now, AI and ML are having the most significant impact on the Ops side of DevOps. Considering the “three ways” (flow, feedback, and experimentation), ML in observability and monitoring tools significantly accelerates feedback. Beyond rapidly identifying issues in production, an unexpected but welcome benefit is that ML's adaptability means less configuration is required, which in turn accelerates experimentation.

In fact, AI has made such an impact that there’s even a term for it: “AIOps”. It’s a real thing, coined by Gartner in 2017!

Looking to the future, an exciting area for AI is analyzing the potential impact of a change, at machine speed.

2. The Importance of Feature Extraction

OK, this isn’t AI per se, but bear with me because it’s very relevant.

The goal of applying AI and ML to the toolchain is to provide scalable context with minimal manual intervention (the automation ethos). A lot of that context can be derived from meta-data accompanying the data. This session provided some time-series examples.

The first showed how labels provide meta-data in Prometheus’ Text Exposition Format. In other systems, these are also called dimensions.
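
As a rough illustration of how labels carry that metadata, here is a minimal Python sketch that pulls the label set out of a single exposition-format line. The sample line, metric name and label values are made up for illustration, and the parser is deliberately simplified (it does not unescape label values or handle comment lines).

```python
import re

# Hypothetical sample line; the metric name and label values are made up.
SAMPLE = 'http_requests_total{service="checkout",method="POST",code="200"} 1027'

# Metric name, optional {label set}, then the sample value.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
)
# One label="value" pair; backslash-escaped characters in the value are tolerated.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition-format sample into (name, labels, value)."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"not a metric sample: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(SAMPLE)
print(name, labels, value)
```

Each key in the resulting dictionary is a ready-made dimension that downstream ML can group or filter on without any manual tagging.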

The second, a StatsD example, showed how appending |g tells us the metric is a gauge (as opposed to a counter), and because the namespace contained cpu%, we can infer that the value is a percentage.
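
The same inference can be sketched in a few lines of Python. This is a simplified reading of the StatsD line format, assuming numeric values only and ignoring sample rates; the function name parse_statsd and the rule "a % in the name means percent" are assumptions made for illustration.

```python
# Type suffixes from the StatsD line protocol.
STATSD_TYPES = {"c": "counter", "g": "gauge", "ms": "timer", "s": "set"}


def parse_statsd(line):
    """Parse '<bucket>:<value>|<type>' and infer a unit from the bucket name."""
    bucket, sep, rest = line.rpartition(":")
    fields = rest.split("|")
    if not sep or not bucket or len(fields) < 2:
        raise ValueError(f"not a StatsD line: {line!r}")
    return {
        "bucket": bucket,
        "value": float(fields[0]),  # assumption: numeric values only
        "type": STATSD_TYPES.get(fields[1], "unknown"),
        # Assumption: a '%' in the bucket name signals a percentage.
        "unit": "percent" if "%" in bucket else None,
    }


print(parse_statsd("checkout.humblepi.cpu%:23|g"))
```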

In a third example, the concept of “feature extraction” was introduced. The example metric “checkout.humblepi.cpu%:23|g” was used to show that you could extract a service moniker (checkout) and a resource ID (the hostname humblepi), and that the presence of the term “cpu” allows the metric to be classified as compute. Once these features have been extracted, they become fodder for the main show: machine learning.
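
A minimal sketch of that extraction step might look like the following. It assumes the namespace follows a service.host.metric convention, as in the example, and the keyword table RESOURCE_CLASSES is hypothetical; only the cpu-to-compute mapping comes from the session.

```python
# Hypothetical keyword table; only "cpu" -> "compute" comes from the example.
RESOURCE_CLASSES = {
    "cpu": "compute",
    "mem": "memory",
    "disk": "storage",
    "net": "network",
}


def extract_features(metric_line):
    """Derive service, host and resource class from a dotted namespace."""
    namespace = metric_line.split(":", 1)[0]
    parts = namespace.split(".")
    if len(parts) < 3:
        raise ValueError(f"expected service.host.metric, got {namespace!r}")
    service, host, leaf = parts[0], parts[1], parts[-1]
    # Classify by the first keyword found in the final namespace segment.
    resource_class = next(
        (cls for keyword, cls in RESOURCE_CLASSES.items() if keyword in leaf.lower()),
        "unknown",
    )
    return {"service": service, "host": host, "metric": leaf, "class": resource_class}


print(extract_features("checkout.humblepi.cpu%:23|g"))
```

The output is a small feature record per sample, which is the shape of input that a clustering step can consume.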

3. The Significance of Clustering

Two examples were given of how ML could be used to cluster based on similarity matching on extracted features. In the first, the service identifier was used to group event and metric data for the “checkout” service. The challenge was that developers introduced slight variances in the identifier. For example: “checkout”; “Checkout”; “check out”; “check-out,” etc.

Now, this could be addressed with a very complex regular expression that took into consideration all the permutations. But there’s one big problem: it only works for the checkout service. What about the other services in our microservices environment? What happens when a new service is introduced? No problem, that’s where Natural Language Processing (NLP) comes in. By tokenizing the string, we can isolate stem words and compare them using a Dice similarity coefficient. We can include all the variances, not just for checkout, but ANY value for that feature.
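
To make the idea concrete, here is a small Python sketch of Dice-based grouping. It uses one common variant, the Dice coefficient over sets of character bigrams after stripping case and separators, which is not necessarily the exact tokenization or stemming used in the session. The 0.7 threshold, the "payments" service and the greedy grouping strategy are assumptions for illustration.

```python
import re


def bigrams(text):
    """Lowercase, drop non-alphanumerics, and return the set of character bigrams."""
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    return {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}


def dice(a, b):
    """Dice coefficient: 2 * |X ∩ Y| / (|X| + |Y|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def group_similar(values, threshold=0.7):
    """Greedy grouping: join the first group whose first member is similar enough."""
    groups = []
    for value in values:
        for group in groups:
            if dice(value, group[0]) >= threshold:
                group.append(value)
                break
        else:
            groups.append([value])
    return groups


# "payments" is a hypothetical second service, added to show the split.
names = ["checkout", "Checkout", "check out", "check-out", "payments"]
print(group_similar(names))
```

Nothing in the code mentions checkout, so a newly introduced service is handled without any changes.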

Another example, taken from an IEEE paper, showed how the same technique could be used to group events from a nationwide automated teller machine (ATM) network based on an address. The challenge here, again, was the mind-boggling number of permutations of a single address: “51 1st St.”, “51 First Street”, “Fifty One 1st St” . . . you get the picture. In the example, a staggering 36 different permutations were found for a single ATM location!
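
For addresses, the comparison can work on whole tokens rather than characters. The sketch below is an illustration only and is not taken from the paper: the CANONICAL table is a hypothetical, deliberately tiny normalization map, and the Dice coefficient is computed over sets of normalized tokens.

```python
import re

# Hypothetical normalization table; a real one would be far larger.
CANONICAL = {"st": "street", "first": "1st"}


def address_tokens(address):
    """Lowercase, split on non-alphanumerics, and map known variants to one form."""
    words = re.findall(r"[a-z0-9]+", address.lower())
    return {CANONICAL.get(word, word) for word in words}


def address_similarity(a, b):
    """Dice coefficient over the two sets of normalized tokens."""
    x, y = address_tokens(a), address_tokens(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


print(address_similarity("51 1st St.", "51 First Street"))
print(address_similarity("51 1st St.", "Fifty One 1st St"))
```

The first pair normalizes to identical token sets, while the spelled-out house number in the second only matches partially, which is why a similarity threshold is used rather than exact matching.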

NLP saved the day again. Not only did it perform better, achieving a higher F1 score (the harmonic mean of precision and recall), but it scaled better too, with no changes.
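
For readers who have not met the F1 score, here is how it is computed from counts of true positives, false positives and false negatives. The counts in the usage line are invented purely to show the arithmetic and are not results from the paper.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 = harmonic mean of precision and recall."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts, for illustration only.
print(f1_score(true_pos=8, false_pos=2, false_neg=2))
```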

Moogsoft is the AI-driven observability leader that provides intelligent monitoring solutions for smart DevOps. Moogsoft delivers the most advanced cloud-native, self-service platform for software engineers, developers and operators to instantly see everything, know what’s wrong and fix things faster.

About the author


Richard Whitehead

As Moogsoft's Chief Evangelist, Richard brings a keen sense of what is required to build transformational solutions. A former CTO and Technology VP, Richard brought new technologies to market, and was responsible for strategy, partnerships and product research. Richard served on Splunk’s Technology Advisory Board through their Series A, providing product and market guidance. He served on the Advisory Boards of RedSeal and Meriton Networks, was a charter member of the TMF NGOSS architecture committee, chaired a DMTF Working Group, and recently co-chaired the ONUG Monitoring & Observability Working Group. Richard holds three patents, and is considered dangerous with JavaScript.
