At DevWeek Austin, we discussed how AI and ML have arrived in the DevOps toolchain, and why they are such a good fit. Here are the three main takeaways.
1. The Past, Present and Future
AI and ML have been making appearances in operations for a few years now: first in attempts to automate Service Desk workflows, and then in remediation. The more modern interpretation of the latter is automated remediation, or self-healing. While fully automated self-healing is still some way off, AI can already be used to learn the most likely runbooks based on previous operator choices and interactions.
Right now, AI and ML are having the most significant impact on the Ops side of DevOps. When considering the "three ways" (flow, feedback, and experimentation), ML in observability and monitoring tools significantly accelerates feedback. Beyond rapidly identifying issues in production, an unexpected but welcome benefit is that ML's adaptability means less configuration is required, which further accelerates experimentation.
In fact, AI has made such an impact that there's even a term for it: "AIOps". It's a real thing, coined by Gartner in 2017!
Looking to the future, an exciting area of potential for AI is analyzing the likely impact of a change, at machine speed.
2. The importance of feature extraction
OK, this isn’t AI per se, but bear with me because it’s very relevant.
The goal of applying AI and ML to the toolchain is to provide scalable context with minimal manual intervention (the automation ethos). A lot of that context can be derived from the metadata accompanying the data. This session provided some time-series examples.
The first showed how labels provide metadata in Prometheus' text exposition format. In other systems, these are also called dimensions.
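For illustration, here is a representative sample in Prometheus' text exposition format (the metric name and label values are made up); the key-value pairs inside the braces are the labels:

```
# HELP node_cpu_seconds_total Seconds the CPU spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 245600.12
```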
The second, a StatsD example, showed how appending |g tells us the metric is a gauge (as opposed to a monotonic counter), and because the namespace contained cpu%, we can infer the value is a percentage.
In a third example, the concept of "feature extraction" was introduced. The example namespace "checkout.humblepi.cpu%:23|g" was used to show that you could extract a service moniker (checkout) and a resource id (the hostname humblepi), and that the presence of the term "cpu" allows the metric to be classified as compute. Once extracted, these features become sample fodder for the main show: machine learning.
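As a minimal sketch of that extraction step, assuming the service.resource.metric namespace layout from the talk's example (a real pipeline would need far more robust parsing):

```python
# Sketch of feature extraction from a StatsD-style sample.
# The "service.resource.metric" layout is an assumption taken from
# the talk's single example; it is illustrative, not a standard.
def extract_features(sample: str) -> dict:
    name, payload = sample.split(":", 1)        # "checkout.humblepi.cpu%", "23|g"
    value, metric_type = payload.split("|", 1)  # "23", "g"
    service, resource, metric = name.split(".")
    return {
        "service": service,    # service moniker, e.g. "checkout"
        "resource": resource,  # resource id (hostname), e.g. "humblepi"
        "metric": metric,
        "type": {"g": "gauge", "c": "counter"}.get(metric_type, metric_type),
        "unit": "percent" if "%" in metric else None,
        "class": "compute" if "cpu" in metric else "unclassified",
        "value": float(value),
    }

features = extract_features("checkout.humblepi.cpu%:23|g")
```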
3. The Significance of Clustering
Two examples were given of how ML could be used to cluster based on similarity matching on extracted features. In the first, the service identifier was used to group event and metric data for the “checkout” service. The challenge was that developers introduced slight variances in the identifier. For example: “checkout”; “Checkout”; “check out”; “check-out,” etc.
Now, this could be addressed with a very complex regular expression that took all the permutations into consideration. But there's one big problem: it only works for the checkout service. What about the other services in our microservices environment? What happens when a new service is introduced? No problem, that's where Natural Language Processing (NLP) comes in. By tokenizing the string, we can isolate stem words and compare them using the Dice similarity coefficient, covering the variances not just for checkout, but for ANY value of that feature.
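To sketch the idea, here is the Dice coefficient over character bigrams. The talk did not specify the exact tokenization scheme, so this is illustrative only; the point is that one similarity function handles every variant of every service name, with no per-service rules:

```python
# Similarity matching with the Dice coefficient, using character
# bigrams as tokens (an assumption; the talk's exact scheme may differ).
def bigrams(s: str) -> set:
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a: str, b: str) -> float:
    """Dice coefficient: 2|A & B| / (|A| + |B|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))

# Variants of "checkout" score high; an unrelated service scores low.
for variant in ["Checkout", "check out", "check-out"]:
    assert dice("checkout", variant) >= 0.8
assert dice("checkout", "payment") < 0.2
```

In practice you would group two identifiers when their score clears a chosen threshold; the same function, unchanged, works for any new service that appears.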
Another example, taken from an IEEE paper, showed how the same technique could be used to group events from a nationwide Automatic Teller Machine (ATM) network based on address. The challenge here, again, was the mind-boggling permutations of a single address: "51 1st St.", "51 First Street", "Fifty One 1st St" . . . you get the picture. In that example, a staggering 36 different permutations were found for a single ATM location!
NLP saved the day again. Not only did it perform better, achieving a higher F1 score (a measure of accuracy calculated from precision and recall), but it also scaled better, with no changes required.
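For reference, F1 is the harmonic mean of precision and recall, so a matcher has to do well on both to score well overall:

```python
# F1 score: harmonic mean of precision and recall. A high value on
# one measure cannot fully compensate for a low value on the other.
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Example with made-up numbers: precision 0.9, recall 0.6 yields
# an F1 well below their arithmetic mean of 0.75.
score = f1_score(0.9, 0.6)
```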
About the author