Coffee Break Webinar Series: Intelligent Observability for DevOps
David Conner | March 8, 2021

A selection of live questions and answers from the audience of our recent webinar on how DevOps practitioners can best leverage intelligent observability to make the most of their precious work time.

Amidst the nonstop pace of work to constantly evolve today’s digital business, we can forget to take a moment to think about how we’re doing that work. A new series of ‘coffee break’ webinars aims to provide that opportunity by pausing to look at the ways humans can best work with observability data. In particular, Coffee Break with Helen Beal looks at improving the work done by different types of software engineers that leverage artificial intelligence.

The debut episode of Coffee Break featured a conversation between Beal, a DevOps ways-of-working coach, and Moogsoft Chief Evangelist Richard Whitehead. It focused on how DevOps practitioners can best leverage intelligent observability to make the most of their precious work time. During the chat, Beal and Whitehead answered questions from the audience and each other about how AI can boost productivity and innovation throughout the DevOps pipeline.

Based partially on a blog post about a character named Sarah, who encounters a troubling anomaly that threatens her work as a DevOps engineer for an eCommerce site, the conversation between Beal and Whitehead delves into observability practices that harness the power of AI. The result is that Sarah is able to use AI to efficiently detect and remediate anomalies before they impact services and customer experience. 

The full webinar is available to watch on demand, and below is a selection of live audience questions with answers.


What represents good configuration data, e.g. service profiles, that would allow for us to adequately and accurately correlate related data or events? 

Whitehead: If you have any information that can relate discrete components of your system to a specific service, that’s definitely something that you would want to correlate on. We frequently refer to that as enrichment.

When data comes in, you can enrich that by cross-referencing it against some kind of service catalog or external data. In the old days, it used to be some kind of configuration management database (CMDB). But really, it is anything that allows you to tag that message with something that references it to something else you understand, like an application or a customer or a service that’s been offered. These are definitely things that can serve as the seed to generate some meaningful correlation.
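As a rough illustration of the enrichment idea described above, the following Python sketch tags incoming events with data from a service catalog and then groups them by service. The catalog contents, the field names (`host`, `tags`, `service`, `customer`) and the function names are all assumptions made for this example; they do not describe any particular product's data model.

```python
# Hypothetical service catalog: maps a host name to the service and
# customer it supports. In practice this could come from a CMDB or
# any other external source.
SERVICE_CATALOG = {
    "web-01": {"service": "checkout", "customer": "storefront"},
    "web-02": {"service": "checkout", "customer": "storefront"},
    "db-07": {"service": "orders-db", "customer": "storefront"},
}


def enrich(event, catalog=SERVICE_CATALOG):
    """Return a copy of the event tagged with catalog data for its host."""
    enriched = dict(event)
    catalog_tags = catalog.get(event.get("host"), {})
    # Keep any tags the event already carries and add the catalog's.
    enriched["tags"] = {**event.get("tags", {}), **catalog_tags}
    return enriched


def correlate_by_service(events):
    """Group enriched events by their 'service' tag."""
    groups = {}
    for event in events:
        service = event["tags"].get("service", "unknown")
        groups.setdefault(service, []).append(event)
    return groups


raw_events = [
    {"host": "web-01", "message": "HTTP 500 rate high"},
    {"host": "web-02", "message": "response time degraded"},
    {"host": "db-07", "message": "slow queries"},
    {"host": "cache-99", "message": "evictions rising"},
]
events = [enrich(e) for e in raw_events]
groups = correlate_by_service(events)
```

Here the two web hosts end up in the same `checkout` group because the catalog relates them to the same service, while the host that is missing from the catalog falls into `unknown`.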

Is there any major move towards the use of telemetry from underlying silicon devices such as CPUs, network interface controller devices or any hardware device that can send telemetry data for data correlation by algorithms?

Whitehead: It's very tempting when you’re using a SaaS solution to forget that at the end of the day every single instruction we’re asking a system to do is being handled by a real CPU. So, whether you’re getting it from your service provider or directly, I think CPU utilization and memory utilization are not only very important performance characteristics but, these days, an economic component as well. Ignore that telemetry at your own risk. It doesn’t go away just because it’s somebody else’s hardware.

What you are presenting here on DevOps, is that not overlapping with AIOps, and are they not used interchangeably?

Beal: DevOps for me is a really broad conversation that started with us wanting IT Ops people to understand what the developers were doing when they were trying to be agile, and has morphed or evolved into something that really is talking about optimizing the flow of the end-to-end technology value stream. 

AIOps has kind of come from a different place. It’s kind of like a superset of monitoring and it is really about applying artificial intelligence practices and principles and technologies to the IT Ops side of the house, in order to assist with some of the goals that we have in DevOps. And specifically one of the other metrics that we haven’t talked about yet, which is MTTR, is really where I think this has been born from. 

Ingesting telemetry data from the myriad of monitoring tools is not a trivial task. Are there any quick wins we could go after with applying any types of AI methods on alert and event data that our monitoring tools trigger based on KPIs?

Whitehead: Everybody likes to look towards standards, but standards have a tendency to proliferate, so I wouldn’t call that a quick win. We’ve seen a lot of very repeatable patterns in on-premises environments and actual data centers where people are using a message bus to aggregate telemetry data, and that definitely helps. It typically leads the users towards coming up with a well-defined schema for the data, and any form of normalization that you can do upfront definitely makes it easier to deal with this vast array of sources. 

Beyond that, it’s really about being disciplined and understanding the concept of labeling and tagging. That’s probably the nearest you’re going to get to a quick win. What I’ve seen is that the bus could be anything, but it’s almost always Kafka. Generally speaking, this kind of aggregation is most valuable when you’re aggregating your telemetry on premises before shipping it to the cloud. 
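To make the normalization and labeling point concrete, here is a minimal Python sketch that maps events from two monitoring tools into one shared schema before they would be published to a message bus. Both source formats, the schema fields and the severity mappings are invented for illustration, and the bus itself (Kafka or otherwise) is left out.

```python
from datetime import datetime, timezone

# Hypothetical severity mappings for two imaginary monitoring tools.
SEVERITY_BY_NUMBER = {1: "info", 2: "warning", 3: "critical"}
SEVERITY_BY_LEVEL = {"INFO": "info", "WARN": "warning", "CRIT": "critical"}


def normalize_source_a(raw):
    """Tool A (assumed): epoch seconds, 'hostname', numeric 'sev'."""
    return {
        "timestamp": datetime.fromtimestamp(raw["ts"], tz=timezone.utc).isoformat(),
        "source": raw["hostname"],
        "severity": SEVERITY_BY_NUMBER.get(raw["sev"], "unknown"),
        "message": raw["msg"],
        "labels": {"tool": "source_a", **raw.get("labels", {})},
    }


def normalize_source_b(raw):
    """Tool B (assumed): ISO 8601 time with offset, 'node', text 'level'."""
    parsed = datetime.fromisoformat(raw["time"]).astimezone(timezone.utc)
    return {
        "timestamp": parsed.isoformat(),
        "source": raw["node"],
        "severity": SEVERITY_BY_LEVEL.get(raw["level"], "unknown"),
        "message": raw["text"],
        "labels": {"tool": "source_b", **raw.get("labels", {})},
    }


event_a = normalize_source_a(
    {"ts": 0, "hostname": "web-01", "sev": 3, "msg": "disk full",
     "labels": {"service": "checkout"}}
)
event_b = normalize_source_b(
    {"time": "2021-03-08T13:00:00+01:00", "node": "db-07",
     "level": "WARN", "text": "replication lag"}
)
```

Once every source is reduced to the same fields and the same label conventions, anything downstream can treat the events uniformly regardless of which tool produced them.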

Is there a video or talk from Moogsoft that explains more about how Moogsoft utilizes AI rather than just what it does or what it delivers?

Whitehead: There are a number of different techniques. Let me describe a few. The one that has the most immediate impact and that you would see earliest in the product is based on adaptive thresholding, which is the ability to look at a time series metric and establish thresholds based on standard deviation, which is a relatively simple but very valuable component. 

From an operational standpoint, the ability for a system to learn the new normal in terms of behavior has benefits. An example would be that you’ve made a change, perhaps a code push, and a key performance indicator goes out of bounds. When that happens, it’s likely to trigger an anomaly. Operationally, you’ve got two possibilities at that point. You can either say “this is not good” and fix it, or you can say “no, this was expected” and just suppress the alert and then let the system adjust to a new normal. While this sounds trivial, the amount of effort this saves you in reconfigurations is really useful, particularly if you’re doing multiple code pushes in a day. 
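The thresholding and "new normal" behavior described above can be sketched in a few lines of Python. This is only an illustration of the general technique (a rolling window with bounds at the mean plus or minus a multiple of the standard deviation), not a description of how Moogsoft implements it; the class name, parameters and defaults are assumptions.

```python
from collections import deque
from statistics import mean, stdev


class AdaptiveThreshold:
    """Rolling baseline with bounds at mean +/- k standard deviations."""

    def __init__(self, window=30, k=3.0, min_samples=5):
        self.values = deque(maxlen=window)  # recent "normal" samples
        self.k = k
        self.min_samples = min_samples

    def bounds(self):
        """Return (low, high), or None while the baseline is too short."""
        if len(self.values) < self.min_samples:
            return None
        mu = mean(self.values)
        sigma = stdev(self.values)
        return mu - self.k * sigma, mu + self.k * sigma

    def is_anomaly(self, value):
        current = self.bounds()
        if current is None:
            return False
        low, high = current
        return value < low or value > high

    def observe(self, value, accept_as_normal=False):
        """Check a value against the current bounds.

        In-bounds values always join the baseline. Out-of-bounds values
        join it only when the operator marks them as expected, which is
        how the baseline drifts to a new normal.
        """
        anomalous = self.is_anomaly(value)
        if not anomalous or accept_as_normal:
            self.values.append(value)
        return anomalous


detector = AdaptiveThreshold(window=8, k=3.0, min_samples=5)
for latency in [100, 102, 98, 101, 99, 100, 103, 97]:
    detector.observe(latency)

spike_flagged = detector.observe(150)  # out of bounds after a code push

# The operator decides the change was expected, so later samples are
# folded into the baseline until the window reflects the new level.
for latency in [150, 152, 148, 151, 149, 150, 153, 147]:
    detector.observe(latency, accept_as_normal=True)

new_level_flagged = detector.observe(151)
old_level_flagged = detector.observe(100)
```

After the operator accepts the change, values near the new level stop raising anomalies and values near the old level start to, all without anyone editing a static threshold.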

Another one I particularly like is natural language processing. I alluded to the fact that I’ve been writing regexes for many years (it’s a necessary evil), and the ability to look for patterns in text strings is incredibly powerful when you’re dealing with human-written log messages, but it really comes into its own when it’s linked to things like tags or labels. If you have got anything like a structured naming convention for services or infrastructure components, that kind of correlation through linguistics is incredibly powerful.
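As a very simplified stand-in for the linguistic correlation mentioned above, the Python sketch below splits structured component names into word tokens and scores pairs of names by how many tokens they share (Jaccard similarity). The naming convention, the threshold and the function names are assumptions for illustration; real natural language processing goes well beyond token overlap.

```python
import re
from itertools import combinations

# Alphabetic tokens only, so instance numbers such as "01" are ignored.
TOKEN_RE = re.compile(r"[a-z]+")


def tokens(name):
    """Split a name like 'prod-checkout-web-01' into its word tokens."""
    return set(TOKEN_RE.findall(name.lower()))


def similarity(a, b):
    """Jaccard similarity of the two names' token sets."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def similar_pairs(names, threshold=0.5):
    """Return (name_a, name_b, score) for pairs at or above the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


# Hypothetical names following an <env>-<service>-<role>-<n> convention.
names = [
    "prod-checkout-web-01",
    "prod-checkout-web-02",
    "prod-checkout-db-01",
    "staging-search-web-01",
]
pairs = similar_pairs(names, threshold=0.5)
```

With a consistent naming convention, the three `prod-checkout` components are linked to each other, while the staging search host is not linked to any of them.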

What are the biggest challenges to implementing AIOps, and to maturing and managing alerts more efficiently?

Beal: I think that probably one of the biggest challenges is a cultural one and getting people to work in a different way.

Whitehead: It's a huge factor, yes. The thing about AIOps is that it’s a combination of technology and people. I’ve probably said it a few times but the human is an essential component of this mix. We need to identify what AIOps is, and what it isn’t. Like everything in DevOps, it’s about culture and changing behavior, and also using tools appropriately. That’s the biggest challenge. There are some relatively minor challenges like incorporating legacy code that has not been written with these kinds of technologies in mind, but there are ways around that. Ultimately, culture is the key component.

Coffee Break with Helen Beal will continue throughout March with sessions based on how intelligent observability impacts the lives of IT Ops teams (March 9) and SREs (March 25). RSVP now to save your seat.

Moogsoft is the AI-driven observability leader that provides intelligent monitoring solutions for smart DevOps. Moogsoft delivers the most advanced cloud-native, self-service platform for software engineers, developers and operators to instantly see everything, know what’s wrong and fix things faster.

About the author


David Conner

David is Moogsoft's Director, PR and Corporate Communications. He's been helping technology companies tell their stories for 15 years. A former journalist with the Sacramento Bee, David began his career helping the Bee's technology desk understand the rising tide of dot-com PR pitches clouding journalists' view of how the Internet was to transform business. An enterprise technology PR practitioner since his first day in the business, David started his media relations career introducing Oracle's early application servers and developer network to the enterprise market. His experience includes client work with PayPal, Taleo, Nokia, Juniper Networks, Brocade, Trend Micro and VA Linux/OSDN.
