Don’t let your DevOps universe devolve into chaos. Combining AIOps with observability is the answer.
Rockets constantly blast off into space headed towards planets, aiming to create shiny new stars, while meteors whizz by them, threatening their journeys. That’s how global DevOps expert Helen Beal describes the complicated and risky universe of DevOps practitioners and SRE teams.
The rockets are these teams’ frequent code releases. Planets represent customers that benefit from the value — stars — created by these launches. The meteors, of course, are the unexpected IT incidents that trigger problems and outages.
“Our universe is suddenly getting very crowded,” Beal, DevOps Institute Chief Ambassador, said during the webinar ‘Telemetry Everywhere: Observability in the DevOps Cosmos.’ “It’s an incredibly complex environment.”
Thus, DevOps and SRE teams need a telescope that’s “very, very clever,” she said, meaning it must go beyond looking at stars and planets, and instead “turn all this data into actionable insights.”
That’s the difference between traditional monitoring and the combination of AIOps with observability. “We’re helping our DevOps teams by giving them extra intelligence into what’s happening in their environments so they can resolve problems faster,” Beal said.
The AIOps with observability difference
To make sense of this dynamic and challenging cosmos, DevOps and SRE teams need an AIOps platform that covers five key dimensions:
- Data selection, to separate the important data from the irrelevant “noise”
- Pattern discovery, to correlate significant alerts and connect the dots
- Inference, to identify root causes
- Collaboration, to facilitate work within teams and across teams
- Automation, to automatically trigger corrective actions
“This gets us to a place where we can perform continuous service assurance,” said Beal, who is also a DevOps and Ways of Working coach.
In this scenario, DevOps and SRE teams can deploy code more frequently and safely, while preventing problems and smoothing out workflows. They turn data into actionable insights, and capture and recycle knowledge, she said.
More productive and freed from onerous manual tasks, they have more time to innovate, release more apps, create more value, delight customers and boost profits.
“The most overlooked competitive advantage is uninterrupted service,” Beal said. “These performance elements help businesses win.”
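To make the pattern discovery dimension more concrete, here is a minimal Python sketch that groups alerts arriving close together in time, on the assumption that a burst of alerts often shares one underlying cause. The `Alert` class, the `correlate` function and the 60-second window are hypothetical names and values chosen for illustration. Time proximity is only a simple stand-in for the richer correlation an AIOps platform would apply; it is not a description of how any particular product works.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    timestamp: float  # seconds since the epoch
    service: str
    message: str


def correlate(alerts: List[Alert], window_seconds: float = 60.0) -> List[List[Alert]]:
    """Group alerts whose timestamps are within window_seconds of the
    previous alert in the same group (illustrative heuristic only)."""
    groups: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(alert)  # close enough in time: same candidate incident
        else:
            groups.append([alert])  # gap too large: start a new candidate incident
    return groups


alerts = [
    Alert(0.0, "db", "high latency"),
    Alert(30.0, "api", "timeouts"),
    Alert(50.0, "web", "5xx errors"),
    Alert(500.0, "batch", "job failed"),
]
incidents = correlate(alerts)
```

With the sample data, the first three alerts land in one group and the late batch failure in another, so a team would look at two candidate incidents instead of four separate alerts.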
Helping DevOps and SRE teams evolve and improve
Adam Frank, Moogsoft’s VP of Product and Design, said DevOps and SRE teams can no longer rely on legacy monitoring tools, which depend on static thresholds and therefore can’t scale or generate actionable insights.
“To make sense of that data, to reduce toil and to improve value, you must apply multiple layers of AI, which is why observability needs AIOps. That vast amount of ‘telemetry everywhere’ data needs AIOps,” he said.
With these goals in mind, Moogsoft developed Moogsoft Express, a cloud-native AIOps and observability solution that gives DevOps and SRE teams visibility into and control of their CI/CD pipelines, and supports continuous software delivery and business agility.
Moogsoft Express ingests and enriches observability data, such as metrics, traces, and logs; reduces noise to detect anomalies; correlates alerts; discovers causality; and facilitates collaboration.
“Moogsoft Express supports the ‘we build it, we own it’ culture of DevOps and SRE teams,” he said.
Watch a recording of ‘Telemetry Everywhere: Observability in the DevOps Cosmos’ to get all the details, best practices, and insights shared by Beal and Frank. The webinar also included a Q&A with the audience. Below are questions the speakers answered in writing after the webinar.
Why do I need this as well as SolarWinds and Splunk and everything I’ve already bought?
Frank: It will help drive more value out of those tools and give you ease of use and insights you struggled to get or couldn’t get at all before. Moogsoft Express is about real-time analysis and insights so you can be proactive and restore services faster.
Beal: It’s great that you’ve invested in your monitoring tools, but you’re probably still drowning in data and feeling some alert fatigue. Our human brains can only process so much data; we have cognitive load constraints after all. It takes effort to pull reports and logs, and to read and interpret them.
Observability and AIOps together can take on all this toil for us: they crunch all that data from your existing tools, find the patterns and make the inferences. You can get straight to fixing the problem and skip the part of working out what it is, because that’s already been done for you.
Yes, it requires a bit more money, but you’ll justify that expense with the time you’ll save searching for answers, and that time injects capacity back into your teams that you can spend innovating new value outcomes for your customers.
Don’t forget to hook it into Slack or your ChatOps platform so that your teams can collaborate in real time on your CI/CD processes through your route to live, and for incident management and blameless/healthy retrospectives.
You mentioned intelligence in Moogsoft Express’ Collector agent. Can you expand on the “intelligence”?
Frank: By intelligence I mean the ML and AI that conduct portions of the real-time analysis. The Collector not only collects the data but also analyzes it in real time to establish and adapt to the normal operating behavior, predict and forecast the future operating behavior, and flag anomalies when the data deviates from the normal behavior.
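As a rough illustration of what “establish and adapt to the normal operating behavior” can mean, the Python sketch below keeps an exponentially weighted moving average and variance as an adaptive baseline, then flags values that stray too far from it. This is a generic textbook technique, not a description of the Collector’s actual algorithms. The function name `detect_anomalies` and the `alpha`, `threshold` and `warmup` parameters are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def detect_anomalies(
    values: Sequence[float],
    alpha: float = 0.1,      # how quickly the baseline adapts (assumed value)
    threshold: float = 3.0,  # deviations beyond this many std devs are flagged
    warmup: int = 10,        # points used to learn a baseline before flagging
) -> List[Tuple[int, float]]:
    """Return (index, value) pairs that deviate from an adaptive baseline."""
    mean = None
    var = 0.0
    anomalies: List[Tuple[int, float]] = []
    for i, x in enumerate(values):
        if mean is None:
            mean = float(x)  # the first point seeds the baseline
            continue
        std = math.sqrt(var)
        if i >= warmup and std > 0 and abs(x - mean) > threshold * std:
            anomalies.append((i, x))
        # Update the baseline on every point so it adapts to lasting shifts.
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return anomalies


cpu_percent = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 9, 50, 10]
spikes = detect_anomalies(cpu_percent)
```

Unlike a static threshold, the cutoff here moves with the data: a value that counts as normal for one host can be flagged on another whose usual range is narrower.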
Do I have to do anything differently to use Moogsoft Express, or to practice my monitoring habits before production?
Frank: No, nothing. You should practice observability and monitoring and allow AIOps to analyze the data to surface anomalies and incidents so you can address them before your code goes out to customers.
Are there metrics you recommend starting with?
Frank: Yes, infrastructure metrics. You get these for free. You’ll have to emit your own custom application metrics, but infrastructure already emits metrics for CPU, disks, network I/O, memory and more.
When anomalies surface from infrastructure metrics, you’ll get advance warning of a potential real issue. By practicing this in your pre-production environment, you’ll even be able to code your applications and services to be more resilient to infrastructure-type degradations and failures.
What types of data can you process, and can you handle large-scale data sets?
Frank: We process any time-series metrics data, which is essentially a timestamp and a numerical value, along with event data, including changes, derived from your logs, other monitoring systems, or CI/CD pipelines.
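For readers who want to picture those two data shapes, here is a minimal Python sketch. The `MetricPoint` and `Event` classes, their field names and the `metric_from_epoch` helper are hypothetical and do not reflect any Moogsoft Express schema or API. They only mirror the description above: a metric is a timestamp plus a numerical value, and an event records something that happened, such as a change.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass
class MetricPoint:
    name: str            # e.g. "cpu.utilization" (illustrative name)
    timestamp: datetime  # when the sample was taken
    value: float         # the numerical reading
    tags: Dict[str, str] = field(default_factory=dict)  # optional context


@dataclass
class Event:
    source: str          # e.g. "logs", "monitoring" or "ci/cd"
    timestamp: datetime
    description: str     # what happened, such as a deployment or config change


def metric_from_epoch(name: str, epoch_seconds: float, value: float) -> MetricPoint:
    """Build a MetricPoint from a Unix timestamp, normalized to UTC."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return MetricPoint(name=name, timestamp=ts, value=float(value))


point = metric_from_epoch("cpu.utilization", 1_700_000_000, 73.5)
deploy = Event("ci/cd", point.timestamp, "deployed a new build of the checkout service")
```

Keeping metrics and events on a shared UTC timeline is one simple way to line up a change, such as a deployment, with the metric behavior that follows it.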