Employing AIOps for observability, monitoring and service assurance frees developers to focus on building better services.
This is the second in a series of blog posts by Adam Frank examining the role of AIOps in delivering better customer experiences.
In my last post, I explained how and why great customer experiences depend on managing complexity and change. Now let’s examine three essential truths on the road to delivering great customer experiences.
Those of us who deploy AIOps should understand three essential truths. First, we need to understand traditional monitoring and its shortcomings. Second, we need to understand how we perceive actionable information. Third, we must acknowledge that humans cannot pre-specify all decisions and actions. Understanding these points must begin with the initial design of our service delivery and span the entire lifecycle.
1. Traditional Monitoring And Its Shortcomings
Traditionally, monitoring tools generate alerts based on simple user-defined thresholds or simple state changes. This, in combination with some type of notification system, wakes us when we’re on-call or informs us that our attention is required… typically when we’re busy developing cool new customer experiences! These tools typically provide us a signal (e.g., up/down, disk full, high CPU, high latency, transaction failed), which is very good for identifying a symptom or symptoms of a problem. They do not, however, identify the root cause of a problem, nor are they proactive.
In fact, this approach is inadequate for tracking the health of systems that emit billions of metrics, some of which follow diurnal, weekly, monthly or yearly periodicities. Adding manual threshold logic for anything with any type of periodicity would be an absolute nightmare. Static thresholds often lead to:
- false negatives: thresholds set too loose to account for periodicity or growth, leaving issues undetected for extended periods of time; or
- false positives: thresholds set too tight, alerting constantly and leading to alert fatigue
We tend to alter thresholds after this occurs, only to discover down the road that they still aren’t set correctly. All of these alerts, events and signals, whether false negatives or false positives, derive from time-series metrics, logs and traces: our observability and monitoring data. Once we receive this data, deeper analysis is needed to understand the problem and its root cause.
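To make the shortcoming concrete, here is a minimal sketch (hypothetical metric values and helper names, not any particular product’s logic) contrasting a static threshold with a baseline-relative check that compares a value against the history for the same time slot, so an expected daily peak doesn’t page anyone:

```python
import statistics

def static_alert(value, threshold=80.0):
    """Naive static threshold: fires whenever the metric exceeds a fixed value."""
    return value > threshold

def adaptive_alert(value, history, z=3.0):
    """Baseline-relative alert: fires only when the value deviates more than
    z standard deviations from the historical values for this time slot."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > z

# CPU peaks near 85% every afternoon; a static 80% threshold pages daily,
# while the baseline-relative check stays quiet for the expected peak.
daily_2pm_cpu = [83, 86, 84, 87, 85, 84, 86]  # last week of 2 p.m. samples
print(static_alert(85))                   # True  -> false positive every day
print(adaptive_alert(85, daily_2pm_cpu))  # False -> normal for this hour
print(adaptive_alert(95, daily_2pm_cpu))  # True  -> genuine deviation
```

The point isn’t the specific statistics; it’s that the alerting logic has to know what “normal” looks like for a given time of day before it can say anything useful.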
2. The Need to Perceive Actionable Information
So we receive a notification informing us that our attention is required, and we start our deeper analysis. What has changed? We take into consideration the time of day, week, month and year.
For example, an online shopping cart service has significantly more priority if it fails on Black Friday than it does on a random Tuesday afternoon in March. Now imagine tax services failing in the middle of tax season as compared to in summer or autumn. Not good.
We ask, are there new dependencies? Could there be world influences? What’s the user sentiment? Can we determine an inference based on the cues we’ve observed?
We conduct all of this analysis by opening dashboards. We begin looking at the alerts and how they are related, how they comprise an incident. What signals or cues do each of them provide us? We start to run ad-hoc queries looking at the deeper internals of the system which come from our logging, distributed tracing and metrics. We attempt to understand the context and causality and determine the root cause.
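As a toy illustration of one such ad-hoc query (the log format, timestamps and service names here are invented for the example), we might count error lines per service inside the incident’s time window to see where the symptoms concentrate:

```python
from collections import Counter

def errors_by_service(log_lines, start, end):
    """Ad-hoc query sketch: count ERROR log lines per service within the
    incident's time window. Assumed line format: '<ts> <service> <level> <msg>'."""
    counts = Counter()
    for line in log_lines:
        ts_str, service, level, *_ = line.split()
        if start <= int(ts_str) <= end and level == "ERROR":
            counts[service] += 1
    return counts

logs = [
    "100 checkout ERROR timeout calling payments",
    "105 payments ERROR db connection refused",
    "106 payments ERROR db connection refused",
    "300 search INFO reindex complete",
]
print(errors_by_service(logs, 90, 200))
# Counter({'payments': 2, 'checkout': 1})
```

Even this trivial query encodes context (the time window) and begins to suggest causality (checkout’s timeout points at payments), which is exactly the kind of reasoning we do by hand across dashboards.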
Did we find a cue? A cue is a piece of information or circumstance that aids the memory in retrieving details, leading us to action. We need to find that hint or indication about how to behave in particular circumstances. But the knowledge and expectancies a person has determine what counts as a cue, and whether it will be noticed at all.
Here’s the kicker. While we are doing our investigation and analysis, while we are troubleshooting and observing our dashboards, we most likely have a handful of other people doing the exact same thing without even realizing we’re all working on the same problem. We need to incorporate the cues and observations of others into our analysis. We need to collaborate. But in order to collaborate, we first need shared context and correlated data to build a common understanding.
3. Humans Cannot Pre-specify All Decisions & Actions
Now hold on a minute. Pump the brakes.
Isn’t safety coded in the design? Can’t outages be avoided with more automation? Can’t procedures be more objective and comprehensive?
Yes, absolutely, when done correctly. These, along with resilience engineering, chaos engineering and cognitive engineering (also known as “cognitive performance”), will absolutely help lessen the cracks in the customer experiences we are providing.
But we also need to understand that rules are always underspecified, and therefore cannot guarantee context and outage avoidance without interpretation. In addition, events in our environment require decision making and action taking that cannot be pre-specified by humans.
So how can decisions be made and actions taken if humans cannot pre-specify? Can we triangulate our own cognition and make inferences using mathematics? Can we make sense of our dashboards? Can we automate our incident resolution workflow to assure the quality of all customer experiences?
These are exactly the promises of the diagnostics that AIOps applies.
Applying AIOps for Continuous Service Assurance
AIOps infers meaningful context and probable root cause by understanding human cognitive processes and inferences, applying statistical calculations and algorithms to data discovery and analysis, and employing multivariate analysis to understand significance, impacts and relationships.
So, the answer to the big question posed is Yes. We can triangulate our own cognition and make inferences using mathematics, and then automate incident resolution workflows to assure the quality of each customer experience.
Moogsoft starts by analyzing time-series and metric data, including telemetry (log, metric, trace) streams. These help surface anomalies, both individually and as environmental combinations and clusters. Then we combine the feeds from all the observability sources to separate signal from noise and eliminate non-significant events from the logs, metrics and traces. The remaining meaningful alerts are then clustered by inspecting their textual, temporal, impact and topological metadata, forming a contextual incident and making sense of your dashboards.
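A greatly simplified sketch of that kind of temporal-plus-textual correlation (the alert texts, time window and similarity cutoff are invented for illustration; a real AIOps platform uses much richer metadata and models) might group alerts like this:

```python
from difflib import SequenceMatcher

def similar(a, b, cutoff=0.6):
    """Rough textual similarity between two alert descriptions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff

def cluster_alerts(alerts, window_s=120):
    """Group alerts that arrive close together in time AND describe similar
    symptoms -- a crude stand-in for temporal + textual correlation."""
    clusters = []
    for ts, text in sorted(alerts):
        for cluster in clusters:
            last_ts, last_text = cluster[-1]
            if ts - last_ts <= window_s and similar(text, last_text):
                cluster.append((ts, text))
                break
        else:
            clusters.append([(ts, text)])
    return clusters

alerts = [
    (0,  "high latency on checkout-service"),
    (30, "high latency on checkout-service pod-2"),
    (45, "disk full on db-archive"),
    (70, "high latency on checkout-service pod-3"),
]
for cluster in cluster_alerts(alerts):
    print(len(cluster), cluster[0][1])
# 3 high latency on checkout-service
# 1 disk full on db-archive
```

Three symptom alerts collapse into one contextual incident while the unrelated disk alert stays separate, which is the noise reduction the paragraph above describes.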
As incident root causes are identified and feedback is applied to a neural network, a cycle of learning begins that predicts the most plausible root causes of newly detected incidents. Now that there are context and cues, collaboration is encouraged to resolve the incident. The collaboration and resolution are recycled and fed back in to calculate similarity with past incidents, giving an even better starting point the next time a similar incident occurs.
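One simple way to picture that “similar past incident” lookup (purely illustrative: the incident IDs, alert signatures and Jaccard measure are assumptions, not Moogsoft’s actual algorithm) is matching the current incident’s set of alert signatures against resolved ones:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of alert signatures."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def most_similar_past_incident(current_alerts, past_incidents):
    """Return the resolved incident whose alert signature set best matches
    the current one, so its resolution notes can seed triage."""
    return max(past_incidents,
               key=lambda p: jaccard(current_alerts, p["alerts"]))

past = [
    {"id": "INC-101", "alerts": {"high_latency", "db_conn_timeout"},
     "resolution": "increase connection pool"},
    {"id": "INC-102", "alerts": {"disk_full", "log_rotation_failed"},
     "resolution": "expand volume"},
]
current = {"high_latency", "db_conn_timeout", "cache_miss_spike"}
print(most_similar_past_incident(current, past)["id"])  # INC-101
```

Each resolved incident enriches the corpus, so the next lookup starts from a better place; that is the continuous learning cycle in miniature.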
This allows a continuous learning cycle to improve operator efficiency. It allows focusing more on developing customer experiences and less on operating them. Ultimately, it assures the quality of the customer experience.
Digital transformation can be better managed by employing AIOps throughout observability, monitoring and service assurance.
Allow AI to operate, and you are free to focus on building better services for better customer experiences!
About the author
Adam Frank is a product and technology leader with more than 15 years of AI and IT Operations experience. His imagination and passion for creating AIOps solutions are helping DevOps and SREs around the world. As Moogsoft’s VP of Product & Design, he's focused on delivering products and strategies that help businesses to digitally transform, carry out organizational change, and attain continuous service assurance.