In the first article in this series I explained the fundamental tradeoff of site reliability engineering (SRE, which also stands for site reliability engineer) – the tradeoff between reliability on the one hand and deployment velocity and cost on the other.
My colleague Charles Araujo, in the second article, explained how observability differs from monitoring and how it is essential for the work of site reliability engineers.
In this article, in turn, I discuss the missing piece of this puzzle: automation. Automation certainly frees up SREs' time for more valuable activities – but there is more to automation than a straightforward increase in productivity.
As enterprises implement cloud-native solutions at increasing levels of scale and velocity, SRE becomes an increasingly difficult challenge. Observability tools are certainly necessary to address such challenges, but by no means sufficient.
Without adequate automation, operations teams simply have no hope of keeping up with increasingly dynamic, scaled-out technology deployments.
Automating the Right Things and Automating Them Right
Why do we need humans in the roles of SREs, anyway? Shouldn’t we be able to automate all of them out of their jobs, implementing ‘lights out’ data centers instead?
Certainly, an IT shop with no people in it wouldn’t require lighting, but there are a number of reasons why this extreme view of IT operations automation is unrealistic.
First, automations deal best with the ‘happy path’ – that is, the expected behavior of the system being automated. SRE, in contrast, focuses on just those situations where the systems under management are not behaving properly.
As Araujo explained in the last article, observability gives SREs insight into the unknown unknowns – unpredictable incidents that by their nature don’t lend themselves to any kind of proactive planning, including automation.
Secondly, modern IT environments are complex enough that problem incidents may not have single, clear-cut causes. Instead, multiple interrelated causes are at the root of many issues, and thus the solutions require a measure of creativity on the part of SREs.
The most important limitation of automation, however, is the fact that the most valuable work of SREs is more proactive and strategic: thinking about ways to prevent problems that involve multiple systems and people.
For SREs to have the time and energy for such strategic efforts, however, it’s important that crises don’t cause them to spend all of their time firefighting. Reducing such firefighting, fortunately, is well within the scope of automation.
How Moogsoft Handles Automation
Moogsoft has long applied AI (machine learning in particular) as well as advanced correlation in order to detect incidents, even before they occur. In fact, the company has been applying AI for service assurance since before Gartner coined the term AIOps.
Today, Moogsoft also applies its expertise in AI to automated knowledge capture and recycling. This capability notifies operators of past incidents that are similar to current ones while providing all relevant information necessary to resolve those incidents.
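To make the idea of surfacing similar past incidents concrete, here is a minimal sketch that compares incident descriptions by word overlap (Jaccard similarity). It is an illustration only and not a description of how Moogsoft does it: the `Incident` fields, the `similar_incidents` function, and the 0.4 threshold are all assumptions chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical fields, chosen for illustration only.
    incident_id: str
    description: str
    resolution: str = ""


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared words divided by all distinct words (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_incidents(current: Incident, history: list[Incident],
                      threshold: float = 0.4) -> list[tuple[float, Incident]]:
    """Return past incidents at least `threshold` similar, best match first."""
    current_tokens = tokens(current.description)
    scored = [
        (jaccard(current_tokens, tokens(past.description)), past)
        for past in history
    ]
    matches = [(score, past) for score, past in scored if score >= threshold]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return matches


if __name__ == "__main__":
    history = [
        Incident("INC-1", "database connection pool exhausted on checkout service",
                 "raised pool size"),
        Incident("INC-2", "disk full on logging host", "rotated logs"),
    ]
    current = Incident("INC-3", "connection pool exhausted on payment service")
    for score, past in similar_incidents(current, history):
        print(past.incident_id, round(score, 3), past.resolution)
```

A production system would likely use richer signals than raw word overlap, but the shape is the same: score the current incident against history and hand the operator the closest matches along with how they were resolved.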
Moogsoft’s automation capabilities go further than these AI-driven examples. Moogsoft also supports an API-first strategy for custom integrations that support its automation capabilities. Enabling custom integrations with REST APIs and webhooks (custom callbacks) is straightforward.
These APIs, in turn, support automations by empowering operators to retrieve and update data from other applications and systems, including statistics for reporting and dashboards.
In fact, Moogsoft’s APIs form a superset of its UI capabilities. Operators can perform every action in the UI via these APIs, including configurations, adjusting monitoring thresholds, changing workflows, etc.
Moogsoft also provides multiple inbound APIs: Metrics APIs, to which operators can send time-series metrics, and Events APIs, which receive JSON-formatted events.
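As a rough sketch of what sending a JSON-formatted event to an inbound REST API can look like, the example below builds an event and wraps it in an HTTP POST request using only the Python standard library. Everything specific here is an assumption made for illustration: the endpoint URL, the `X-Api-Key` header name, and the event field names are hypothetical and are not taken from Moogsoft's documentation.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical endpoint and header name, for illustration only.
EVENTS_URL = "https://example.invalid/events"
API_KEY_HEADER = "X-Api-Key"


def build_event(source, check, severity, description):
    """Assemble a JSON-serializable event. Field names are assumptions."""
    return {
        "source": source,
        "check": check,
        "severity": severity,
        "description": description,
        "time": datetime.now(timezone.utc).isoformat(),
    }


def build_request(event, api_key):
    """Wrap the event in an HTTP POST request without sending it."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        EVENTS_URL,
        data=body,
        headers={"Content-Type": "application/json", API_KEY_HEADER: api_key},
        method="POST",
    )


if __name__ == "__main__":
    event = build_event("web-01", "cpu", "critical", "CPU above 95% for 5 minutes")
    request = build_request(event, api_key="REPLACE_ME")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually send it against a real endpoint:
    # urllib.request.urlopen(request, timeout=10)
```

Separating request construction from sending keeps the payload easy to inspect and test before any network call is made.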
In addition, the platform integrates orchestration and run books, making them available to operators via either partial or full automation.
Enterprises have been using run books for years, but the challenge they present is knowing which run books to use in which situations. Moogsoft leverages machine learning to recommend the appropriate run book.
The most important aspect of Moogsoft’s automation strategy, however, is the fact that it is focused on augmenting the work of SREs and other operators, rather than simply taking tasks off their plates, as the illustration below shows.
Automation in Moogsoft (source: Moogsoft)
As the figure above shows, once Moogsoft identifies important incidents and ranks them by severity, it then automatically coordinates interactions with various personnel. The goal of this collaboration-centric approach to automation is not to take people out of the loop. Rather, its intent is to take ‘grunt work’ – less valuable, easily automatable tasks – off the plates of ops personnel, giving them more time to focus on the more challenging aspects of incident resolution.
Such a strategy goes hand-in-hand with Moogsoft’s support for observability. SREs gain automation-supported collaboration and control over the ops environment, along with the visibility into network, system, and application behavior that they need to resolve existing problems and to prevent new ones from occurring.
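The rank-then-coordinate flow described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Moogsoft's implementation: the severity names, the `owners` mapping, and the rule that lower-severity incidents go to automated remediation are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical severity scale: a lower number means more urgent.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "warning": 3}


@dataclass
class Incident:
    incident_id: str
    severity: str
    service: str


def triage(incidents, owners, auto_handle=("minor", "warning")):
    """Rank incidents by severity and decide who, or what, handles each."""
    ranked = sorted(
        incidents,
        key=lambda i: SEVERITY_RANK.get(i.severity, len(SEVERITY_RANK)),
    )
    plan = []
    for incident in ranked:
        if incident.severity in auto_handle:
            # Grunt work: hand off to an automated run book.
            action = "run automated remediation"
        else:
            # Keep people in the loop for the harder cases.
            team = owners.get(incident.service, "on-call SRE")
            action = f"notify {team}"
        plan.append((incident.incident_id, action))
    return plan


if __name__ == "__main__":
    incidents = [
        Incident("A", "minor", "web"),
        Incident("B", "critical", "db"),
        Incident("C", "major", "cache"),
    ]
    for incident_id, action in triage(incidents, {"db": "database team"}):
        print(incident_id, action)
```

The point of the design is the split itself: routine incidents are handled automatically, while the serious ones reach the right people faster.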
The Intellyx Take
The diagram above is but a snapshot in time of how automation can drive incident resolution. But make no mistake – over time, the environment in the illustration will grow more complex, as performance and delivery velocity requirements become ever more onerous.
Given this dynamic state of affairs, automation becomes increasingly essential, and will become an ever more important part of incident management. Human limitations can easily become a bottleneck, as there are only so many hours in a day and so many qualified people any organization can hire.
The writing is on the wall. The companies that will succeed in the future, as cloud-native computing and edge computing hit their stride, will be the ones that can best leverage automation to resolve incidents quickly and to prevent them from slowing the business down.
Copyright © Intellyx LLC. Moogsoft and ServiceNow are Intellyx customers. None of the other companies mentioned in this article is an Intellyx customer. Intellyx retains final editorial control of this article.