Despite what it might look like from outside the industry, in IT we don’t change things around just because we feel like it. If there were such a thing as Sysadmin Club, Rule One would be “if it works, don’t touch it.” Change is not something to be undertaken lightly — but sometimes it is necessary.
In IT Operations, the main goal is to ensure that everyone else can do their jobs. No company has a goal of operating IT systems; rather, those IT systems are there to support some other business objective. Today, the IT systems and the business processes that they support are so intertwined that it can be hard to identify where one ends and the other begins.
In the last decade, there have been enormous changes in the business climate, and equally momentous ones in the technology world. Unfortunately, the link between those two worlds has not kept pace with the rate of change. Organizations are still managing incidents through outdated approaches such as email. A few months ago I wrote about a then-new Moogsoft customer, and the problems that they had encountered trying to manage incidents in their email inboxes.
At the time, our customer was focusing on the inefficiency of detecting that they had a problem in the first place, and understanding its impact. None of this was unique to them, and I have shared their experience with many other prospects since then, who have all recognized themselves (more or less) in that situation.
A Bridge Call Over Troubled Water
There is one other technology that comes up again and again when we talk about incident management, and that is the dreaded conference bridge.
If email is used as an alerting mechanism and a crude way of routing those alerts, the conference bridge is used at a later stage of the incident’s lifecycle. Once the support teams have worked out that the problem is not just within one domain, but spans several different areas (and therefore different teams’ areas of responsibility), the next step is to get everyone on a phone bridge.
This is often an excellent opportunity for a blame-storming session, but once that is done, people get down to work to diagnose and resolve the problem. This does work, but there are two main problems.
The most immediate problem is inefficiency. To paraphrase a saying from the media industry, 80% of the participants on the call are unnecessary — but nobody knows which 80%. When a major incident is declared, standard procedure is for everybody to dial in to the bridge. Even the worst incident will not actually affect every area, and many systems are only showing symptoms of a cause that is outside their area — but nevertheless, someone has to dial in from each of those teams, and stay on the bridge for the duration.
The longer the call lasts, the more people get added, as more and more systems are affected, and more stakeholders want to be updated on progress. Once the number of participants gets close to Dunbar’s Number, it’s all over; groups of humans simply cannot maintain useful communications beyond a certain size of group.
Who You Gonna Call?
Even when the incident is relatively short-lived, there is a high productivity cost to the organization. Most of the statistics around the impact of interruptions tend to focus on the cost of interrupting developers. This makes sense, because developers tend not to be interrupt-driven, but rather to focus on long-term tasks. Surely support and operations people are different, though? Is their entire role not to be interrupt-driven?
Certainly frontline support staff are reactive by definition, but if the whole operations team is purely reactive, there is a major problem. In particular, the higher levels of escalation — when an issue is especially complex or requires esoteric knowledge and skills — are going to require the involvement of people whose main job is not incident resolution. These people may well be working in a flow state that is severely affected by interruptions. At that point, the cost of interrupting a senior DBA becomes very similar to the cost of interrupting a senior developer. After all, these days infrastructure is code, right?
Longer term, the second problem with conference bridges is visibility. Where email at least has the virtue of being self-documenting, conference bridges are entirely undocumented. Even if they are recorded, the recording is not exactly easily searchable. When the incident is over and everyone hangs up, all the detailed cross-domain understanding that has been laboriously assembled over the duration of the call disappears into thin air, leaving a couple of cryptic notes in an incident record as the only trace of its fleeting existence.
This is hugely valuable information, required to support future decisions about how to operate IT and provide support to the business, but there is no good way to capture it and make use of it.
You Used to Call Me on the Telephone
The new model of proactive operations that Moogsoft enables shows us a different possibility.
First of all, instead of everyone reacting to individual events and working on them in isolation, Moogsoft puts those events into context, enabling operations teams to understand the true scope of the problem.
Secondly, instead of adding anyone and everyone to a conference bridge, Moogsoft will notify only specialists whose expertise is required for a particular Situation. This minimizes the number of people involved, and also avoids unnecessary escalations, where frontline staff escalate up the chain of command instead of reaching out horizontally to their peers in other teams.
Finally, as people collaborate in the Situation Room, all of their interactions are captured and made available for further analysis. This knowledge capture can take many forms. One is the identification of recurring Situations. When detecting a Situation, Moogsoft will flag any past Situations that are similar to the current one, and make those available to users, including the chat log which documents how that past Situation was resolved. All of this capture happens automatically; the only manual step is for operators to tag which step resolved the issue, so that their colleagues (or their own future selves!) do not need to trawl through the whole conversation history, but can jump straight to the fix.
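To make the idea of "similar past Situations" a little more concrete, here is a minimal sketch of one way such matching could work in principle. It is emphatically not Moogsoft's algorithm: the `PastSituation` record, the `similar_situations` function, the sample data and the 0.5 threshold are all hypothetical, and the sketch assumes that two incidents are alike when they affect largely the same set of services, which it measures with simple Jaccard similarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PastSituation:
    """Hypothetical record of a resolved incident (not a Moogsoft data type)."""
    situation_id: str
    affected_services: frozenset
    resolving_step: str  # the step an operator tagged as the fix


def jaccard(a, b):
    """Overlap between two sets: 0.0 means disjoint, 1.0 means identical."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_situations(current_services, history, threshold=0.5):
    """Return (score, situation) pairs at or above the threshold, best match first."""
    current = frozenset(current_services)
    scored = [(jaccard(current, past.affected_services), past) for past in history]
    matches = [pair for pair in scored if pair[0] >= threshold]
    return sorted(matches, key=lambda pair: pair[0], reverse=True)


# Illustrative history; the IDs, services and fixes are made up.
history = [
    PastSituation("S-101", frozenset({"db", "api", "web"}), "failed over the primary database"),
    PastSituation("S-102", frozenset({"mail"}), "restarted the mail relay"),
]

# A new incident touching the database and the API surfaces S-101 and its tagged fix.
for score, past in similar_situations({"db", "api"}, history):
    print(f"{past.situation_id} ({score:.2f}): {past.resolving_step}")
```

The point of the sketch is the payoff rather than the arithmetic: because the resolving step was tagged when the earlier incident was closed, a match hands the operator the fix directly instead of a conversation history to trawl through.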
In addition, as users work with the events within a Situation, one of the tasks they will perform is to identify the root cause of that Situation. Over time, Moogsoft will learn to identify Probable Root Causes, and start flagging those to operators in new Situations, complete with a confidence rating for each one.
This capability is also what eventually enables automated resolution. Initially, Moogsoft can propose a selection of actions for human operators. As the machine learning process continues and the classification of Situations becomes more and more reliable, that loop can be closed, enabling fully-automated resolution actions. This effectively uses the knowledge captured from operators’ actions to add a “tier zero” to deal with well-known recurring issues automatically, only escalating to a human operator for Situations that are out of the ordinary.
All of these are required capabilities for Gartner’s AIOps model:
- Proactive insight
- Intelligent notification
- Intelligent collaboration
- Workflow automation
- Causal analysis
- Decision support
Moogsoft will be at the Gartner Data Center Conference in London on November 28th and 29th, and in Las Vegas on December 5th through the 8th. Please stop by to find out more about how we can help you make the knowledge in your organization more widely available and accessible, and what that can do for your business.
About the author Dominic Wellington
Dominic Wellington is the Director of Strategic Architecture at Moogsoft. He has been involved in IT operations for a number of years, working in fields as diverse as SecOps, cloud computing, and data center automation.