Given the complexity and breadth of the IT environment in most large enterprises, it’s no surprise that IT operational organizations need to split around domain specialties and, furthermore, by levels of expertise. The escalation processes for incident management across the various domain silos will vary in steps and duration. So if you’re trying to get teams to work better and more efficiently together, you’ll reach an artificial limit if your workflow is based on processing individual alerts. To break through, you’re going to need to move to a situation-based workflow. And it’s really a matter of watching the clock. Why? I’ll start with a quick backgrounder and then I’ll get to the details. Time waits for no one, so let’s go!
Within event management systems, you have event-/alert-based facilities. These facilities process incoming raw event data, tokenize it, enrich it, and present it to users. In some instances, these systems have to be preconfigured to accept new event feeds, but some can also accept wildcarded, generic formats that include the details of the events.
In reality, these alerts are triggers that initiate workflow or process actions. In effect, they present to the support operations personnel a piece of information that is supposed to represent the knowledge necessary to respond to a condition, fault, or anomaly.
In many cases, years have been spent developing rule sets to get the facilities somewhat tuned for a given environment. The information fitting is done only for events deemed as actionable. The rest of the events that are not adequately prepared, or categorized as actionable, are discarded.
For example, a typical architectural event flow that can be deployed is as follows:
From the left, managed entities and monitoring tools forward their events to the Event Processing facility. This is typically where the rules are applied to the raw incoming data to tokenize it, transforming it into information.
A middle facility is then used to add in Event Enrichment elements to the event information. Data elements such as processes, services, locations, customer data, etc., all play a very pertinent role in preparation of the event information.
Finally, the rightmost stage is the Event Presentation. Within this, event lists are used to present the event information to users and processes. In essence, this is where events initiate workflow; whether it’s ticketing, run book automation, or manual process initiation.
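The three stages above can be sketched in a few lines of Python. This is only an illustration of the flow, not the code of any particular event manager: the `host | severity | message` line format, the `ENRICHMENT` lookup table, and the function names are all assumptions made up for the example.

```python
from dataclasses import dataclass, field

# Hypothetical enrichment data keyed by host name (assumption for illustration).
ENRICHMENT = {
    "db01": {"service": "billing", "location": "Building A"},
}


@dataclass
class Event:
    host: str
    severity: str
    message: str
    enrichment: dict = field(default_factory=dict)


def process(raw: str) -> Event:
    """Event Processing: tokenize a raw 'host | severity | message' line."""
    host, severity, message = raw.split("|", 2)
    return Event(host.strip(), severity.strip().lower(), message.strip())


def enrich(event: Event) -> Event:
    """Event Enrichment: attach service and location data when it is known."""
    event.enrichment = dict(ENRICHMENT.get(event.host, {}))
    return event


def present(events, actionable=("critical", "major")):
    """Event Presentation: keep only events deemed actionable; discard the rest."""
    return [e for e in events if e.severity in actionable]


raw_feed = [
    "db01 | CRITICAL | disk /var/lib/mysql 98% full",
    "web07 | INFO | config reloaded",
]
event_list = present(enrich(process(line)) for line in raw_feed)
```

Note that the informational event from `web07` never reaches the event list. That is the discard behavior described earlier: anything not categorized as actionable is dropped before a person sees it.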
For larger enterprises with complex IT environments, you realize that the IT Support organization is not flat, but quite structured. As such, a common support model used is as follows:
As incidents are fielded, organizations often set a goal that a very high percentage be resolved at Level 1. Incidents that cannot be resolved at Level 1 get escalated to Level 2. Incidents that cannot be resolved at Level 2 get escalated to Level 3.
In some systems, the transition of events from one level of support to another works by sequencing the filter criteria for the events. So, an event initially is presented to Level 1. A tool menu item enables the “movement” of the event from Level 1 to Level 2, and likewise, on to Level 3.
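To make the filter-based escalation concrete, here is a minimal sketch. It assumes each event carries a `support_level` field and that each level’s event list is simply a filter on that field; the field name, the functions, and the sample events are hypothetical, not taken from a specific product.

```python
from dataclasses import dataclass

MAX_LEVEL = 3


@dataclass
class Event:
    event_id: int
    summary: str
    support_level: int = 1  # new events start in the Level 1 view


def view_for(level, events):
    """Each level's event list is a filter on the support_level field."""
    return [e for e in events if e.support_level == level]


def escalate(event):
    """Stand-in for the tool menu item: bump the level so the event
    drops out of one filtered view and appears in the next."""
    if event.support_level >= MAX_LEVEL:
        raise ValueError("already at the highest support level")
    event.support_level += 1
    return event


events = [
    Event(1, "link down on an access switch"),
    Event(2, "replication lag on a database server"),
]
escalate(events[1])  # Level 1 could not resolve event 2
level1_view = view_for(1, events)
level2_view = view_for(2, events)
```

Nothing actually moves in this model. The event stays in one list and only its level changes, which is why each level sees the event in isolation from whatever the other levels are working on.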
Lost in a Time Warp with Event-Based Workflow
Looking at the elapsed time in the aforementioned flow, events enter the system and can take between 30 and 60 seconds to go from processing to presentation. In effect, the time taken from when an event is first created to when it’s presented to humans for workflow initiation can be upwards of 120 seconds. This delay is huge: the time at which a person sees an event can be skewed by 2 minutes or more from the time the managed host first emitted it.
Additionally, once an event is presented to Level 1, it may take a bit more time for the personnel to respond to the event and start the workflow. As each event is processed and escalated, certain events may take several minutes before they are close to diagnosis and correction.
Now consider this: the event escalation time for network support may be different from database or application support.
And what happens when you have multiple events going on at the same time? Are DevOps folks aware of events related to a network outage? What if the network outage is causing the application to not function?
When you plug in your directed run book automation, do you ever execute on false or side effect events?
Save Time with Incident.MOOG and Situation-Based Workflow
When you look at a process flow using Incident.MOOG, the ingress point into workflow is a Situation. Situations are automated groupings of related events and alerts around anomalous conditions. Events, on the other hand, are discrete triggers depicting error or status condition changes in time. Situations inherently incorporate some level of machine learning and event grouping, and are inherently time conscious.
The huge difference is that within a Situation, you can have mixed event and alert types. Network traps, syslog messages, Nagios alerts, APM alarms, even email messages: all can be clustered into Situations that have related components. With these related components comes situational awareness of the whole operation, rather than awareness of only discrete events.
From a workflow perspective, you want to initiate tickets based on a Situation (the overall operational condition), rather than on individual events. Situations also have the contextualized narrative to cross multiple support organizations; awareness of all experts is right there and immediately available.
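As a rough illustration of ticketing on Situations rather than on events, the sketch below groups alerts from mixed sources whenever they arrive close together in time, then opens one ticket per group. This simple time-gap rule is a stand-in chosen for the example; it is not how Incident.MOOG forms Situations, and the `Alert` fields, the 120-second gap, and the sample alerts are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since an arbitrary start
    source: str       # e.g. snmp-trap, syslog, nagios, apm, email
    summary: str


def group_by_time_gap(alerts, max_gap=120.0):
    """Start a new group whenever the gap to the previous alert exceeds max_gap."""
    situations = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if situations and alert.timestamp - situations[-1][-1].timestamp <= max_gap:
            situations[-1].append(alert)
        else:
            situations.append([alert])
    return situations


def open_tickets(situations):
    """One ticket per Situation rather than one per alert."""
    return [
        {
            "ticket": number,
            "alert_count": len(group),
            "sources": sorted({a.source for a in group}),
        }
        for number, group in enumerate(situations, start=1)
    ]


alerts = [
    Alert(0, "snmp-trap", "link down to Building A"),
    Alert(45, "syslog", "replication connection lost"),
    Alert(90, "nagios", "disk usage critical on primary database"),
    Alert(150, "apm", "application threads blocked on database"),
    Alert(3600, "email", "scheduled backup report"),
]
tickets = open_tickets(group_by_time_gap(alerts))
```

With this sample data, the four alerts that arrive within a few minutes of each other end up on a single ticket that lists all four sources, while the unrelated email an hour later gets its own.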
In the figure above, your actual workflow is unfolding in your IT environment as your shift progresses. The workflow ❶ is related to a database instance performing badly. The workflow ❷ relates to a network outage to Building A. The workflow ❸ is for a service poll that is getting out of bounds. And the workflow ❹ is for an application failure.
In real life, these all may be related or not. In legacy event management systems, it’s rare that one would know or be aware of this. Even in the case of a Help Desk, different services may necessitate the use of Specialists. This is also true of Managed Services Providers (MSPs). As the skill sets become even more specialized, awareness of things happening outside one’s own sphere of influence fades.
In the case above, the network outage broke connectivity to a backup database server in the middle of replication. During the course of this occurrence, the binlogs filled up the disk drive on the primary database server. Additionally, the database had a significant number of blocked threads. Part of workflow ❹ was the application that is blocked on the database. And workflow ❸ is the service poll showing increased response times.
Now, when you look at the people involved, at Level 1 the same person may be handling all of these or, if you are lucky, the team is in good communication with each other. As specialized skills are required, however, communication across silos dissipates. As diagnosis and analysis occur, the urgency goes up for each team.
In the end, you end up with one causal problem and several resulting problems. Yet each must be triaged and treated as causal until the service is restored. It may not be until hours later in the break room that folks realize the root cause. And in some instances, you want to design in more fault tolerance, capacity, and instrumentation to make things more visible and adaptive next time.
If multiple root causes are involved, they change the event patterns, which makes after-the-fact analysis even harder. An interesting and useful feature in Incident.MOOG is the Timeline Analyzer within the Situation Room. When you start to analyze what’s going on, the timeline lets you visualize cause and effect and the changes of state across the timeline.
While it’s not always true that the first alert in a situation is causal, it is an excellent place to start. Interestingly enough, in reality event patterns may be overlapped, incomplete, and skewed. Yet the machine learning and clustering algorithms enable you to get a clearer picture of what’s going on.
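The “start with the first alert” heuristic is easy to sketch. The snippet below orders the alerts of one Situation by timestamp and picks the earliest as the place to begin the investigation. The alert records and their timestamps are invented for illustration, loosely following the outage scenario above, and the code is not the Timeline Analyzer itself.

```python
from datetime import datetime

# Hypothetical alerts belonging to one Situation, in arrival order.
situation_alerts = [
    {"time": "2015-03-02T10:06:40", "summary": "service poll response time out of bounds"},
    {"time": "2015-03-02T10:01:05", "summary": "network outage to Building A"},
    {"time": "2015-03-02T10:05:30", "summary": "application blocked on database"},
    {"time": "2015-03-02T10:03:20", "summary": "binlogs filling disk on primary database"},
]


def timeline(alerts):
    """Order alerts by the time they occurred, oldest first."""
    return sorted(alerts, key=lambda a: datetime.fromisoformat(a["time"]))


ordered = timeline(situation_alerts)
starting_point = ordered[0]  # a candidate cause, not a proven one
```

Because collection delays can skew timestamps, the earliest alert is best treated as a hypothesis to check first rather than as a verdict.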
Within Incident.MOOG, it’s also relatively easy to cluster on location (as an enrichment), by service, by process, by business group, etc. The product can cluster based on relative closeness or likeness in various data elements. But it can also do so within a given time domain.
What Incident.MOOG ends up doing is presenting this “scenario” of sorts as a Situation. It does so based on machine learning algorithms, customizations, and fact comparisons. Within the Situation, it enables collaboration across support centers. In the end, you ticket once, then bring in experts to enable faster problem resolution and help eliminate duplicate efforts. In short, you save time and reduce the Mean Time to Resolve (MTTR) for each incident.
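One way to picture clustering on likeness within a time domain is the toy sketch below. It scores two alerts by the fraction of enrichment fields on which they agree and lets an alert join a cluster only if it is both similar enough to, and close enough in time to, an existing member. The scoring rule, the thresholds, and the field names are assumptions for the example and do not describe the product’s actual algorithms.

```python
FIELDS = ("location", "service", "business_group")


def likeness(a, b, fields=FIELDS):
    """Fraction of the compared fields on which two alerts agree."""
    matches = sum(1 for f in fields if a.get(f) is not None and a.get(f) == b.get(f))
    return matches / len(fields)


def cluster(alerts, min_likeness=0.5, window=300):
    """Greedy clustering: join the first cluster that has a member that is
    both similar enough and within `window` seconds; otherwise start a new one."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        for members in clusters:
            if any(
                alert["time"] - m["time"] <= window
                and likeness(alert, m) >= min_likeness
                for m in members
            ):
                members.append(alert)
                break
        else:
            clusters.append([alert])
    return clusters


alerts = [
    {"time": 0, "summary": "link down", "location": "Building A",
     "service": "network", "business_group": "retail"},
    {"time": 60, "summary": "replication lost", "location": "Building A",
     "service": "database", "business_group": "retail"},
    {"time": 120, "summary": "high CPU", "location": "Building C",
     "service": "payroll", "business_group": "hr"},
    {"time": 4000, "summary": "link flap", "location": "Building A",
     "service": "network", "business_group": "retail"},
]
clusters = cluster(alerts)
```

In this sample, the link-down and replication alerts share a location and business group and occur a minute apart, so they cluster together. The later link flap matches on every field but falls outside the time window, so it starts a cluster of its own.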
Legacy event managers are trying to catch up here by trying to insert some level of analytics into the mix. But they still can only “think” in terms of events. And their analysis is static and applied only after the fact. Yet what happens to your “analytics” when your environment changes? Add in VMware vMotion? Or move a server from one data center to the cloud?
Summary – Don’t Be in the Wrong Place at the Wrong Time
If you are initiating workflow out of events, you have pretty much evolved your process optimization for IT Ops as far as you can. Even if you apply run book automation, you’re doing so for processes that have side effects. Your run book, your organization, your processes: all are being initiated in the wrong place.
Stop and think about this. If you are trying to optimize your processes and make your services more effective, how much further can you go today? Can you disrupt and return significant efficiencies in your environment? Isn’t it time now to transform your operations with situation-based workflow?
About the author Rob Markovich