Picture this… You get into the office at 8:30am and open your email. The first one that catches your eye is from an executive at your company and has the subject line **Website down: Fix ASAP**. You say “Oh shit,” and open your alert feed.
You see 700+ new alerts, with more flowing in by the second. A “System reachability check failed” alert catches your eye. Then 10 of the same AWS EC2 alerts. After grouping those alerts, you see an “SSH access denied” alert. You stop and think about this for a second, and before you know it there are 20 new alerts in your email. You’ve lost your train of thought, so you start from scratch. This is the reality of IT monitoring that relies on human cognitive capability.
In any form of problem solving, success is essentially dependent on two things: The first is a sufficient quality and volume of information; the second is the problem solver’s ability to process this information and piece it all together.
In today’s world of IT monitoring, the first part is easily achievable. Most organizations I speak with now use tools like AppDynamics or New Relic to monitor their application stack, tools like Splunk or ELK for logging, tools like Nagios and Zenoss to monitor their infrastructure health, and many others. Today’s IT organizations have superb visibility across their IT production stacks and a wealth (term chosen euphemistically) of operational data to work with.
The second part is where we struggle, and the culprit is the inevitable limitation of human memory. Psychologists distinguish between different types of human memory: long-term, short-term, and working. In IT monitoring, where operators frantically investigate a storm of alerts from disparate toolsets to understand why a severity-1 outage has occurred, they are relying on their short-term and working memories.
What Does Reliance on Human Memory Mean for IT Monitoring?
Our short-term memory is what we use to recall information as it is being processed, and our working memory is what we use to manipulate that information. As anyone who works in IT monitoring knows, IT incidents tend to be unanticipated and unpredictable. While certain event messages may be common, each new incident brings a brand new context. This means that every time a major incident occurs, operators typically need to process and interpret hundreds of events to understand what really happened.
The issue is that human memory is subject to two core limitations: limited capacity, and limited duration.
Limited Cognitive Capacity in a World of Big Data
Human cognitive capacity is a well-studied subject, and the findings are hard to dispute. ‘Miller’s Law’ (also known as the ‘Magical Number Seven, Plus or Minus Two’) holds that our short-term memory can typically hold five to nine items at any given moment. It’s now believed that the real numbers are smaller still, especially for text as opposed to numbers.
In the course of speaking with operators and studying impact, we’ve found that, by the time an IT team at a large organization detects a severity-1 incident, there are typically 50+ related events that get fired. How can an operator possibly identify 50+ anomalous and related events from a sea of hundreds in a timely fashion? Well, they can’t.
Read Fast, Because Your Memory is, well, Short-Term
In the scenario of a severity-1 incident that is impacting the performance and/or availability of a business service, time is of the essence. Yet, an even harsher time constraint is that of your own memory.
After roughly 18 seconds, the content in your short-term memory begins to decay. Chances are slim that an operator will be able to view hundreds of alerts flooding into their event management console or email client, identify anomalous activity, and understand the relationships between alerts, all within 18 seconds.
The consequences of this limitation are excessive ticket creation and incomplete tickets. In our conversations with IT ops professionals at large organizations, we typically hear that 70 to 90% of trouble tickets are false, meaning that they are closed without any action taken.
Leverage Algorithms to Accommodate Human Limitation
Don’t misread this as a claim that algorithms can replace humans. Rather, algorithms purpose-built for IT operations can do the heavy lifting on the tasks that humans are poorly suited to: analyzing millions of events, identifying anomalies, and identifying complex relationships, all in real time. By leveraging such algorithms, available in tools like Moogsoft, operators can focus on restoring service as quickly as possible.
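To make the idea concrete, here is a minimal Python sketch of two of the simplest things a machine can do with an alert storm: collapse duplicates, and group alerts that arrive close together in time. This is an illustration only, not how Moogsoft or any other product works. The `Alert` fields, the function names, and the 60-second gap are all assumptions chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape; real tools carry many more fields.
    timestamp: float  # seconds since an arbitrary start
    source: str       # e.g. the monitoring tool or host that fired it
    message: str


def cluster_by_time(alerts, max_gap_seconds=60.0):
    """Group alerts into clusters where each alert arrives within
    max_gap_seconds of the previous one (an assumed, naive heuristic)."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if clusters and alert.timestamp - clusters[-1][-1].timestamp <= max_gap_seconds:
            clusters[-1].append(alert)
        else:
            clusters.append([alert])
    return clusters


def summarize(alerts, max_gap_seconds=60.0):
    """Return one summary per cluster, with duplicate alerts collapsed
    into (source, message) pairs and a count, most frequent first."""
    summaries = []
    for cluster in cluster_by_time(alerts, max_gap_seconds):
        counts = Counter((a.source, a.message) for a in cluster)
        summaries.append({
            "start": cluster[0].timestamp,
            "end": cluster[-1].timestamp,
            "total": len(cluster),
            "distinct": counts.most_common(),
        })
    return summaries


if __name__ == "__main__":
    # Sample data loosely modeled on the opening scenario; the sources
    # and the EC2 message text are placeholders.
    storm = [Alert(0, "ec2", "System reachability check failed")]
    storm += [Alert(5 + i, "ec2", "EC2 alert (placeholder)") for i in range(10)]
    storm.append(Alert(20, "ssh", "SSH access denied"))
    for summary in summarize(storm):
        print(summary["total"], "alerts,", len(summary["distinct"]), "distinct")
```

Even this naive version turns a dozen raw alerts into three distinct items, a number that fits in short-term memory. Real correlation engines go well beyond time proximity, but the division of labor is the same: the machine compresses, the human decides.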
I’ll leave you with these wise words:
“Rage with the machine, not against it”
– Tom Reksten, Moogsoft Sales Guru
Get started today with a free trial of Incident.MOOG, a next-generation approach to IT Operations and Event Management. Driven by real-time data science, Incident.MOOG helps IT Operations and Development teams detect anomalies across their production stack of applications, infrastructure and monitoring tools, all under a single pane of glass.