You see 700+ new alerts and more flowing in by the second. You see a “System reachability check failed” alert that catches your eye. Then 10 of the same AWS EC2 alerts. After grouping those alerts, you see an “SSH access denied” alert. You stop and think about this for a second, and before you know it there are 20 new alerts in your email. You lost your prior train of thought, so you start from scratch. This is the reality of IT monitoring that relies on human cognitive capability.
In any form of problem solving, success is essentially dependent on two things: The first is a sufficient quality and volume of information; the second is the problem solver’s ability to process this information and piece it all together.
In today’s world of IT monitoring, the first part is easily achievable. By now, most organizations I speak with are using tools like AppDynamics or New Relic to monitor their application stack, tools like Splunk or ELK for logging, tools like Nagios and Zenoss to monitor their infrastructure health, and many others. Today’s IT organizations have superb visibility across their IT production stacks and a wealth (term chosen euphemistically) of operational data to work with.
The second part is where we struggle, and the culprit is the inevitable limitation of human memory. Psychologists distinguish between different types of human memory: long-term, short-term, and working. In the case of IT monitoring, where operators are frantically investigating a storm of alerts from disparate toolsets to understand why a severity-1 outage has occurred, operators are relying on their short-term and working memories.
What Does Reliance on Human Memory Mean for IT Monitoring?
Our short-term memory is what we use to recall information as it is being processed, and working memory is used to manipulate that information. As anyone who works in IT monitoring knows, IT incidents tend to be unanticipated and unpredictable. While certain event messages may be common, each new incident brings a brand new context. This means that every time a major incident occurs, operators typically need to process and interpret hundreds of events to understand what really happened.
The issue is that human memory is subject to two core limitations: limited capacity and limited duration.
Limited Cognitive Capacity in a World of Big Data
Human cognitive capacity is a well-studied subject, and the findings are undeniable. “Miller’s Law” (also known as the “Magical Number Seven, Plus or Minus Two”) explains that our short-term memory can typically hold five to nine items at any given moment in time. It’s now believed that the numbers are actually smaller, especially with text as opposed to numbers.
In the course of speaking with operators and studying impact, we’ve found that, by the time an IT team at a large organization detects a severity-1 incident, 50+ related events have typically fired. How can an operator possibly identify 50+ anomalous and related events from a sea of hundreds in a timely fashion? Well, they can’t.
Read Fast, Because Your Memory is, well, Short-Term
In the scenario of a severity-1 incident that is impacting the performance and/or availability of a business service, time is of the essence. Yet an even harsher time constraint is that of your own memory.
After roughly 18 seconds, the content in your short-term memory begins to decay. Chances are slim that an operator will be able to view hundreds of alerts flooding into their event management console or email client, identify anomalous activity, and understand relationships between alerts, all within 18 seconds.
The consequences of this limitation are excessive ticket creation and incomplete ticket creation. Based on our interactions with IT ops professionals at large organizations, we typically hear that 70 to 90% of trouble tickets are false, meaning that they are closed without any action taken.
Leverage Algorithms to Accommodate Human Limitation
Don’t misinterpret this statement by thinking that algorithms can replace humans and their limitations. Rather, algorithms purpose-built for IT operations can do the heavy lifting on tasks that humans handle poorly: analyzing millions of events, identifying anomalies, and identifying complex relationships, all in real time. By leveraging such algorithms, available in tools like Moogsoft, operators can focus on restoring service as quickly as possible.
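To make the idea concrete, here is a minimal Python sketch of two of the simplest reduction steps: collapsing duplicate alerts and grouping the remainder by time proximity. It is only an illustration and not how Moogsoft or any other product works. The `Alert` fields, the function names, and the 60-second window are all assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape; real monitoring tools expose many more fields.
    timestamp: float  # seconds since an arbitrary reference point
    source: str       # host or service that raised the alert
    message: str


def deduplicate(alerts):
    """Collapse alerts sharing (source, message) into one entry with a count.

    Returns tuples of (first_seen, source, message, count), sorted by first_seen.
    """
    counts = defaultdict(int)
    first_seen = {}
    for alert in alerts:
        key = (alert.source, alert.message)
        counts[key] += 1
        if key not in first_seen or alert.timestamp < first_seen[key]:
            first_seen[key] = alert.timestamp
    return sorted(
        (first_seen[key], key[0], key[1], counts[key]) for key in counts
    )


def cluster_by_time(deduped, window_seconds=60.0):
    """Group entries whose first occurrence is within window_seconds of the
    previous entry. The window size is an arbitrary assumption."""
    clusters = []
    for entry in deduped:
        if clusters and entry[0] - clusters[-1][-1][0] <= window_seconds:
            clusters[-1].append(entry)
        else:
            clusters.append([entry])
    return clusters


if __name__ == "__main__":
    # Ten repeats of one alert, a related alert shortly after, and an
    # unrelated alert much later.
    alerts = [
        Alert(float(t), "ec2-1", "System reachability check failed")
        for t in range(10)
    ]
    alerts.append(Alert(12.0, "ec2-1", "SSH access denied"))
    alerts.append(Alert(500.0, "db-1", "Disk usage high"))

    for cluster in cluster_by_time(deduplicate(alerts)):
        print(cluster)
```

Even this naive version turns twelve raw alerts into two candidate incidents, which fits comfortably within the five-to-nine-item limit described earlier. Real correlation would also need to consider topology, text similarity, and other signals that a fixed time window ignores.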
I’ll leave you with these wise words:
“Rage with the machine, not against it”
– Tom Reksten, Moogsoft Sales Guru
About the author Sahil Khanna
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.