While developers are high-fiving at their Nth release of the day, monitoring teams are frantically sifting through the thousands of incoming alerts because that newest release just broke. It’s not uncommon for monitoring professionals to say things like “My eyes are bleeding” and “F**k it, just restart the servers.”
I would argue that IT monitoring is so challenging and reactive today because of false assumptions that IT organizations make about the nature of IT infrastructures and human capability.
It all boils down to 3 myths about monitoring…
Myth #1: IT Environments can be Modeled
A ‘model’ is a representation of the composition and interrelationships of an IT production stack, along with the potential events that could occur. Chances are, your organization has spent years (potentially decades) modeling your environment, with entire teams dedicated to this task. Your organization is likely using IT monitoring and management tools that are reliant upon models. In the ‘90s, this worked quite well. IT environments were relatively static and models were quite accurate and useful for maintaining service quality.
Today, change is endless and rapid.
Reality: You Can’t Model What You Haven’t Seen Before
In the (in)famous words of Donald Rumsfeld, “There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns — the ones we don’t know we don’t know.”
We’ve been over this before, and Rumsfeld’s words are relevant here because modern IT environments are highly unpredictable. Change happens on a sub-second basis, and the features and anomalies that occur could be things you’ve never seen before. Therefore the concept of building and relying upon models for managing IT incidents is flawed. Instead of using models, IT organizations need to leverage algorithms to analyze massive volumes of IT telemetry in real-time to understand patterns and anomalous activity.
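To make the contrast with a hand-built model concrete, here is a minimal sketch of one simple algorithmic approach: a rolling z-score that learns "normal" from the most recent readings of a metric instead of from a predefined model. The class name, function name, window size, threshold, and sample latencies are all illustrative assumptions; this is not the method of any particular product, and real telemetry would call for something more robust.

```python
from collections import deque
from math import sqrt


class RollingAnomalyDetector:
    """Flags values that deviate sharply from a sliding window of recent values.

    Hypothetical helper for illustration; the defaults are assumptions.
    """

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        self.values = deque(maxlen=window)  # only the most recent readings are kept
        self.threshold = threshold          # z-score above which a value is flagged
        self.min_samples = min_samples      # readings needed before judging anything

    def observe(self, value):
        """Return True if `value` is anomalous relative to the recent window."""
        anomalous = False
        if len(self.values) >= self.min_samples:
            mean = sum(self.values) / len(self.values)
            variance = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = sqrt(variance)
            if std > 0:
                anomalous = abs(value - mean) / std > self.threshold
            else:
                # A perfectly flat window: any change at all is a deviation.
                anomalous = value != mean
        self.values.append(value)
        return anomalous


def find_anomalies(stream, **kwargs):
    """Return (index, value) pairs for every reading flagged as anomalous."""
    detector = RollingAnomalyDetector(**kwargs)
    return [(i, v) for i, v in enumerate(stream) if detector.observe(v)]


# Made-up response times in milliseconds, with one obvious spike.
latencies = [100, 102, 98, 101, 99, 100, 103, 97, 500, 101, 100]
print(find_anomalies(latencies))  # → [(8, 500)]
```

Nothing here describes what a spike looks like in advance; the baseline comes entirely from the data just seen. One limitation of this sketch is that a flagged value still enters the window, so a large spike temporarily widens the baseline and can mask smaller anomalies that follow it.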
Myth #2: Humans Know Normal from Abnormal
Unless you have access to NZT-48 from Limitless, you are limited, like every human, in your ability to analyze raw information. When presented with information, we can only recognize patterns and abnormalities down to a certain level of granularity. Unfortunately, when looking for patterns and abnormalities across thousands of alerts, humans are about as useful as a bucket of water in a forest fire. I often hear, “We’ve got a highly skilled team that knows our environment inside-out,” yet those same teams occasionally miss suspicious alerts that eventually turn into service-impacting incidents.
Reality: Machines Know Normal from Abnormal
What takes humans hours to accomplish can be done in just seconds with machines, and with far better accuracy. Machines can analyze and correlate millions of events and understand subtle abnormalities in real time. Algorithms with an ITOps lens can now be applied to massive volumes of monitoring telemetry to dramatically improve the signal-to-noise ratio, presenting humans with a clear understanding of when and why a service may be at risk of impact.
Myth #3: A Single Root-Cause
In legacy IT environments (built on mainframe or client-server architectures), there were single points of failure. Root Cause Analysis entailed identifying the single fault that caused service impact. Operators could effectively apply models and rules to assist manual efforts in looking for anomalies. Today, most IT organizations have adopted cloud architectures, virtualization, failover, and redundancy, along with many other technology trends that have dramatically shifted the way that business services function. Furthermore, these trends have completely shifted the way in which services fail.
Reality: Major Outages have Multiple Root-Causes
Because of the rate of change and scale of IT environments today, incidents occur more often than ever before. When a virtual element fails, for example, another virtual element will immediately take its place, minimizing impact to service. The result, however, is that when a major outage does occur, it tends to result from multiple contributing failures. Standard approaches to Root Cause Analysis assume a single root cause and are therefore flawed. Today’s approach needs to correlate anomalous activity and leverage operator-supplied feedback to understand causality at a broader level.
Face These Realities, Avoid Impact
By ridding your organization of these myths and the false assumptions that go along with them, you can improve the way your monitoring teams detect and resolve IT incidents.
It is essential to acknowledge that IT infrastructures have changed dramatically over the years, and the way in which you approach incident management must evolve.
About the author: Sahil Khanna
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.