It’s impossible to anticipate every potential IT network incident, but traditional approaches only work for anticipated incidents.
A colleague of mine likes to explain the value of Moogsoft by quoting the former US Secretary of Defense, Donald Rumsfeld:
“…As we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know.”
The historical approach to Service Assurance produced techniques and products based on rules and models. In other words, they rely on known knowns: once a problem has occurred, a rule or a model can be created to catch that event if it recurs. This approach can also help, up to a point, with the known unknowns: if you have an inventory of all of your systems, then in theory when something happens you can follow the branches of the dependency tree back until you find the root cause.
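To make the inventory-based approach concrete, here is a minimal sketch of walking a dependency inventory back to a root cause. The component names, the `INVENTORY` mapping, and the `find_root_causes` helper are all hypothetical, invented for illustration; the sketch also assumes the inventory contains no circular dependencies.

```python
# Hypothetical inventory: each component maps to the components it depends on.
# Assumed to be acyclic; a real inventory would need cycle protection.
INVENTORY = {
    "web-frontend": ["app-server"],
    "app-server": ["database", "cache"],
    "database": ["storage-array"],
    "cache": [],
    "storage-array": [],
}


def find_root_causes(component, failing, inventory):
    """Follow failing dependencies downward and return the deepest failing ones."""
    failing_deps = [d for d in inventory.get(component, []) if d in failing]
    if not failing_deps:
        # Nothing this component depends on is failing, so treat it as a root cause.
        return {component}
    roots = set()
    for dep in failing_deps:
        roots |= find_root_causes(dep, failing, inventory)
    return roots


failing = {"web-frontend", "app-server", "database", "storage-array"}
root_causes = find_root_causes("web-frontend", failing, INVENTORY)

# A component that was never added to the inventory has no recorded
# dependencies, so the walk stops immediately and learns nothing.
unknown = find_root_causes("new-vm", {"new-vm", "storage-array"}, INVENTORY)
```

The second call hints at the weakness of this technique: the answer is only as good as the inventory, and anything created after the inventory was last updated is invisible to the walk.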
So far, so good. But modern IT infrastructures are too complex and evolve too rapidly for these relatively static approaches to be sufficient. Virtualization means that the various components and devices don’t stay put, but are created, moved, altered and destroyed at very high speed and with little or no human intervention. On top of this infrastructure churn, developers are no longer satisfied with months-long deployment cycles, but expect to be able to do very frequent or even continuous releases in order to satisfy the ever-growing pressure of business requirements.
Catching Unknown Unknowns
In this environment, it is extremely rare for an outage to map cleanly to a single event or failure. Problems begin in one area, then migrate and spread, sometimes quickly and sometimes slowly, across multiple other areas. By the time the problem is visible to end users, cause and effect may be widely separated in time and logical space.
In this sort of rapidly-evolving environment, there are more and more unknown unknowns that can cause problems. By their nature, these types of outages will also take longer to diagnose, as they do not fall neatly into any of the existing categories. They may also span multiple teams’ domains, further delaying detection, diagnosis, and ultimately, resolution.
Moogsoft’s algorithms catch these unknown unknowns by identifying correlations between apparently unrelated events, across technology and business domains. These events are clustered together without the need for an inventory or a model, let alone a system of rules that addresses each possible contingency. This algorithmic approach is what enables us to deal with change and evolution, supporting IT Operations in its mission to enable the business.
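As a rough illustration of clustering events without an inventory or rules, the sketch below groups events purely by how close together they arrive. This is not Moogsoft's algorithm, which the article does not describe; it is a deliberately simple stand-in, and the `Event` class, the `cluster_by_time` function, and the 60-second `max_gap` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since some arbitrary start
    source: str
    message: str


def cluster_by_time(events, max_gap=60.0):
    """Group events so that consecutive events in a cluster are at most max_gap apart."""
    clusters = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if clusters and event.timestamp - clusters[-1][-1].timestamp <= max_gap:
            clusters[-1].append(event)  # close enough to the previous event
        else:
            clusters.append([event])  # too far apart: start a new cluster
    return clusters


# Hypothetical events from unrelated-looking sources.
events = [
    Event(45, "app-server", "request timeouts"),
    Event(0, "storage-array", "latency high"),
    Event(900, "printer", "paper low"),
    Event(20, "database", "slow queries"),
]
clusters = cluster_by_time(events)
first_cluster_sources = [e.source for e in clusters[0]]
```

No dependency map is consulted here: the storage, database, and application events end up in the same cluster only because they occur close together, while the unrelated printer event falls into its own. A production system would weigh far more signals than arrival time alone.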
The Fourth Quadrant: Unknown Knowns
To return to the story of Mr. Rumsfeld and his quadrant: he may have popularized the concept, to the point that it is often known as a Rumsfeld Quadrant, but it has its roots in a much earlier cognitive psychology tool called the Johari Window. In this view, there is a fourth quadrant that is equally important, which in Rumsfeldian terminology we could call the unknown knowns.
This is knowledge and expertise that exists within the IT organization but is not readily accessible to the people who need it. In most companies, the complexity of the IT infrastructure is mirrored in the complexity of the support organization in charge of that infrastructure.
Numerous different teams each have narrowly focused views, whether “horizontal” on a particular layer of the application stack, or “vertical” on a particular business service — but it is very hard to get an overall view of the entire environment. Each team contains deep knowledge about its own domain, but other teams don’t even know how to ask the questions to access that expertise.
These are the Johari Window’s unknown knowns — and these are exactly what Moogsoft can help to bring to the surface and take advantage of.
When our algorithms detect a Situation — a cluster of related alerts — we have already made an unknown into a known, without relying on rules or models. The next step, though, is to invite experts from all affected teams into the Situation Room, where they can collaborate and share their expertise — in other words, making those unknown knowns available and known to the other teams.
About the author
Dominic Wellington is the Director of Strategic Architecture at Moogsoft. He has been involved in IT operations for a number of years, working in fields as diverse as SecOps, cloud computing, and data center automation.