How Brittle Rules Frustrate IT Operations
Tuesday January 29 2019
Learn how rules are becoming brittle, weak and ineffective against the enormous rising tide of operational data in a modern IT infrastructure.
This blog is the first in the series, “How AIOps Liberates IT from a Rules-based Approach”.
1. Rules have illusion of simplicity
2. Rules bring exponential complexity
3. Rules do not address unpredictable events
4. AIOps solves these issues
Rules have governed IT operations monitoring and remediation for decades. Those of you who are seasoned IT Ops professionals probably see a rules-based approach as an old familiar friend. As alerts flow into a traditional monitoring tool, the simple logic of “If this condition exists, then do that” addresses each issue with reliable execution and results.
Or does it? Do rules always behave the way we assume they do?
The question is a serious one, for it challenges the fundamental assumption of rules-based predictability. Let’s peek under the hood of a rules-based approach to IT Ops and consider how rules are becoming brittle, weak and ineffective against the enormous rising tide of operational data in a modern IT infrastructure.
Illusion of Simplicity
A rule looks like what it says. It consists of a fixed input and a fixed output. A set of associated rules attempts to address a black-and-white situation without leaving ambiguous, “I’m not sure what to do next” advice. Yet what appears straightforward often is not. The cause for hesitation is the unpredictable universe of exceptions that always occurs in IT Ops.
The tiniest exception to a rule is a deviation from what that rule was designed for. Exceptions mean the rule’s logic ceases to work. Any result will be 100 percent wrong – until a new rule is created to address the exception.
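As a minimal sketch of how an exception defeats a rule, consider a single “if this, then do that” check. The `Alert` type, the `cpu_rule` function, the host names and the 90 percent threshold are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    host: str
    metric: str
    value: float


def cpu_rule(alert: Alert) -> str:
    """Hypothetical rule: if CPU is above 90 percent, page the on-call engineer."""
    if alert.metric == "cpu_percent" and alert.value > 90:
        return "page_on_call"
    return "ignore"


# The case the rule was designed for.
routine = Alert("web-01", "cpu_percent", 97.0)

# Exception 1: a batch host that is expected to run hot during nightly jobs.
# The rule still pages someone, which is a false positive.
batch = Alert("batch-07", "cpu_percent", 97.0)

# Exception 2: the same condition reported under a metric name the rule
# never anticipated. The rule stays silent, which is a missed incident.
renamed = Alert("web-02", "cpu_utilization", 99.0)

for alert in (routine, batch, renamed):
    print(alert.host, "->", cpu_rule(alert))
```

Each exception can be patched with one more rule, which is exactly how a rule portfolio starts to grow.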
An analogy affecting all of us is the tax code. On the surface, a law is something you either comply with or ignore at risk of penalty. But it’s not quite that simple. As with IT Ops, tax laws have many exceptions (“loopholes”). For example, the U.S. Tax Cuts and Jobs Act of 2017 contains 503 pages of new policy. The changes are so big and complex that CPAs are wondering exactly how they may apply in the real world. The U.S. Treasury Dept. is writing rules (regulations) that will answer some of those questions over several years. By then, there will be thousands, possibly tens of thousands of new pages of rules attempting to cover all exceptions to one law.
For IT Ops, a large modern enterprise is getting tens or hundreds of thousands – even millions – of alerts every day. Trying to comprehensively and effectively address all those alerts with a rules-based approach is quite similar to tax compliance. Gray areas will never disappear with rules.
Rules are easy to create. And as noted, you will need to create many of them to address exceptions. This is where IT Ops gets tricky: if you grow the rule portfolio from one rule to two, you have to know whether the two rules are 100 percent consistent with each other. Complexity then rises exponentially as you add more rules to the same set.
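To make the consistency question concrete, here is a small sketch that compares every pair of rules and flags those that fire on overlapping conditions but demand different actions. The `Rule` fields, the rule names and the thresholds are assumptions made for this example, not a description of any particular monitoring tool.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    metric: str
    low: float    # rule fires when low <= value < high
    high: float
    action: str


def conflicts(a: Rule, b: Rule) -> bool:
    """Two rules conflict if they can fire on the same value but disagree on the action."""
    same_metric = a.metric == b.metric
    ranges_overlap = a.low < b.high and b.low < a.high
    return same_metric and ranges_overlap and a.action != b.action


def find_conflicts(rules: List[Rule]) -> List[Tuple[str, str]]:
    """Check every pair of rules; the number of pairs alone is n * (n - 1) / 2."""
    return [(a.name, b.name) for a, b in combinations(rules, 2) if conflicts(a, b)]


rules = [
    Rule("page_on_high_cpu", "cpu_percent", 90, 101, "page_on_call"),
    Rule("suppress_batch_window", "cpu_percent", 80, 101, "suppress"),
    Rule("warn_on_disk", "disk_percent", 85, 101, "open_ticket"),
]

print(find_conflicts(rules))
```

Even this toy check only covers pairs. Interactions among three or more rules, or rules whose effect depends on the order in which they run, need far more comparisons.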
Calculating the potential combinations (strictly, the possible orderings) for a rule set is a factorial function, which is the product of all positive integers less than or equal to n (denoted by n!). For example, with five rules, there are 120 possible combinations. With six rules there are 720. Ten rules will generate 3,628,800 potential combinations. And 100 rules may result in roughly 9.3 × 10^157 – that’s about nine followed by 157 zeros.
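The factorial growth is easy to check with Python’s standard library. This is just a quick calculation of n! for a few rule-set sizes; the helper name `orderings` is made up for the example.

```python
import math


def orderings(rule_count: int) -> int:
    """Number of possible orderings of rule_count rules, i.e. rule_count!."""
    return math.factorial(rule_count)


for n in (5, 6, 10):
    print(f"{n} rules -> {orderings(n):,} orderings")

# 100! is too long to read comfortably, so show it in scientific notation.
print(f"100 rules -> about {float(orderings(100)):.2e} orderings")
print(f"100! has {len(str(orderings(100)))} digits")
```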
Those examples are tiny compared to combination totals for a typical enterprise portfolio of thousands of rules.
Testing an enterprise portfolio of rules to ensure consistent accuracy is a major issue. Each combination of rules must be verified to avoid false positive alerts or missed critical incidents. This is the kind of combinatorial explosion that computer scientists associate with “NP-complete” problems: no known algorithm, and no existing computer, scales to exhaustive verification of a portfolio this size.
Clearly, to say that using a rules-based system for enterprise IT Ops is “simple” would be a misnomer! It’s virtually impossible to understand the effects of alert exceptions in a collection of rules.
Challenged by Unknown Unknowns
The classic problem of logical induction arises when you attempt to base IT Ops decisions on results generated by a rules-based tool. To frame this problem, consider the Black Swan Theory presented by Nassim Nicholas Taleb in 2007. In applying statistics to trading systems, he drew on the old presumption that black swans do not exist, a belief that had to be abandoned once black swans were discovered in Australia.
Taleb’s theory offers two relevant points: (1) hard-to-predict, rare events have a disproportionate role in complex operations; and (2) rules do a poor job at predicting the probability of these events.
You can see where I’m going with this. A “black swan event” has become a phrase used in data science for things you can’t predict, such as the discovery of potentially crippling, unusual events in IT Ops. And this is precisely why a rules-based scheme for IT Ops will always leave you on very shaky ground – the very place where your SOC team needs as much certainty as possible to take decisive action.
Beyond Rules, More Certainty
Issues like these are why more large enterprises are turning to data science for clearer insights and more control over IT Ops. With artificial intelligence and machine learning, a SOC team is able to process all IT Ops data in a manner that is tolerant of exceptions, without the limitations of rules.
Analyzing all the data is possible with a simple but very powerful characteristic of AI, as applied by Moogsoft AIOps. We use algorithms to know when something is unusual. Think of an unusual feature of the event itself, rather than an event that is statistically unusual compared with everything we have seen before.
Moogsoft’s unsupervised algorithms are similar to how our brains work when confronted with a new situation. Imagine I show you two pictures of a field, one with an animal and one without. It’s easy to spot the animal, even if you have never seen that particular animal. Your brain simply processes the visual imagery and recognizes that something that moves, and that forms a correlated portion of the image unlike the background, is likely to be an animal.
Making rules replicate this behavior of your brain would require a rule set with detailed descriptions of every type of animal. The rule set would also have to address the exceptions, such as mistaking a car for a black swan. There would also need to be rules for every possible color of the same type of animal, such as a white swan versus the black ones. Rules are brittle because there is no possible way to satisfy all of these requirements.
The human brain’s methodology for processing data provides a scalable model for AIOps, which uses very similar algorithms so no training is required. The AIOps algorithms are a sophisticated way of spotting collections of your enterprise’s events that are behaving as a related incident, not just noisy unimportant background.
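The post does not describe Moogsoft’s algorithms, so the following is only a toy illustration of the general idea: grouping events into candidate incidents without any predefined rule about what an incident looks like. It assumes events are simple (timestamp, host) pairs and uses a plain time-gap heuristic; the function names, the 30-second gap and the minimum group size are all hypothetical.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, source host)


def group_by_time_gap(events: List[Event], max_gap: float = 30.0) -> List[List[Event]]:
    """Put events in the same group while consecutive events are at most max_gap apart."""
    groups: List[List[Event]] = []
    for event in sorted(events):
        if groups and event[0] - groups[-1][-1][0] <= max_gap:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


def candidate_incidents(events: List[Event], max_gap: float = 30.0,
                        min_size: int = 3) -> List[List[Event]]:
    """Keep only bursts of related events; isolated events are treated as background noise."""
    return [g for g in group_by_time_gap(events, max_gap) if len(g) >= min_size]


events = [
    (10.0, "db-01"),
    (400.0, "web-03"),
    (405.0, "db-02"),
    (412.0, "web-01"),
    (420.0, "lb-01"),
    (900.0, "web-02"),
]

for incident in candidate_incidents(events):
    print([host for _, host in incident])
```

Nothing in the sketch names a host, a metric or a threshold for a specific failure; the burst stands out only because its events behave as a group, unlike the isolated ones around it.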
AIOps replaces the uncertainty of brittle rules by quickly and effectively finding event patterns that are a priority for SOC teams – even if results are unanticipated surprises that might otherwise disrupt IT operations.
If your enterprise IT ops relies on brittle monitoring rules, considering a modern data science-based approach will help instill deep, reliable visibility into what your operational data really mean. AIOps is quickly becoming a mandatory shift in approach to help ensure peak performance and integrity of IT operations.
Read the next blog in this series: Understanding the True Cost of Rules