Learn how the process of maintaining rules is complex, costly, risky, and can actually impede incident resolution in some cases.
This blog is the second in the series, “How AIOps Liberates IT from a Rules-based Approach”.
1. Rule maintenance is expensive
2. Rules have hidden complexity and cost
3. Rules can hinder detection and remediation
4. AIOps solves these issues
Business is rightfully obsessed with costs. Cost of goods sold. Cost to market and sell goods. Product development costs. Compliance costs. Cost of using employees versus contractors. And so forth. For IT operations, the pressure 24×7 is to ensure the entity avoids costs of sluggish systems or downtime, which impede conducting business and booking revenue.
But what about the cost of rules? Many organizations rely on legacy monitoring systems that use large sets of rules intended to spot potential issues that would prevent optimal performance. You could say those enterprises live or die based on the rules. It’s worth exploring the cost of rules because an enterprise should know if the cost is too high.
Not a Bargain
A rule is a simple concept. It states a fixed predictive input and a fixed predictive output. It takes but a moment to jot one down, which sounds inexpensive and seems like a bargain compared to other processes.
We know the bargain ends after writing the first rule because it covers just one situation. The endless variation in enterprise IT performance issues requires a different rule for each variation, and each rule must be checked for consistency against all the other rules in the set. As I described in the first blog in this series, the number of combinations grows exponentially, to levels that no existing computer can handle.
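To put rough numbers on that growth, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that every pair of rules needs one consistency check and that any subset of rules might interact. The function names are hypothetical and do not come from any monitoring product.

```python
from math import comb


def pairwise_checks(n_rules: int) -> int:
    """Consistency checks if every rule is compared once with every other rule."""
    return comb(n_rules, 2)


def rule_subsets(n_rules: int) -> int:
    """Number of possible subsets of rules, assuming any subset could interact."""
    return 2 ** n_rules


if __name__ == "__main__":
    for n in (10, 100, 1000):
        digits = len(str(rule_subsets(n)))
        print(f"{n} rules: {pairwise_checks(n)} pairwise checks, "
              f"subset count has {digits} digits")
```

Under these assumptions, 1,000 rules already need 499,500 pairwise checks, and the count of possible subsets is a 302-digit number.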
Understanding the true cost of rules means accounting for the never-ending process of creating, checking, and revising all the rules. This is a perpetual maintenance problem of gargantuan size. It’s like painting the Golden Gate Bridge: by the time you finish one pass, you’re late starting a new coat over dangerous, ever-recurring rust.
Rule maintenance requires keen insight into the interactions and nuances of a rule set. The technical knowledge and operational experience needed for effective maintenance are not something junior SOC staffers possess. Be prepared to focus your senior (i.e., expensive) SOC experts on this one.
The severe shortage of experts puts organizations in a double bind because there is no practical or cost-effective way to reliably maintain all the rules. The number of rules has exploded with the proliferation of modular, distributed applications and virtual device instances that pop on and off on demand. When rules don’t work as intended, or when there are conflicts between rules, their accuracy suffers and the SOC becomes inundated by irrelevant alerts. To address alert fatigue, SOC analysts often pull back from using rules for root-cause correlation of events. Well-meaning analysts may even turn rules off, although this action makes them more reactive than proactive to system issues. The result is higher costs from poor availability and downtime.
Restricting Rules Provokes Risks
The temptation to turn off some rules exists because in a typical operational environment, often fewer than 10 to 20 percent of alerts turn out to be critical. Needless alerts occur especially when SOC analysts over-attribute severity to particular alerts. Often this attribute is hard-coded into rules. Trying to deliberately avoid irrelevant alerts may seem efficient on the surface, but the cost will often become onerous.
Suppose the SOC decides to filter “irrelevant” rules at the source in order to process only critical alerts. The Achilles heel of this strategy is that most severe outages do not start with a critical alert. Instead, the issue usually starts with a low-severity incident. The monitoring tool reveals just the hint of a new problem, and if that hint’s indicator is turned off, the issue will be undetected. By the time an incident turns critical, analysts will never have seen the issue coming. It will be too late.
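A small Python sketch makes the lost warning time concrete. Everything in it is made up for illustration: the event timeline, the severity labels, and the function names are hypothetical, and a real monitoring pipeline will look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    minute: int      # minutes since the first hint of trouble (invented data)
    severity: str    # "info", "warning", or "critical"
    message: str


# Hypothetical timeline: the outage begins as a low-severity hint.
TIMELINE = [
    Event(0, "info", "DNS lookup latency rising"),
    Event(12, "warning", "intermittent resolution failures"),
    Event(35, "critical", "service unavailable"),
]


def filter_at_source(events, keep=("critical",)):
    """Drop every event whose severity is not in `keep`."""
    return [e for e in events if e.severity in keep]


def first_seen(events):
    """Minute of the earliest event, or None if nothing is visible."""
    return min((e.minute for e in events), default=None)


def lost_lead_time(events):
    """Minutes of warning lost by filtering out non-critical events."""
    filtered = first_seen(filter_at_source(events))
    unfiltered = first_seen(events)
    if filtered is None or unfiltered is None:
        return None
    return filtered - unfiltered


if __name__ == "__main__":
    print(lost_lead_time(TIMELINE))
```

In this invented timeline, analysts who see only critical events first learn of the problem at minute 35, having lost all 35 minutes of early warning.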
Consider a common example that can seriously hamper performance of a microservices-based system. Suppose a Kubernetes DNS error triggers a service failure, and an elevated severity for that event is hard-coded into the rule. Perhaps that rule makes sense for the particular microservice. But its failure is not automatically the cause of performance hits to other microservices, which usually are consequential. Automatically elevating severity for a particular event may misdirect SOC responses at an early stage. It’s how binary rules can easily lead you down the wrong path to remediation.
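As a rough illustration of what a hard-coded severity looks like, consider the Python sketch below. The event fields (`source`, `type`, `failing_dependents`) and the rule itself are hypothetical and are not drawn from Kubernetes or any monitoring tool.

```python
from typing import Optional


def dns_rule(event: dict) -> Optional[str]:
    """Hypothetical static rule: any Kubernetes DNS error is always 'critical'."""
    if event.get("source") == "kube-dns" and event.get("type") == "error":
        # Severity is fixed here; the rule never looks at surrounding context.
        return "critical"
    return None


# Two invented events with very different operational impact.
isolated = {"source": "kube-dns", "type": "error", "failing_dependents": 0}
widespread = {"source": "kube-dns", "type": "error", "failing_dependents": 7}

if __name__ == "__main__":
    print(dns_rule(isolated), dns_rule(widespread))
```

Both events receive the same severity because the rule cannot tell an isolated failure from one that coincides with trouble elsewhere. That is the binary behavior described above.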
Rules reward you with hidden complexity and cost. Going in, rules are attractive. They look simple, and naively, proponents will claim rules are far simpler than AI. They look predictable; what is more straightforward than true or false? Indeed, areas of science such as chemistry, genetics and life are based upon very simple physical laws. But while scientific laws are a direct analog of true or false, their scale creates enormous complexity. For a modern enterprise, rules offer a false economy that encourages stepping backward.
AIOps Reduces the Cost of Rules
The inherent messiness of maintaining rules carries potentially incalculable costs when SOC analysts are unable to detect and remediate issues that hamper performance. The scientific answer to issues posed by a rules-based approach is AIOps – using artificial intelligence and machine learning to solve the problems that rules are supposed to handle (but cannot).
AIOps eliminates the need to create rules for every possible combination of events. Instead, an AIOps system can ingest all the operational data in your enterprise and automatically apply algorithms to determine which events matter and which do not. Unlike a rules-based system, AIOps teaches itself without having to account in advance for every input and output. Using AIOps is essential for ensuring peak performance of modern systems.
For more technical background, I invite you to read an academic analysis of how AIOps with machine learning is superior to a rules-based approach in our study, “Cookbook, a Recipe for Fault Localization” (paywall). It was published in the NOMS 2018 IEEE/IFIP Network Operations and Management Symposium, so IEEE members can access it through the IEEE.
Using AIOps eliminates the typical costs associated with rules and does a much better job at ensuring system performance. Some people would call that a “twofer” – and that’s a real bargain!
Read the previous blog in this series: How Brittle Rules Frustrate IT Operations
Read the next blog in this series: The Teeny-Tiny Scope of Rules
About the author
Phil’s passion has been IT operational management ever since he co-founded OTT (better known as Micromuse). Having also invented Netcool and built RiverSoft to a successful IPO, Phil now leads the next big revolution in IT event management with Moogsoft, where he maintains a passionate commitment to innovation, including personally leading the company’s numerous product functions.