This post is the third in the series “How AIOps Liberates IT from a Rules-based Approach.”
Summary
1. Rules have predictable results in simple environments
2. Rule results are unpredictable in complex environments
3. The scope of rules guarantees they will not work in large IT environments
4. AIOps avoids the limits of rule scope for accurate predictions of emergent behavior
Accurate prediction of future events and behavior is the object of using rules to manage IT operations. Teams lean heavily on the assumption of accuracy. It’s why organizations with a rules-based approach make big commitments to a continuous process of adding new rules, refining old ones, and evaluating outcomes. But there’s a fundamental gating factor for accuracy that remains forever impervious to constant tweaking: the teeny-tiny scope of rules.
In this context, scope is the potential range of IT scenarios that may affect system performance. The scope issue is like a pervasive dark cloud that shades accuracy of rules-based predictions. To understand why, consider an analogy that irritates virtually everyone who commutes to work daily.
What Commuting Teaches Us About Scope
The mission of a metro commute is simple: get to work safely and on time. With minor variations, the process consists of a few simple actions. Get in the car. Start the engine. Check Google Maps for accidents and congestion. Drive the car and avoid hitting hard things. Grab a coffee if there’s time. And finally slip into your desk chair before the boss notices you’re slightly late.
Decision logic for commuting is straightforward because there is one driver, a simple two-dimensional road system, a finite number of routes between points A and B, and set rules for driving. Despite this simplicity, everyone complains that traffic keeps getting worse. Frustration abounds because unpredictable traffic clogs your efficient transit even when Google shows the route as “green.”
None of this frustration would exist if you were the only driver. But a metropolitan area is a large, complex system where the independent actions of hundreds of thousands or millions of decision-making units (i.e., commuters) produce unpredictable results. If you’re lucky, the result will simply make you late. All bets are off if your car hits something hard.
The difficulty of the commute decision problem is comparable to a class of problems that mathematicians call NP-complete. For these problems, every known exact algorithm has a running time that grows explosively as the input grows, so no matter how large the computer, you can quickly scale a problem past the ability to work out an answer. As an illustration, if you can solve the commuter problem for 10 cars using your laptop, solving it the same way for 100 cars could demand more than all of the available compute power on planet Earth. The scope of this commute scenario is simply too big for a rules-based approach.
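To get a feel for how fast the scope grows, here is a minimal sketch under a deliberately simple assumption: each car independently picks one of three routes. The function name and the route count are illustrative and do not come from any real traffic model. The sketch only counts the joint route assignments a brute-force approach would have to consider; it does not solve the commute problem or prove anything about NP-completeness.

```python
def joint_route_assignments(cars: int, routes_per_car: int = 3) -> int:
    """Count every combination of route choices across all cars.

    Assumption (hypothetical): each car chooses independently among
    the same number of routes, so the count is routes_per_car ** cars.
    """
    return routes_per_car ** cars


if __name__ == "__main__":
    for cars in (1, 10, 100):
        total = joint_route_assignments(cars)
        # Report the number of digits so the huge values stay readable.
        print(f"{cars:>3} cars -> {len(str(total))}-digit count of combinations")
```

Ten cars already produce 59,049 combinations, and 100 cars produce a 48-digit number. Adding cars multiplies the work rather than adding to it, which is why scaling the hardware does not keep up.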
Limitations of Rule Scope for IT Operations
For IT Ops, a direct analog to the commute decision problem is the rules used to identify issues that may affect system performance. An individual rule is like the driver in a car. It’s simple and easy to say, “If I get alert X or Y, then the outcome will be Z.” Conclusive, highly predictable results are always available when the scope of operational variables is small (akin to your commute occurring on empty roads).
Enterprise networks are anything but empty. Depending on size and type of business activity, an enterprise network can easily generate millions or billions of event triggers every day. The complexity and potential combinations of events generated by this scale of activity dwarf the decision problem illustrated by metro commuting.
Using rules to manage IT Ops is problematic because you need a separate rule to address each scenario, all of the rules must work together, and the outcome of one rule often depends upon the outcome of many others. With a rules-based approach, you must verify all of those interactions for every scenario to ensure accurate predictions. Otherwise, the only guarantee is unpredictable results.
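The verification burden can be sketched with simple counting. The snippet below assumes, purely for illustration, that any two rules might interact and that any subset of rules might fire together. Real rule sets will have fewer meaningful interactions, but the growth pattern is the point. The function names are hypothetical.

```python
from math import comb


def pairwise_interactions(rules: int) -> int:
    """Number of distinct rule pairs that could affect each other."""
    return comb(rules, 2)


def possible_firing_sets(rules: int) -> int:
    """Number of distinct subsets of rules that could fire together."""
    return 2 ** rules


if __name__ == "__main__":
    for n in (10, 20, 1000):
        print(
            f"{n} rules: {pairwise_interactions(n)} pairs, "
            f"{len(str(possible_firing_sets(n)))}-digit count of firing sets"
        )
```

Under these assumptions, 1,000 rules give 499,500 pairs to check, and just 20 rules already allow more than a million distinct combinations of rules firing together.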
In the simpler example of metro commuting, calculating all the potential results of decisions made by 4.4 million commuters in Los Angeles is out of reach for today’s technology and knowledge of mathematics, and it may be decades before those catch up.
Rules-based IT Ops faces the same problem, because both cases suffer from the combinatorial explosion associated with NP-complete problems. The scope of any rule set is limited, so there is no way to know whether all of its specific results are accurate. Hence the peril of using rules to guarantee IT service delivery.
AIOps Avoids Limits of Rule Scope
Using artificial intelligence for IT Ops instead of a rules-based approach avoids the constraining limits of scope. AIOps allows enterprise teams to eliminate the responsibility of creating rules for every possible combination of events. Using AI and machine learning allows your monitoring system to ingest all the operational data in your enterprise and automatically apply algorithms to determine which events matter and which ones do not. Unlike a rules-based approach, AIOps teaches itself without having to account in advance for every input and output.
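As a toy contrast between the two approaches, the sketch below compares a hand-written static rule with a check that derives its own baseline from recent data. It is an assumption-laden illustration, not how any particular AIOps product works: the function names, the 90.0 threshold, the z-score limit of 3, and the sample values are all hypothetical, and real systems use far richer models.

```python
from statistics import mean, stdev
from typing import Sequence


def static_rule(value: float, threshold: float = 90.0) -> bool:
    """Hand-written rule: alert only above a fixed threshold."""
    return value > threshold


def learned_anomaly(history: Sequence[float], value: float,
                    z_limit: float = 3.0) -> bool:
    """Flag a value that deviates strongly from the observed baseline.

    The baseline (mean and standard deviation) comes from the data
    itself, so nobody has to pick a threshold in advance.
    Requires at least two history points.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is unusual.
        return value != mu
    return abs(value - mu) / sigma > z_limit


if __name__ == "__main__":
    cpu_history = [20, 22, 19, 21, 20, 23, 18, 21]  # hypothetical CPU %
    spike = 60
    print("static rule fires:", static_rule(spike))
    print("learned baseline fires:", learned_anomaly(cpu_history, spike))
```

In this example the jump to 60 never trips the fixed threshold, yet it stands out clearly against the baseline learned from the history. The design choice is to let the data define what counts as normal instead of encoding it in a rule.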
The result of using AIOps is greater accuracy in detecting and predicting events that can hamper system performance, which allows teams to keep IT Operations working at peak efficiency. The unique capability of AIOps to treat data as a whole also enables detection of emergent behavior. This is a big problem that cannot be solved with a legacy rules-based tool shackled by a teeny-tiny scope. And as with the mission of metro commuting, AIOps provides a safer path that keeps IT systems from hitting something hard.
Read the previous blog in this series: Understanding the True Cost of Rules
Read the next blog in this series: The Undecidable Challenge of Rules