Some system failures are real-world problems with no clear yes-or-no answer; for those, there’s AIOps.
This blog is the fourth in the series, “How AIOps Liberates IT from a Rules-based Approach”.
1. Rules-based solutions cannot guarantee to fully determine or decide the root cause of a system failure.
2. The unpredictable circumstances of real-world failures often confound rules that face undecidable problems, which delays remediation.
3. Unlike a rules-based system, AIOps teaches itself without having to account in advance for every input and output.
We all nod to the wisdom of the maxim, “If your only tool is a hammer, then every problem looks like a nail.” Yet for many IT operations centers, the primary (if not only) tool for ensuring stability and performance is a rules-based service assurance system. Perhaps, in a simpler world, this was enough to do the job. But with the explosion of microservices-based virtual IT in a potpourri of cloud environments, it’s worth asking whether that approach is wise, especially when the problems that rules are trying to solve are undecidable.
Why Rules Are Undecidable
The rules approach assumes that with enough time and effort, you can eventually find a solution. The goal is to identify the fix that will bring a service back online. The presumption is that, given the right set of rules and inputs, you can always predict the correct outcome by identifying the correct underlying cause. However, for complex problems, such as predicting the performance of enterprise IT systems, rules fall short. These problems are difficult or impossible to solve with rules.
From a logical perspective, we say these problems are undecidable for rules. In computability theory, an undecidable problem is a decision problem for which it is impossible to construct an algorithm that always leads to a correct yes-or-no answer.
To illustrate, a common and important use case for enterprise IT infrastructure monitoring is known as downstream suppression; it involves a dependency relationship. Consider a simple network with 100 servers. A rules-based monitoring system pings each server to determine whether it is alive. A rule states: if a ping gets no response, then that server is down.
Ambiguity arises when servers are connected through a switch that fails. Pings to those servers go unanswered even though the servers themselves may be healthy, so that particular rule reports phantom failures and provides inaccurate information for remediation. You need another rule to ping switches in the network to distinguish what is really down. This scenario shows why simple rules quickly grow into complex rule sets in order to account for all potential variations of related infrastructure.
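To make that growth concrete, here is a minimal Python sketch of downstream suppression with just two rules. The `Server` class, the `classify` function, and the device names are hypothetical, invented for illustration; they are not taken from any monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    switch: str  # hypothetical: name of the switch this server sits behind


def classify(servers, ping_ok, switch_ok):
    """Apply two rules in order: suppress alerts behind a dead switch,
    otherwise treat an unanswered ping as a dead server."""
    report = {}
    for server in servers:
        if ping_ok[server.name]:
            report[server.name] = "up"
        elif not switch_ok[server.switch]:
            # Rule 2: the switch is down, so the failed ping may be a phantom.
            report[server.name] = "suppressed (switch down)"
        else:
            # Rule 1: the ping failed and the switch is fine.
            report[server.name] = "down"
    return report


servers = [Server("web1", "sw1"), Server("web2", "sw1"), Server("db1", "sw2")]
ping_ok = {"web1": False, "web2": False, "db1": False}
switch_ok = {"sw1": False, "sw2": True}

report = classify(servers, ping_ok, switch_ok)
for name, status in report.items():
    print(name, "->", status)
```

Even this toy version needs a second rule and a topology lookup, and it still cannot say whether web1 and web2 are actually healthy behind the failed switch.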
But what happens if the switch and servers are all in a data center that loses power? A rules-based solution has no way to distinguish between phantom and real failure. The scenario is ambiguous – it’s undecidable because rules cannot work out the most likely cause. IT operators must get information from another source to determine the real problem. This is a very simple illustration, but it captures the fundamental problem: rules cannot guarantee to fully determine or decide the root cause.
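A small sketch shows why the power-loss case is ambiguous. It assumes a hypothetical topology with one data center, one switch, and two servers; `TOPOLOGY`, `unreachable_if`, and `consistent_causes` are illustrative names, not part of any real tool. Both candidate causes predict exactly the same set of unreachable devices, so ping results alone cannot pick between them.

```python
# Hypothetical topology: data center -> switch -> servers behind that switch.
TOPOLOGY = {"dc1": {"sw1": ["web1", "web2"]}}


def unreachable_if(cause):
    """Return the set of devices that would stop answering pings for a cause."""
    kind, name = cause
    if kind == "power":
        # Losing power takes out every switch and server in the data center.
        switches = TOPOLOGY[name]
        devices = set(switches)
        for servers in switches.values():
            devices.update(servers)
        return devices
    if kind == "switch":
        # A dead switch hides itself and every server behind it.
        for switches in TOPOLOGY.values():
            if name in switches:
                return {name, *switches[name]}
    return set()


def consistent_causes(observed, candidates):
    """Keep every candidate cause whose prediction matches the observation."""
    return [c for c in candidates if unreachable_if(c) == observed]


observed = {"sw1", "web1", "web2"}  # everything fails to answer pings
candidates = [("switch", "sw1"), ("power", "dc1")]

matches = consistent_causes(observed, candidates)
print(matches)  # both causes survive, so the rules cannot decide
```

Breaking the tie requires evidence from outside the ping rules, such as a power feed sensor, which is exactly the extra information operators end up hunting for by hand.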
Rules Make Real-World Failures Especially Undecidable
The simple scenario gets worse in a big enterprise with thousands of servers, tens of thousands of apps, and millions of virtual circuits, all of which have dependencies on the others. Any of those elements can generate an error. In a rules-based system, there are loads of these undecidable scenarios.
We saw one occur this month when Wells Fargo Bank’s mobile app and web site became unusable by customers. On February 7, a data center outage in Minnesota killed mobile app and web site usage for the bank’s U.S. customers with fallout continuing into the next day. Customers also reported being unable to use Wells Fargo credit and debit cards as well as company ATMs.
This outage is germane because it illustrates the cascading consequences of a rules-based solution (presumably IBM Netcool Network Management) being confused in determining the root cause of an event. The bank’s undecidable disaster started with the detection of smoke in the data center, apparently triggered by routine maintenance. It’s unclear if smoke was actually present. Then power was automatically shut down. Failover was supposed to occur, but took unusually long for undisclosed reasons. The result was national outrage by affected customers.
Dependencies in Wells Fargo Bank’s scenario did not all come into play at a single fixed point in time. What was true when the smoke occurred (or a smoke detector malfunctioned) was not true when power shut down. The logic had to change as a vast multitude of dependent systems became affected. Just for users to log in to the online banking system, the required elements that were affected included applications, databases, and systems for authentication, authorization, and DNS, along with lots of hardware and a huge amount of interconnectivity and interdependency. These varying dependencies triggered delays in remediation, and the bank acknowledged intermittent recurrences of fallout.
Rules cannot guarantee to fully determine or decide the root cause.
AIOps Provides More Insight for Correct Decisions
The lesson from Wells Fargo Bank’s outage is that a rules-based system is not enough to keep IT systems running. IT operations staff need additional visibility, external information, an understanding of probability and likelihood of underpinning failures, and the most common path to what caused a given type of alert.
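As a rough illustration of what “probability and likelihood” can mean in practice, the Python sketch below ranks candidate causes for an alert type by how often each one turned out to be the real cause in past incidents. The incident history, the alert and cause names, and the `rank_causes` function are all made up for this example; a real system would draw on far richer data.

```python
from collections import Counter

# Hypothetical incident history: (alert type, confirmed root cause).
history = (
    [("rack_unreachable", "switch_failure")] * 6
    + [("rack_unreachable", "power_loss")] * 3
    + [("rack_unreachable", "cabling_fault")] * 1
)


def rank_causes(alert_type, history):
    """Return (cause, relative frequency) pairs, most frequent cause first."""
    counts = Counter(cause for alert, cause in history if alert == alert_type)
    total = sum(counts.values())
    return [(cause, n / total) for cause, n in counts.most_common()]


ranking = rank_causes("rack_unreachable", history)
for cause, share in ranking:
    print(f"{cause}: {share:.0%}")
```

A ranking like this does not decide the root cause, but it tells an operator where to look first instead of treating every candidate as equally likely.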
Ironically, IT people know about this. Those with rules-based systems are awash in symptomatic alerts that don’t mean anything. For IT Ops, it’s the Grand Operational Challenge. Too many organizations ignore this problem and hope it will go away; meanwhile they endure poor quality of service. Others continue to act out the definition of insanity by doing the same thing over and over, expecting a different result. They try one rules-based solution after another in a hopeless effort.
The visionaries are pushing beyond legacy solutions with AIOps. Instead of grappling with undecidable problems, AIOps steps around the quicksand of baked-in cause ambiguity.
AIOps is independent of the rigid, brittle logic of rules. Instead of forcing IT Ops professionals to manually piece together data that falls outside the limited scope of rules, an AIOps system can ingest all the operational data in your enterprise and automatically apply algorithms to determine which events matter and which do not. Unlike a rules-based system, AIOps teaches itself without having to account in advance for every input and output.
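To give a flavor of deciding which events matter without hand-written rules, here is a minimal Python sketch of one simple, well-known heuristic: score each event type by how rare it is in the stream, so that constant chatter scores low and unusual events stand out. This is an assumption chosen for illustration, not a description of how any particular AIOps product works, and the event names and `score_events` function are hypothetical.

```python
import math
from collections import Counter


def score_events(events):
    """Score each event type by its surprisal, -log2(relative frequency).
    Frequent events score near zero; rare events score high."""
    counts = Counter(events)
    total = len(events)
    return {event: -math.log2(n / total) for event, n in counts.items()}


# Hypothetical event stream dominated by repetitive symptomatic alerts.
events = (
    ["heartbeat timeout"] * 96
    + ["disk latency warning"] * 3
    + ["smoke detector tripped"] * 1
)

scores = score_events(events)
for event, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:5.2f}  {event}")
```

Nothing in the scoring names a specific event or dependency in advance; the ordering comes from the data itself, which is the contrast with a rule set that must anticipate every input.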
If your organization relies on rules to solve undecidable problems that underpin performance of the company’s heartbeat – namely its IT systems and applications – now is the time to evaluate the wisdom of that path. As the victims have learned in our recent example, having the wrong tool for the real problem is a fool’s errand with nasty results.
Read the previous blog in this series: The Teeny-Tiny Scope of Rules
Read the next blog in this series: AIOps Liberates IT beyond the Antiquity of Rules
About the author
Phil’s passion has been IT operational management ever since he co-founded OTT (better known as Micromuse). Having also invented Netcool and built RiverSoft to a successful IPO, Phil now leads the next big revolution in IT event management with Moogsoft, where he maintains a passionate commitment to innovation, including personally leading the company’s numerous product functions.