In a modular system, there are costs associated with consumption, observation, and remediation. Together these costs act as a tax on the agility of IT teams and systems.
An agile system must be highly modular. It must consist of many distinct components which are capable of responding to and acting upon the environment independently of one another. This allows the system to deal with a diversified and rapidly changing world and, at the same time, allows developers to easily and rapidly modify the system to cope with even greater anticipated diversity and change. Unfortunately, modularity comes with a number of costs.
The Costs of Modular Systems
First, there are resource consumption costs. Modularity often requires that the code and data are replicated across multiple modules and, as a result, means that processor and storage costs will be greater than they would be in a less modular setting. It also means that more time and energy will be spent on executing a given task since a specific computation carried out once in a less modular setting will be carried out multiple times. In other words, to support the execution of tasks in a modular system, an enterprise spends more on space, time, and energy.
Second, there are observational costs. Precisely because the components of a modular system act independently of one another, it is extremely difficult to infer the state of one component from the state of another component. Put another way, if you want to understand the behavior of a modular system, you will need to gather data from each component separately and cobble together the full picture from these isolated data sets.
At the same time, the amount of redundancy in the data generated by the system tends to increase because each component ends up recording environmental context information shared by a large subset of other components. Take time, for example. In a highly modular system, each component will carry its own clock with it and, in most cases, the time it is recording and stamping against its autonomous states will be shared by many of its near neighbors, if not all components in the system. In a less modular setting — a setting with far fewer components — there will be far fewer time recording instances.
So a greater level of pruning is required if one wants to observe the system effectively and store that information economically, and the data must still be gathered from across the system to ensure the observation is complete. Together these factors ensure that the cost of observation is greater than it would be in a less modular setting.
Then there are remediation costs. Causality in modular systems can be extremely complex. If the behaviour of an overall system does not go according to plan, it is usually the result of multiple failures taking place among the components that constitute it. The modularity itself militates against single points of failure and, while that may mean greater resiliency in general, it also means that when failure occurs, diagnosing and fixing the root causes of the problem become more difficult and costly.
Together these costs act as a tax on the added value attributable to the agility enabled by aggressive use of modularity in system design. Of course, it is difficult, if not impossible, to assess the magnitude of this tax but we do know where to look to ensure that the magnitude is as small as possible.
Dealing with the Cost Components
The resource consumption costs are pretty much a lost cause since they are a direct consequence of what it means to deploy a modular design. One can always require that developers exercise greater care when it comes to crafting individual components but fastidiousness about resource consumption works against the overall mindset of the modular system designer. Requirements along this dimension will, at best, be ignored and, at worst, distort the drive for agility which should be the system designer’s fundamental concern. That leaves the observation and remediation costs as candidates for minimization.
The level of observation costs is a function of two factors: first, the amount of data that must be consumed in order to identify an event and, second, the costs of the resources required to ingest the data and to make the event identification. Now, the amount of data that must be consumed is itself an almost inevitable consequence of modularity, and the only real way of reducing this cost factor is to deploy algorithms that are optimal in their ability to go from large, highly redundant data sets to smaller, information-rich data sets on the basis of which events may be identified.
Traditional rule-based deduplication algorithms, although popular, are generally quite poor at redundancy reduction. This is particularly true in modern environments, which change so rapidly that most rules purporting to weed out redundancy become outdated days, hours, minutes, even seconds after deployment. It is, in general, much better to take the path that Moogsoft has taken with its Entropy algorithm, which uses a mathematical function that works in real time on various properties of the data items themselves to prune down the data streams. As I said, however, large data volumes at the point of ingestion will not go away, no matter how clever the algorithm, so major cost savings will only be obtained if one can work some miracles regarding the actual resources which ingest the data and make the event identification.
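Moogsoft’s actual Entropy algorithm is proprietary, but the general idea of scoring data items by information content and pruning the low scorers can be sketched. The following is a minimal illustration only, assuming a simple Shannon-entropy score over an event’s message tokens; the scoring function, the threshold, and the example event strings are all illustrative assumptions, not Moogsoft’s implementation.

```python
import math
from collections import Counter

def shannon_entropy(message: str) -> float:
    """Shannon entropy (bits per token) of a message's token distribution.

    Highly repetitive, low-entropy messages are likely redundant;
    high-entropy messages carry more information per token.
    """
    tokens = message.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def prune(stream, threshold=1.5):
    """Keep only events whose entropy score clears the threshold."""
    return [msg for msg in stream if shannon_entropy(msg) >= threshold]

# Hypothetical event messages: one repetitive heartbeat, one informative alert.
events = [
    "heartbeat ok ok ok ok ok ok",
    "disk /dev/sda1 92% full on host db-03 raising alert",
]
kept = prune(events)  # only the informative message survives
```

Because the score is computed from the data item itself, no rule base needs to be maintained or kept up to date as the environment changes.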
There are basically two ways of dealing with the data ingestion and event identification problem. The first is to capture all of the data, store it in a database somewhere, and then set algorithms to work on the captured data set, separating out the redundancies. The redundancies themselves may be kept or tossed out but, whatever their fate, they no longer encumber the data being worked with so that events — hopefully at that point — are easily identified. The second way examines the data as it streams past the point of observation. While it streams, measures of redundancy are accumulated and then at various points, highly redundant data items are removed from the stream and the result is passed on for further analysis. The only storage involved is some kind of cache which holds onto the data while the redundancy metrics are being counted up. The cache is then flushed to make room for succeeding data items in the stream.
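The second, streaming way can be sketched as follows: a bounded cache accumulates redundancy counts over a fixed-size window, each distinct item is passed on once per window (annotated with its repeat count), and the cache is then flushed. The window size and the (value, count) output format here are illustrative assumptions.

```python
from collections import Counter

def stream_dedupe(stream, window_size=100):
    """Dedupe on the wire with a small, flushable cache.

    Counts repeats of each item over a fixed-size window, forwards each
    distinct item once per window together with its repeat count, then
    flushes the cache to make room for the next window of the stream.
    """
    cache = Counter()
    for i, item in enumerate(stream, start=1):
        cache[item] += 1
        if i % window_size == 0:          # window boundary: emit and flush
            yield from cache.items()
            cache.clear()
    yield from cache.items()              # drain the final partial window
```

Feeding 100 raw items through such a filter yields only as many outputs as there are distinct items per window, so a stream that is 90 to 99 percent redundant shrinks accordingly before anything touches storage.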
The first approach ratchets up the costs in a number of ways. Firstly, there are the opportunity costs attributable to the time it takes to pull the data from the source where it is first gathered to the database where it is stored. It is true that this time, in most real-world applications, is measured in seconds, but the world is speeding up. On the one hand, truly digital transactions (think algorithmic trading) take place in microseconds. On the other hand, the infrastructures supporting such transactions are increasingly being built from components (think containers) that have lifetimes measured in microseconds.
So for both application- and infrastructure-layer events, the passing of seconds before an event is identified and acknowledged can mean the failure to prevent significant outages or performance shortfalls. Secondly, there is the sheer cost of storage, whether one is talking about the storage space itself or the per-gigabyte ingestion costs that characterize many of today’s popular big data platforms. To underline how much of an issue this is, one must remember that a large percentage of what is being ingested, as much as 90 to 99 percent, is in fact redundant. Put another way, 90 to 99 percent of your storage-related costs are being spent on data that is more or less worthless.
The second approach suffers from neither of these expenditures. Data is processed on the wire and does not need to wait to reach storage. And while it is true that, historically, the data did have to be pulled from its source into a centralized stream before pruning and pattern discovery could take place, stream-focused (rather than database-focused) technologies are pushing the analytic operations closer and closer to the source.
How Moogsoft Handles the Agility Tax
Moogsoft provides an excellent example of such an evolution. The Moogsoft AIOps platform, while concentrating its analytic fire on data streams rather than databases, still requires data to be ported from its sources to centralized servers where the AI is actually applied. No unnecessary storage costs or time debts are built up due to the latency of getting the data into a database and, indeed, with certain types of data, such as log and event data, many of the critical patterns that need discovery are global in nature. One wants to understand, for example, how an event taking place within a JVM physically located in London is related to an event taking place in a database physically located in São Paulo. Nonetheless, where the patterns of interest are almost entirely local (as is the case with most metric and time-series data), precious microseconds are needlessly lost and some communications costs are incurred while the data is transported to the locale of analysis.
With Observe, our new product announced at the recent Moogsoft AIOps Symposium, we have pushed the AI out to the source where the data to be analyzed is generated. Specializing in time-series and metrics data, the Observe architecture allows an enterprise to take the biggest bite possible out of the observation cost dimension of the agility tax.
With regard to the remediation dimension, the automation of root cause analysis for complex environments would appear to provide the best way to drive costs down as far as possible. Given the large number of products that have claimed to support root cause analysis (and such claims have been made for at least 20 years), it would appear as if the remediation dimension is relatively easy to navigate. Unfortunately, this is not the case, as most of the claims about root cause analysis are misleading or just wrong. They tend to fall into two categories.
Misleading Claims About Root Cause Analysis
The first category confuses topology visualization with root cause analysis. The software that leads to this category of confusion provides the user with a visualization of network or infrastructure topology and associates events with the nodes from which they originate. Based upon very simple rules — for example, if node A is connected to nodes B and C and nodes B and C are not directly connected to one another and if events occur at A, B, and C, then the event at A can be said to cause the events at B and C — such software uses topological information to support causal inferences.
Of course, no direct understanding of causality is involved. There is just topology and event placement, but the technique does, in basic settings, give an idea of which events may be causally related to one another. Problems start to emerge when topologies become very complex (as is increasingly the case with most real-world topologies). On the one hand, determining which nodes are connected to which nodes, both directly and indirectly, becomes computationally prohibitive (i.e., it would take a computer the size of a planet to actually generate such a matrix of connections). On the other hand, when the topological patterns become complex, our intuitions about what constitutes a causal path are rather weak. If an event appears at a node after events appear at many nodes to which it is connected, many of which are connected to one another, intuitions and topology alone will not help us tease out what causes what. So topological visualization does not, in and of itself, deliver effective root cause analysis for modern environments.
The second category confuses correlated anomaly detection with root cause analysis. The idea here is for the software to ascertain what constitutes a normal pattern of behavior in a number of different domains and then watch for simultaneous anomalies or departures from those normal courses of behavior. When such anomalies are spotted and the user is alerted, the user calls on his or her domain knowledge to determine which of the anomalies is likely to be the signal of an underlying root cause. Although from a computational perspective, simultaneous anomaly detection is superior to topological visualization, it is probably even less informative when it comes to helping a user determine which, in a complex array of events, are indeed the root causes. For example, the various anomalies may represent multiple unconnected events or, more likely, they represent the common effects of some underlying set of root causes.
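The correlated-anomaly-detection approach described above amounts to flagging simultaneous departures from per-stream baselines, something like the following minimal sketch. The z-score baseline and the threshold are illustrative assumptions, not any particular vendor’s method.

```python
import statistics
from collections import Counter

def zscore_anomalies(series, threshold=3.0):
    """Indices where a value departs sharply from the series baseline."""
    mean = statistics.fmean(series)
    stdev = statistics.pstdev(series)
    if stdev == 0:
        return set()
    return {i for i, v in enumerate(series) if abs(v - mean) / stdev > threshold}

def simultaneous_anomalies(streams):
    """Time indices at which two or more independent streams are anomalous.

    This is all the technique delivers: coincidence in time. Deciding
    which coincident anomaly (if any) is a root cause is left to the user.
    """
    hits = Counter()
    for series in streams.values():
        for i in zscore_anomalies(series):
            hits[i] += 1
    return {i for i, n in hits.items() if n >= 2}
```

The sketch makes the limitation concrete: the output is a set of time indices, with no causal ordering among the streams that spiked together.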
The sad truth of the matter is that, despite two decades of claims regarding support for root cause analysis, most monitoring and event management technologies have provided suggestions based upon correlation and topology and have left the job of root cause analysis up to the IT Operations professional. From the perspective of the agility tax, this has meant little to no reduction. Furthermore, as IT systems have become more complex, the utility of the suggestions provided by the monitoring and event management technologies has declined precipitously.
Real Automated Root Cause Analysis
Moogsoft’s AIOps Platform, by contrast, has taken significant steps towards true automated root cause analysis. The analysis proceeds in two stages. First, data items from the information-rich data stream described above, including the anomalies plus context worked up by Observe, are correlated according to their relations in Time, their relative Topological positions, and their distance from one another based upon the content of the Text strings that describe them. The output of this stage is “packages” of correlated data items.
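As an illustration of what stage-one correlation along the time and text dimensions might look like, here is a deliberately simplified sketch that groups alerts which are close in time and whose descriptions share tokens. The greedy windowing, the Jaccard similarity measure, and the thresholds are my own illustrative assumptions, not Moogsoft’s algorithms (which also incorporate topology).

```python
def jaccard(a: str, b: str) -> float:
    """Token-set similarity between two alert descriptions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def correlate(alerts, window=60, min_similarity=0.3):
    """Greedy sketch of stage one: build 'packages' of related data items.

    alerts: list of (timestamp_seconds, description), assumed time-sorted.
    An alert joins an existing package if it arrives within `window`
    seconds of that package's latest member and its text is similar;
    otherwise it starts a new package.
    """
    packages = []
    for ts, text in alerts:
        for pkg in packages:
            last_ts, last_text = pkg[-1]
            if ts - last_ts <= window and jaccard(text, last_text) >= min_similarity:
                pkg.append((ts, text))
                break
        else:
            packages.append([(ts, text)])
    return packages
```

Each resulting package is a candidate unit for the second stage, where root cause analysis proper is performed.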
The next stage is where the automated root cause analysis properly takes place. Each package of correlated data items is examined by two algorithms. The first is a neural network-driven learning algorithm that investigates similar packages over time and determines, on the basis of relative variations over time, which data items are likely to signal the presence of events that cause events signaled by the presence of other data items in the package. The second algorithm is based on graph theory theorems discovered by Moogsoft CEO Phil Tee concerning a topic called Vertex Entropy.
Recall that we discussed the computational burden placed upon traditional topology visualization applications by the complexities of modern topologies. As indicated, this computational burden is a consequence of the need to determine in advance how every node in a topology is connected to every other node, a space-time complexity burden that grows combinatorially with the number of nodes in the topology.
And it gets even worse. Understanding causality frequently involves some kind of understanding of the length of the shortest path between any two nodes in a graph. (Causality usually involves an optimal transfer of energy from one node to another and a key aspect of that optimality is the length of the path between the node where the cause event occurs and the node where the effect event occurs.) Unfortunately, computing shortest paths between every pair of nodes is expensive: the standard all-pairs algorithms scale with the cube of the number of nodes, which means, more or less, that for a topology of any real size one is once again flirting with a computer the size of a planet.
Now, algorithms based upon vertex entropy circumvent both issues in an excitingly novel (and patent-pending) way. It turns out that one can determine how connected a node is to other parts of a topology based purely upon local conditions (examining, for example, the pattern of connections a node has to its immediately neighboring nodes). The result of this determination is, in fact, what is called the entropy of the node (or the vertex, to use the more graph-theory-friendly term).
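Tee’s Vertex Entropy formulation itself is patent-pending and not reproduced here, but the key property, namely that a node’s structural significance can be scored from purely local information, can be illustrated with a stand-in measure: the Shannon entropy of the degree distribution over a node’s immediate neighbors. This measure is my own illustrative assumption, not the actual Vertex Entropy formula.

```python
import math

def local_entropy(adjacency, node):
    """Illustrative local structural entropy of a node.

    Shannon entropy of the (normalized) degree distribution over the
    node's immediate neighbors. Everything required is local to the
    node's neighborhood, so no global connectivity matrix is needed.
    """
    neighbor_degrees = [len(adjacency[n]) for n in adjacency[node]]
    total = sum(neighbor_degrees)
    if total == 0:
        return 0.0
    return -sum((d / total) * math.log2(d / total) for d in neighbor_degrees)

def rank_nodes(adjacency):
    """Rank nodes by local entropy, most structurally significant first."""
    return sorted(adjacency, key=lambda n: local_entropy(adjacency, n), reverse=True)
```

In a hub-and-spoke topology, for instance, the hub scores highest, which is the kind of "most important node" signal the next paragraph relies on, and the score for each node is computed independently of all the others.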
With vertex entropy calculations in hand, we can get an excellent picture of the overall topology (since we know the degree of connectedness of each node), but we can build that picture node by node without worrying about the far larger number of links that connect the nodes. In other words, the computation grows linearly with the number of nodes in the topology, not combinatorially. With regard to the second driver of computational cost, vertex entropy shows which nodes are the most important nodes in the topology, i.e., those nodes which are most likely to be the sites where causally significant events occur. As a result, shortest-path analyses become superfluous. All one needs to know is that there is some path connecting a node with high vertex entropy to a node where an event of interest has occurred. If an earlier event has occurred at the high-entropy node, the probability is very high that the earlier event is among the root causes of the event of interest. So with the vertex entropy algorithm, Moogsoft is able to provide powerful automated root cause analysis without requiring a computer the size of a planet.
In summary, the agility tax, although difficult to measure, is a very real cost which enterprises must pay if they want to obtain the agility required by the demands of digital business. Two of its three components (observation and remediation) can be significantly reduced through the judicious application of AIOps functionality. But not all AIOps portfolios are the same. The patented algorithms by Moogsoft are the only ones that I am aware of that are capable of significantly reducing the agility tax without requiring vast computational resources.
About the author
Will studied math and philosophy at university, has been involved in the IT industry for over 30 years, and for most of his professional life has focused on both AI and IT operations management technology and practices. As an analyst at Gartner, he is widely credited with having been the first to define the AIOps market, and he has recently joined Moogsoft as CTO, EMEA and VP of Product Strategy. In his spare time, he dabbles in ancient languages.