- They have become more modular.
- They have become more distributed.
- They have become more dynamic.
- They have become more ephemeral.
The combined complexity gradient presents a major challenge to traditional monitoring technology.
Complexity and the Breakdown of Traditional Monitoring Approaches
These four dimensions of complexity mean that IT systems are composed of an ever-growing number of ever-more-differentiated, autonomous components. Historically, monitoring technology has specialized according to the nature of the components or objects being monitored. As networks, servers, storage, and applications were bought and implemented independently of one another, the monitoring technologies themselves fragmented into tools for network monitoring, infrastructure monitoring, storage monitoring, and eventually application monitoring. The rationale behind this specialization was twofold.
First, each component type generated distinctive data types within distinctive contexts, so it made sense to optimize data-ingestion technology for the environment from which data was being gathered. Second, the components themselves interacted with one another relatively infrequently and in highly predictable ways, so there was little need to integrate observations from one component with observations from another. Such synergy was certainly nice to have: an end-to-end view of how one's IT system operated. But the truth was that most performance problems in a particular component were caused by rogue state changes occurring within that component, and that component alone. Root cause analysis rarely required looking anywhere else.
Fast forward to the hyperscale and complexity of IT systems today. Modern IT infrastructure has undermined the original rationale for specialization. The number of disparate components has simply grown too large to support component-specific monitoring. Not only have architectural layers multiplied, but the function and behavior of components within each architectural layer (containerized or not) have become radically distinct from one another (see Figure 1). Knowledge of how one component behaves cannot be applied to another. The rules for interpreting the self-describing data generated by one component cannot be assumed to hold for another. Where four or five monitoring tools ingesting data from a single stack may once have been merely awkward, today's ten, twenty, or fifty different monitoring tools have become impossible to manage.
Moreover, because of this multiplication and differentiation, the interactions among components have become much more complex. Perhaps ironically, the interdependencies among components have deepened. This means that the root cause of a performance problem in one component frequently (indeed, usually) originates in a state change of one or more other components. A specialized, component-specific monitoring environment would yield only local information. That data would not give an IT Operations team enough insight to diagnose the reasons for the problem, let alone predict its future occurrence.
Two Varieties of Unified Monitoring
With the rationale for specialization now outmoded, a unified approach to monitoring has shifted from a nice-to-have to a must-have. Vendors have accommodated this shift in two ways.
First, they have provided visualization systems that gather data from a broad array of components and present it together within a single frame. This visual frame is typically organized according to a graphical model whose parameters the user can adjust. It is important to note that the model is not intended to provide insight directly into what the data means. Its sole purpose is to organize the gathered data so that the user can survey it more easily and generate their own insights. Visualization systems naturally work in real time or near-real time, and often include a replay capability that allows the user to rerun the flow of data across the screen.
Second, vendors have built polyglot data platforms capable of ingesting, storing, and providing access to a variety of data types, ranging from unstructured logs and metrics to, more recently, sentiment data. Generally, the strategy is to gather up data from multiple sources and domains and pour it into a vast, unified data lake. Users are then provided with query languages, visualization tools, and statistical analysis environments to help them navigate the lake. Unlike visualization systems, polyglot data platforms are based on the analysis of historical data. Proponents of this approach argue that whatever the cost in the time it takes for users to notice the data, it is more than compensated for by the easy access to historical context.
Both approaches are ultimately flawed. Certainly, each in its own way addresses the need to collect data from multiple, shifting sources and bring it together in one place for analysis. Both also recognize a fundamental reality of monitoring modern, complex, dynamic IT systems: reliance on domain-specific models or topologies to structure the data being collected is of limited use. What both approaches fail to address, however, are the consequences wrought by self-describing data and the patterns of modern IT system behavior: noise, redundancy, intricacy, and scale. Put another way, Unified Monitoring is a necessary step in the management of modern IT systems, but it is far from sufficient.
Achieving sufficiency will require the application of Artificial Intelligence for IT Operations, or AIOps.
Coping with Noise and Redundancy
It is widely recognized that volume of data, velocity of change, and variety of data types characterize the monitoring data sets a digital business needs to process. In the case of IT systems management, one must add two further characteristics: noise and redundancy. These are a natural consequence of the system architectures supporting digital business being modular, distributed, dynamic, and ephemeral. When a system component changes state, it typically sends a signal announcing the change to all connected components. Since the signal is essentially the same regardless of which component receives it, most of these signals end up being redundant. A highly modular system means not only many more components but also many more interconnections among them, and hence many more redundant messages that an IT Operations professional must deal with as part of their job.
From there, the situation gets even worse. Communication channels are almost always the main sources of data corruption. Since all these redundant signals from all these interconnected components traverse so many more channels, the probability of data corruption has skyrocketed. Even with Unified Monitoring, then, merely gathering all this data together in one place will not prove helpful on its own. Analysis is required to eliminate the noise and strip out the redundant data before IT Operations can even begin to determine whether an incident has occurred, what caused it, and what its consequences may be. Given monitoring volume, velocity, and variety, such analysis cannot be conducted efficiently and effectively by human beings without the assistance of AIOps.
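To make the redundancy-stripping step concrete, here is a minimal sketch of event deduplication. Everything in it is hypothetical: the event fields, the component names, and the five-second collapse window are invented for illustration, not drawn from any particular monitoring product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    source: str       # component that emitted the signal (hypothetical field)
    kind: str         # e.g. "state_change"
    timestamp: float  # seconds since epoch

def deduplicate(events, window=5.0):
    """Collapse redundant copies of the same signal.

    Events with the same (source, kind) arriving within `window` seconds
    of the first retained occurrence are treated as duplicates.
    """
    seen = {}     # (source, kind) -> timestamp of last retained occurrence
    unique = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.source, ev.kind)
        first = seen.get(key)
        if first is None or ev.timestamp - first > window:
            seen[key] = ev.timestamp
            unique.append(ev)
    return unique
```

Even this naive fingerprint-and-window approach shows why analysis must precede human triage: the fan-out of identical state-change signals multiplies raw event counts without adding information.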
What specific role does AIOps play here? Once the data has been cleansed, the real work begins: patterns must be discovered in the data that provide insight into the behavior of the IT system (see Figure 2). With so many components acting semi-independently of one another, the patterns that adequately describe system behavior are incredibly complex. To the human brain, even one assisted by powerful visualization tools, these patterns may be indistinguishable from random fluctuations.
Why not use statistical analysis tools? Because these tools typically ask the human analyst to choose from a library of comparatively simple probability distributions as a basis for understanding the data. Unfortunately, these distributions rarely match the distributions actually responsible for the data. In other words, even with cleansed data, the IT Operations Management function needs the support of AIOps, which can discover the patterns that genuinely inhere in the data, no matter how complex they may be.
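The mismatch between simple distributions and real operational data is easy to demonstrate. The sketch below is illustrative only: the two-regime cache scenario and the latency figures are invented. It fits the simplest possible model, a single mean and deviation, to bimodal latency data and counts how few actual samples sit near the fitted "typical" value.

```python
import random
import statistics

random.seed(42)  # deterministic for the example

# Invented latency samples (ms) from a service with two regimes:
# fast cache hits (~10 ms) and slow cache misses (~200 ms).
samples = ([random.gauss(10, 2) for _ in range(800)] +
           [random.gauss(200, 20) for _ in range(200)])

# Fit a single unimodal summary to the bimodal data.
mu = statistics.mean(samples)
sigma = statistics.stdev(samples)

# The fitted mean lands between the two modes, at a latency the
# service almost never actually exhibits.
near_mean = sum(1 for s in samples if abs(s - mu) < sigma / 4)
print(f"fitted mean: {mu:.0f} ms")
print(f"samples near the fitted mean: {near_mean} of {len(samples)}")
```

A unimodal summary of a multimodal system is not just imprecise: it reports a "normal" state the system never occupies. That is precisely why patterns must be discovered from the data itself rather than picked from a library of standard distributions.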
By combining AIOps and a Big Data approach with a Unified Monitoring solution, IT Operations can gain a deeper, real-time understanding of incident impacts and triggering events. In times of crisis, this data helps to:
- Identify the necessary teams to remediate the problem quickly
- Identify the possible root cause
- Enable automated recovery
- Drive down measures of MTT(x)
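As a toy illustration of the root-cause point above, the following sketch correlates alerting components against a dependency topology and nominates, as the likely origin, any alerting component whose own dependencies are all healthy. The three-tier dependency map and component names are invented, and this simple heuristic stands in for the far richer pattern discovery an AIOps platform would actually perform.

```python
def transitive_deps(component, graph):
    """All components that `component` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(component, []))
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(graph.get(c, []))
    return seen

def probable_root_causes(alerting, graph):
    """Alerting components with no alerting dependencies of their own.

    If everything a component relies on is healthy, its alert cannot be
    mere downstream fallout, so it is a candidate root cause.
    """
    return [c for c in alerting
            if not (transitive_deps(c, graph) & set(alerting))]

# Invented dependency map: component -> components it depends on.
DEPENDS_ON = {"web": ["app"], "app": ["db"], "db": []}
```

With web, app, and db all alerting, the heuristic singles out db: it is the only alerting component that does not depend on another alerting one, which also identifies the team that should be engaged first.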
The Way Forward with AIOps
The bottom line is that Unified Monitoring is indeed critical to taming the modular nature of modern system design. But data unification is only a first step. The next, crucial step is to set the power of AIOps data cleansing and pattern discovery loose on the vast data sets being accumulated. Only then can an IT Operations team effectively support today's digital business, with its reliance on hyper-complex infrastructure and application stacks.
About the author: Will Cappelli
Will studied math and philosophy at university, has been involved in the IT industry for over 30 years, and for most of his professional life has focused on AI and IT operations management technology and practices. As an analyst at Gartner, he is widely credited with being the first to define the AIOps market, and he recently joined Moogsoft as CTO, EMEA and VP of Product Strategy. In his spare time, he dabbles in ancient languages.