Moving Beyond Monitoring: The Case for Observability and the Role of AIOps
Tuesday, August 21, 2018
Observability and causality are connected, and the Moogsoft AIOps Platform enables enterprises to precisely observe system behavior.
If you look at recent discussions across the web, the concept of observability has been reduced to a list of data types — logs, metrics, and traces — which a monitoring system must ingest in order to support the needs of a modern IT operations practice. This demotion, unfortunately, misses the point, and it is not just a question of semantics. If our goal is to deploy a monitoring capability suitable for IT operations, then being able to ingest these three data types is far from sufficient.
Nor is it even necessary. The concept of observability is deeply intertwined with the concept of causality, and the goal should be to deploy monitoring systems that lead to effective causal analysis of the systems being observed.
Observability matters to IT operations practitioners tasked with supporting digital business, and achieving true observability requires specific architectures and algorithms. Let me explain.
Digital Business Processes are Sequences of IT System State Changes
Moogsoft has rethought monitoring, or, better yet, revisited the role that monitoring is intended to perform, in a manner far more appropriate to the complexity, velocity, and criticality of the modern IT estate in its support of digital business.
To provide context for the innovations that Moogsoft is delivering to the market, it is helpful to review some ideas that are making the rounds in a number of IT-focused communities. The concerns about the limitations of traditional monitoring technologies are widespread and some pundits and industry thinkers have proposed replacing the goal of monitoring IT systems with the goal of observing IT systems. These same opinion makers then often go on to define observability in terms of ingesting three data types: logs, metrics, and traces. Now, using a broad array of data types is indeed a good thing. However, using such a broad array of data will not bring an IT operations professional closer to understanding the actual sequence of system state changes which underlie the execution of a digital business process.
Ingestion of data that is ultimately redundant brings forth little value about the health of the business process or the underlying IT system supporting that process.
The concept of observability first arose in the context of mathematical Control Theory. It starts with the idea that we are interested in being able to determine the actual sequence of state changes that either a deterministic or stochastic mechanism is going through during a given time period.
Now, the problem is that, in many cases, we do not have direct access to the mechanism. We cannot directly observe the state changes and so cannot sketch out the sequence. Instead, we need to rely on data or signals generated by the system (and perhaps the surrounding environment) and then follow some kind of procedure to infer the state-change sequence from the data.
Note that the ability to go from the data set to the state-change sequence is a property of the mechanism itself or, at worst, the mechanism and its environment. We say, then, that a mechanism is observable precisely if it and its environment generate a data set for which there exists some procedure that allows for the correct inference of the state-change sequence executing while the data set is being generated.
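To make this definition concrete, here is a toy sketch in Python. Assume (purely for illustration) a two-state system, "healthy" or "degraded", that we cannot inspect directly and can only watch through discretized latency readings. The system is observable in the sense above if a procedure exists that recovers the state-change sequence from the data; Viterbi decoding of a small hidden Markov model is one such procedure. All probabilities here are invented for the example.

```python
# Toy illustration of observability in the control-theoretic sense:
# a hidden two-state system observed only through noisy latency data.
# Observability = a procedure exists (here, Viterbi decoding) that
# infers the state-change sequence from the data alone.

STATES = ["healthy", "degraded"]

# Invented model parameters for the sketch.
start = {"healthy": 0.8, "degraded": 0.2}
trans = {"healthy": {"healthy": 0.9, "degraded": 0.1},
         "degraded": {"healthy": 0.3, "degraded": 0.7}}
emit = {"healthy": {"fast": 0.85, "slow": 0.15},
        "degraded": {"fast": 0.2, "slow": 0.8}}

def infer_states(observations):
    """Return the most likely hidden state sequence given the data."""
    # v[s] = probability of the best path ending in state s
    v = {s: start[s] * emit[s][observations[0]] for s in STATES}
    paths = {s: [s] for s in STATES}
    for o in observations[1:]:
        v_next, paths_next = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: v[p] * trans[p][s])
            v_next[s] = v[prev] * trans[prev][s] * emit[s][o]
            paths_next[s] = paths[prev] + [s]
        v, paths = v_next, paths_next
    best = max(STATES, key=lambda s: v[s])
    return paths[best]

data = ["fast", "fast", "slow", "slow", "slow", "fast"]
print(infer_states(data))
# → ['healthy', 'healthy', 'degraded', 'degraded', 'degraded', 'healthy']
```

The point of the sketch is the shape of the problem, not the particular algorithm: the procedure takes only the generated data as input and yields the sequence of state changes, which is exactly what the definition of observability demands.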
Traditional monitoring systems have not been concerned with observability. They have focused on capturing, storing, and presenting data generated by underlying IT systems and have left it largely up to human operators to make any inferences regarding what the data reveals about the underlying IT systems. This was sometimes masked by the concurrent use of topology diagrams, CMDB data models, and other attempts to represent the IT estate but it is important to keep in mind that, in most cases, these models were either developed manually or generated via some other procedure that was independent of the data being ingested.
In the best of circumstances, these system representations provided interpretive context for the data captured by the monitoring system. Sometimes the data could lead to modifications of the system representations, but there were no algorithms or processes that led directly from the data to the system representations.
Now, many newer technologies are trying to take the situation to the next level. Not only do they ingest data but they also actively seek out patterns and anomalies in the data they are ingesting. Nonetheless, they still fall short of enabling true observability of the systems they are concerned with. Why? Because the patterns and anomalies sought after are precisely statistical properties of the data sets themselves. They are not attempts to move beyond the data to the system that generated the data. Put another way, the patterns and anomalies have to do with normalities of correlation and occasional departures from those normalities. They do not capture the causal relationships which support the actual state changes within the IT system itself.
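A minimal sketch makes the limitation plain. The rolling z-score detector below (a common anomaly-detection baseline, not any particular vendor's method) flags departures from a metric's usual behavior. Everything it computes is a statistical property of the data set itself; no model of the system that generated the data is involved, so a flagged point says nothing about cause.

```python
# Rolling z-score anomaly detection: flags departures from
# correlational normality in the data, with no reference to the
# system that produced the data.
from statistics import mean, stdev

def zscore_anomalies(series, window=5, threshold=3.0):
    """Flag indices whose value deviates more than `threshold`
    standard deviations from the mean of the preceding window."""
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu, sigma = mean(hist), stdev(hist)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

cpu = [41, 43, 42, 40, 44, 42, 43, 95, 41, 42]
print(zscore_anomalies(cpu))  # → [7]: the spike is flagged
```

The detector tells you that index 7 is unusual relative to the data's own history. It cannot tell you what state change in the underlying system produced the spike, which is precisely the gap between pattern detection and observability.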
A Few Words on Causality
Let’s spend a few words on what I mean by causality, how it differs from correlational normality, and how all of this is connected to IT system state changes. Think of two events captured by two data items — say, for example, the usage of a CPU standing at 90% and end-user response time for a given application clocked at 3 seconds. When one occurs, the other occurs. Recognizing the fact that both events always accompany one another, as reflected in the data, would be an instance of a correlational normality. It is not, however, necessarily an instance of a causal relationship. For causality to be at work here, it would also have to be shown that an intervention lowering the level of CPU usage to, perhaps, 80% would impact the response time in some way, maybe shortening it to 2 seconds. In other words, a causal connection between two events is demonstrated by showing that an intervention changing one event will result (without further intervention) in a change to the other event.
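The simulation below is a hedged illustration of this intervention test, not a claim about any real system. Assume, for the example, that CPU usage and response time are both driven by a hidden confounder, request load: the two metrics correlate strongly, yet neither causes the other. Passive observation shows the correlation; intervening on CPU usage (forcing it lower) leaves response time unchanged, exposing the absence of a causal link.

```python
# Correlation vs. causation under intervention, in a simulated system
# where request load (a confounder) drives both CPU usage and response
# time. All numbers are invented for the illustration.
import random

def system(load, cpu_override=None):
    """One sample of the simulated system at a given request load."""
    cpu = cpu_override if cpu_override is not None else 30 + 0.6 * load
    # Response time depends on load only -- NOT on CPU usage.
    response_ms = 500 + 25 * load + random.gauss(0, 10)
    return cpu, response_ms

random.seed(0)

# Passive observation: when load is high, CPU and response time are
# high together, so the two metrics correlate.
obs = [system(load=random.choice([20, 100])) for _ in range(1000)]
high_cpu = [r for c, r in obs if c > 60]
low_cpu = [r for c, r in obs if c <= 60]
print(sum(high_cpu) / len(high_cpu) > sum(low_cpu) / len(low_cpu))  # → True

# Intervention: force CPU down to 30% while load stays high.
# Response time does not improve, so the correlation was not causal.
interv = [system(load=100, cpu_override=30)[1] for _ in range(1000)]
print(abs(sum(interv) / len(interv) - sum(high_cpu) / len(high_cpu)) < 50)  # → True
```

Intervening, rather than merely watching, is what separates a causal claim from a correlational one.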
This is a problem, particularly when it comes to establishing causal relations that link IT system events and, by extension, digital business process events. Few businesses will take kindly to a piece of enterprise management software “conducting an experiment” on a production system just to establish causality! Let’s suppose for a moment, however, that there is a way to achieve knowledge of causality from the data generated by a system. System state changes are, in fact, events linked by causality, and a given sequence of such changes is, in fact, a causal chain. Causality depends upon the fact that an intervention on one component in an environment brings about a change to another component in that same environment. Now, if there is a mechanism that moves an IT system from one state to another, there must be some way of modifying the first event so that the second event will also be modified as an automatic consequence. Hence, if causality can be established, or at least approached closely, an understanding of system state changes will have been obtained. In other words, the system will have been observed and not just monitored.
Intervention, of course, is the great stumbling block. I will post a follow-up to this note outlining some of the techniques through which knowledge of causality can be approached without the need for intervention, and I will make the case that many of those techniques lie at the heart of the Moogsoft AIOps Platform.
Moogsoft is a pioneer and leading provider of AIOps solutions that help IT teams work faster and smarter. With patented AI analyzing billions of events daily across the world’s most complex IT environments, the Moogsoft AIOps platform helps the world’s top enterprises avoid outages, automate service assurance, and accelerate digital transformation initiatives.