Achieving the Observability Imperative Requires AI
Will Cappelli | January 25, 2021

Tracking the evolution of IT infrastructure and the tools that monitor it uncovers a need to automate the discovery of actionable insights from increasingly granular data.

The shift to Observability

Over the last six months, unified monitoring, log management, and event management vendors have reoriented their technology portfolios (often without any change to the underlying functionality) towards Observability. In so doing, they have generated a fair amount of confusion in the market. IT Operations and Service Management (ITOSM) professionals, on the one hand, wonder whether the new terminology signifies something truly novel that responds to new requirements or whether it is instead yet another attempt to use language and marketing to stay relevant without going through the hard work of technology change. DevOps-oriented Software Engineers (SWEs) and Site Reliability Engineers (SREs), on the other hand, historically at the forefront of the demand for Observability, are treating this recasting of what appear to be legacy technologies with a great deal of skepticism. So two questions naturally arise: is there any validity to the idea of repurposing (perhaps with functional modification) traditional monitoring technologies to make them more suitable to the demands of Observability; and what steps should these vendors take to make that repurposing meaningful and successful?

APM, DevOps, and Observability

Let’s first get an understanding of what Observability means in this context. The term itself originates in the mathematical discipline of Control Theory, where it designates a property of a system. A system is said to be observable if, during the course of its transitions from state to state, it generates data in sufficient amounts and of sufficient quality that someone with access to that data can determine how the system’s states depend on one another causally. So what does that definition have to do with DevOps and the sudden interest in Observability among monitoring, log management, and event management vendors?
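For readers curious about the formal version of that definition, here is a minimal sketch of the classical rank test from Control Theory for a linear system (the function name and the example system are illustrative, not from any particular monitoring product): a system x' = Ax with sensor readings y = Cx is observable exactly when the stacked matrix [C; CA; ...; CA^(n-1)] has full rank, i.e. when the outputs carry enough information to reconstruct every internal state.

```python
import numpy as np

def is_observable(A: np.ndarray, C: np.ndarray) -> bool:
    """Check the classical observability rank condition for x' = Ax, y = Cx.

    The system is observable iff the observability matrix
    O = [C; CA; CA^2; ...; CA^(n-1)] has full rank n.
    """
    n = A.shape[0]
    blocks = [C @ np.linalg.matrix_power(A, k) for k in range(n)]
    O = np.vstack(blocks)
    return bool(np.linalg.matrix_rank(O) == n)

# Illustrative example: two coupled internal states, one sensor.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])

C = np.array([[1.0, 0.0]])        # sensor reads only the first state
print(is_observable(A, C))        # True: x2 influences x1, so it is recoverable

C_blind = np.array([[0.0, 1.0]])  # sensor reads only the second state
print(is_observable(A, C_blind))  # False: x1 never affects the output
```

The second case is the situation the rest of this article worries about in practice: the system keeps changing state, but the data it emits is not rich enough for anyone to reconstruct what is causally going on inside.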

The story is a bit complex but it is worthwhile tracking the steps of the history here. In the 2013-2015 time frame, the Application Performance Monitoring (APM) market was at a point of peak influence in the ITOSM community. Business digitalization meant that IT had become critical to business success, and applications had become the primary link between businesses, their customers, and IT. Hence, the tools developed to monitor those applications came to be considered of strategic importance. These tools, although varying with regard to the details, performed the same general functions. Application event data, end user experience latency metrics, and transaction traces were ingested, usually at intervals measured in minutes, graphed and plotted, and presented to application managers for analysis and problem diagnosis. Unfortunately, at the very same time, market pressures drove businesses to demand much greater agility on the part of application developers and, most importantly, a radical acceleration in the velocity with which new digital capabilities were brought into production. This demand revolutionised thinking about software development and the relationship between development and production environment management. As a side effect, it also revolutionised the architecture of the applications being developed and the infrastructure supporting those applications.

First, applications came to be built out of smaller and increasingly independent components. Second, the speed with which new components were added and old components removed increased by at least an order of magnitude. Third, the boundaries between application-level functionality and infrastructure-level functionality blurred. And, fourth, and most importantly, the rate at which state changes took place within these applications also increased by at least one and, in many cases, two orders of magnitude. In other words, system-related events were now following one another in microseconds rather than seconds or minutes. So when the DevOps community evaluated the APM products being used by their ITOSM colleagues, they quickly realised that the space-time scales at which those products operated were far too coarse and slow for the systems SWEs were crafting and SREs were managing. In short, the APM technologies were failing to make DevOps-originated applications observable.

Why Observability became an imperative

And, indeed, the DevOps community was right. Almost overnight, the world had changed and a mini-technology revolution was required to make Observability a reality. The first step, and one already taken to some degree by the monitoring, log management, and event management vendors, was to recognize that the data feeds their tools worked with needed to be supplemented (if not replaced) by data feeds with ingestion rates that came close to matching the state change rates of the underlying systems. In other words, the data feeds had to become a lot more granular and low-level, derived directly from the underlying telemetry without any intervening layer of structure. This requirement has been largely satisfied by the shift to monitoring systems based upon metrics, logs, and (with some caveats we can discuss another time) traces.

A missing piece?

There is a problem, however, and this has been less widely appreciated by either the DevOps community or the APM vendors. The space-time granularity issue which renders traditional APM technologies inadequate for the DevOps world also makes reports based on metrics, logs, and traces for the most part unintelligible even to the most knowledgeable and observant analyst, whether that analyst comes from the DevOps or the ITOSM side of the house. The only way meaningful patterns can be discovered in these data feeds is if some kind of AI or machine learning is deployed. Not only must data be ingested at microsecond time scales but insights, likewise, must occur in microseconds. (Note, as an aside, that this constraint pretty much excludes neural networks, shallow or deep, from consideration here. Other types of AI or ML must be brought to bear.) In summary, then, when surveying the Observability tools on offer from the existing vendor community, at least two features need to be present. First, the technology must rely on highly granular, low-level data feeds; and, second, the vendor must aggressively deploy AI to seek out patterns in those data feeds. If either of those components is missing, then the vendor is simply not rendering the systems which its tools are monitoring observable.
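As one hypothetical illustration of the kind of lightweight, non-neural technique that can keep pace with such ingestion rates, consider a streaming z-score detector: it maintains running statistics in constant time per sample using Welford's online algorithm, needs no training pass, and flags outliers the moment they arrive. The class name, threshold, and sample data below are invented for this sketch and do not come from any particular vendor's product.

```python
import math

class StreamingZScore:
    """Constant-time-per-sample anomaly flagger using Welford's online
    algorithm for the running mean and variance. No training pass and no
    neural network, so it is cheap enough to run at telemetry rates."""

    def __init__(self, threshold: float = 4.0, warmup: int = 30):
        self.threshold = threshold
        self.warmup = warmup   # samples to observe before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations

    def update(self, x: float) -> bool:
        """Ingest one metric sample; return True if it looks anomalous."""
        anomalous = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                anomalous = True
        # Welford update of the running mean and variance
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

# Illustrative feed: steady latencies around 10-11 ms, then a 250 ms spike.
detector = StreamingZScore()
latencies = [10.0 + 0.5 * (i % 3) for i in range(100)] + [250.0]
flags = [detector.update(v) for v in latencies]
print(flags[-1])   # True: the spike is flagged the instant it arrives
```

The design point is the one the paragraph above makes: each sample is processed in a handful of arithmetic operations, so the insight arrives on the same time scale as the data, which is precisely what batch-trained models struggle to do.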

One last point

One last point is worth making. The Observability challenge is not just a matter for the newer systems built by an enterprise DevOps team. The truth is that, in a very short period of time, thanks in large part to their highly granular and dynamic architectures, these systems have become inextricably intertwined with the coarser, slower traditional systems. Outages and other performance issues which manifest in services directly provided by the more traditional systems can now have their origins in microsecond-latency state changes taking place in the DevOps-originated code. The implication, of course, is that for all intents and purposes even the enterprise ITOSM community needs to shift to the new space-time scales brought into existence by DevOps. This means that any monitoring tools that cannot operate at these new space-time scales will not be fit for purpose, even if they are not explicitly targeted at newer applications and infrastructures. It was once famously said of economists, ‘We are all Keynesians now.’ Similarly, it can be said of all ITOSM vendors, ‘We are all Observability vendors now.’


About the author


Will Cappelli

Will studied math and philosophy at university, has been involved in the IT industry for over 30 years, and for most of his professional life has focused on both AI and IT operations management technology and practices. As an analyst at Gartner, he was widely credited as the first to define the AIOps market before joining Moogsoft as Field CTO. In his spare time, he dabbles in ancient languages.
