How AIOps Keeps IT “Cattle” Running Smoothly

When it comes to deploying IT infrastructure for a particular purpose, it is rare these days to deal with the bare metal – or indeed even to have access to it. At Monitorama Amsterdam, Marcus Barczak from Fastly joked that he was fully up to speed with the Nineties – but that is for a very specific application. In most cases and in most industries, the mantra of “cattle, not pets” has been fully internalized. We at Moogsoft are, of course, all about the cattle, but this goes a bit deeper than just our bovine mascot.

To unpack the analogy a bit further, the whole idea is to set up IT infrastructure based on large numbers of interchangeable units – a herd of cattle. Members of the herd can be swapped around with minimal disruption and without outsiders even noticing. Even to the people responsible for the overall well-being of the herd, individual animals are identified mainly by numbered tags.

In contrast, pets have individual names, are lovingly raised by hand, and if they get sick – far from being quickly substituted with any number of identical replacements – they are carefully nursed back to health. IT used to be based around this artisanal, labor-intensive model, with servers provisioned by hand and often by multiple different teams of specialists for specific roles and purposes.

In a pet-type world, monitoring is a matter of regularly checking in with the pets and making sure everything is okay. Any departure from the norm is a matter for concern and will trigger extensive investigation to make sure all is well. In other words, this model means active monitoring and a one-to-one ratio of alerts to incident tickets, each of which must be investigated by operators.

By contrast, in a cattle-type setup, the individual members of the herd are interesting mainly in aggregate: a herd of so many heads should produce a certain amount of milk. Data are still being gathered in this mode, often on a much larger scale than in the hands-on world of pets, but the definition of an event that would trigger a call-out to the vet is rather different.

In this world, CIOs’ priorities around monitoring can be summed up in three primary aspects: volume, speed, and accuracy.

Volume

Each unit of infrastructure is continuously producing large volumes of streaming data. In addition, the constant turnover as new elements are added, removed, or reassigned between roles generates its own data stream. Finally, the application layers that rely on all of this infrastructure are producing their own rich data flows. In this model, it is no longer possible to pre-define which events may or may not be interesting in the future, and to monitor only those events. Instead, the event management objective shifts to observing what is relevant: a small proportion of the data streams that are being generated all the time.

Speed

This is where the cattle analogy starts to break down, as cows are not known for being nimble. However, modern virtualized IT infrastructure moves extremely fast, with rates of change that were impossible when infrastructure was physical and provisioning might require operators to move actual atoms around, not just edit some bytes of configuration. This accelerated rate of change means that no delay can be allowed, as operators need to be able to understand events quickly and react appropriately.

Accuracy

The final aspect is accuracy – events that are forwarded to operators must be real and actionable, constituting an accurate description of the state and behavior of the system. Inaccurate alerting may lead to either false negatives or false positives. False negatives occur when operators are not informed of issues until it is too late; in other words, when a user has already been impacted. Even today, surveys indicate that over half of all IT issues are first reported by users.

The opposite case is the false positive, when operators are alerted to something that is not a real issue, or is not in their area of responsibility. This variety of incident management spam causes duplicate work and wasted effort, and may also trigger unnecessary escalations to experts, which multiply the impact and cost.
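One way to put numbers on these two failure modes is to track precision (how many alerts were real) and recall (how many real issues raised an alert). The sketch below is only an illustration: the `AlertingStats` type and the counts used with it are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertingStats:
    """Hypothetical tally of alerting outcomes over some review period."""
    true_positives: int   # alerts that pointed at a real, actionable issue
    false_positives: int  # alerts with no real issue behind them (spam)
    false_negatives: int  # real issues that raised no alert (often user-reported)


def precision(stats: AlertingStats) -> float:
    """Share of alerts that were real: low precision means many false positives."""
    raised = stats.true_positives + stats.false_positives
    return stats.true_positives / raised if raised else 0.0


def recall(stats: AlertingStats) -> float:
    """Share of real issues that were alerted on: low recall means many false negatives."""
    real = stats.true_positives + stats.false_negatives
    return stats.true_positives / real if real else 0.0
```

With made-up counts of 40 real alerts, 60 spurious alerts, and 10 missed issues, `precision` comes out at 0.4 and `recall` at 0.8: most issues are caught, but more than half of what operators see is noise.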

AIOps Helps Manage Huge Data Volumes, at Speed and with Accuracy

The answer to the CIO’s requirements in terms of volume, speed, and accuracy is to adopt real-time algorithmic analysis of monitoring data. Because AIOps performs a first pass of massive noise reduction and significance filtering before ever triggering a response by operators, it is capable of dealing with the huge volumes of data inherent to modern infrastructures: cloud, serverless, IoT, SDN/NFV/5G.
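As a rough illustration of what a first pass of noise reduction can look like, the sketch below collapses repeated events into a single entry with a repeat count. It is a deliberately simple stand-in, not a description of how any AIOps product works, and the `source` and `check` field names are assumptions made for the example.

```python
from collections import Counter


def deduplicate(events):
    """Collapse repeated (source, check) events into one entry with a count.

    `events` is an iterable of dicts with hypothetical "source" and "check"
    keys. Entries are returned in order of first appearance.
    """
    counts = Counter((event["source"], event["check"]) for event in events)
    return [
        {"source": source, "check": check, "count": count}
        for (source, check), count in counts.items()
    ]
```

Even this naive pass shows the principle: a unit that repeats the same complaint every few seconds becomes one line for an operator to read rather than hundreds.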

Because it operates in real time, and because it can deliver early warnings of developing issues by identifying correlations even between alerts that have low individual priority or severity, AIOps can deliver the fast, proactive interventions that mitigate or even prevent impact on end users. Finally, by bringing together both data and skills in one place, AIOps enables operators to take decisive action to resolve issues, based on a complete understanding of the problem and its impacts.
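The idea of correlating alerts that are individually low in severity can be sketched as follows. This is a minimal, assumption-laden example: the `Alert` fields, the grouping by a `service` key, and the 60-second gap are all hypothetical choices, and a simple time-gap rule stands in for whatever algorithms a real AIOps platform applies.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since some reference point
    source: str       # unit of infrastructure that raised the alert
    service: str      # hypothetical grouping key, e.g. the affected application
    severity: int     # 1 = low ... 5 = critical


def correlate(alerts, max_gap_seconds=60.0, min_sources=3):
    """Return bursts of alerts that look related.

    Alerts for the same service are sorted by time and split into bursts
    wherever the gap between consecutive alerts exceeds `max_gap_seconds`.
    A burst is kept only if it involves at least `min_sources` distinct
    sources, regardless of how low each alert's severity is.
    """
    by_service = defaultdict(list)
    for alert in alerts:
        by_service[alert.service].append(alert)

    def is_significant(burst):
        return len({a.source for a in burst}) >= min_sources

    clusters = []
    for service_alerts in by_service.values():
        service_alerts.sort(key=lambda a: a.timestamp)
        burst = [service_alerts[0]]
        for alert in service_alerts[1:]:
            if alert.timestamp - burst[-1].timestamp <= max_gap_seconds:
                burst.append(alert)
            else:
                if is_significant(burst):
                    clusters.append(burst)
                burst = [alert]
        if is_significant(burst):
            clusters.append(burst)
    return clusters
```

In this sketch, three severity-1 alerts from three different members of the herd within a minute of each other surface as one cluster, while a lone alert from a single unit does not. Severity is deliberately ignored when deciding what counts as significant.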

In IT, and at Moogsoft, we’re just trying to herd all of the various IT parts together and keep them healthy, productive, and pointed in roughly the same direction.