Why AIOps? Because End Users Are Your Incident Detection System (Part 1)
Mike Silvey | January 22, 2020
If your end users regularly report issues before your Operations team discovers them, you need AIOps for earlier detection, faster action, and more precise diagnostics.

The competitive pressure on businesses has forced application developers and infrastructure providers to become agile by moving from monolithic to modular architectures – in the case of software, by adopting a DevOps software development lifecycle methodology.

This has massively complicated the support of IT environments, creating the demand for continuous operational assurance and the need for ‘cross silo’ situation awareness.

In this two-part blog, we argue that specific AIOps techniques are necessary to truly transform the economics of operations and improve the customer experience by addressing the need to:

  • Identify incidents earlier
  • Make relevant stakeholders situation aware
  • Pinpoint causality and resolution actions

In this first part, we’ll set the scene and offer an AIOps approach for infrastructure operations and support. In part 2, we’ll focus specifically on applying AIOps to DevOps and modular software architectures.

The Sad Reality for Modern Operations, Support and DevOps

Complexity, event volume and blindness, and a lack of cross-silo situation awareness mean that end users report issues before Operations detects them. This is a disastrous scenario for any organization: customers are already unhappy and Operations has failed the SLA.

It’s all downhill from here. Since end users reported the issue, it’s classified as an “application” or “service” issue, so the ticket initially goes to application support or a software development (DevOps) team. However, the application is rarely the root cause of the issue.

And so, the application support team wastes time and effort attempting to diagnose the issue only to conclude it’s outside of their realm. Meanwhile, the clock is ticking and costs are rising.

Things typically don’t get better from here. The ticket is escalated across technology support silos until the appropriate team is identified, or an ‘all hands’ war room is urgently convened, as fingers are pointed and blame assigned. The clock keeps ticking and the costs keep rising.

Once the diagnostics effort begins, it’s rarely documented so that similar issues can be more quickly addressed in the future. Why? Pressure to move on to the next ticket.

The ticket is escalated to the next layer of support, where that operator unknowingly repeats their predecessor’s diagnostics effort. By now, the mean time to diagnose (MTTD) has doubled, the SLA is in tatters and the mean time to repair (MTTR) keeps climbing.

Even if the second-tier operator diagnoses and resolves the issue, they will rarely document their actions, as they’re eager to tackle the next ticket. This ensures the cycle of escalations will continue.

 

Improve MTTD and MTTR

 

The cost to operations can be represented by the equation below:

Operations Cost = (Resource Cost x Sum(Resource Time)) + (MTTR x Cost of Lost Opportunity(SLA time))
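To make the equation above concrete, here is a minimal Python sketch. The function name, the hourly units and the sample figures are all assumptions made for illustration. In particular, the cost of lost opportunity is modelled as a flat hourly rate, which is only one possible reading of "Cost of Lost Opportunity(SLA time)".

```python
def operations_cost(resource_cost_per_hour, resource_hours,
                    mttr_hours, lost_opportunity_per_hour):
    """Operations Cost = (Resource Cost x Sum(Resource Time))
                         + (MTTR x Cost of Lost Opportunity).

    Assumption: lost opportunity is a flat hourly rate.
    """
    # Labour: everyone who touched the incident, at one blended rate.
    labour = resource_cost_per_hour * sum(resource_hours)
    # Business impact: how long the service was impaired, priced per hour.
    business_impact = mttr_hours * lost_opportunity_per_hour
    return labour + business_impact


# Hypothetical incident: three engineers spend 2, 3 and 1.5 hours on it,
# and the service is impaired for 4 hours.
cost = operations_cost(
    resource_cost_per_hour=100.0,
    resource_hours=[2.0, 3.0, 1.5],
    mttr_hours=4.0,
    lost_opportunity_per_hour=5000.0,
)
print(cost)  # 20650.0
```

With these made-up numbers the business-impact term dwarfs the labour term, which is why shortening the time to repair tends to matter more than trimming headcount on any single ticket.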

The Operations and Support “Detect -> Act -> Diagnose -> Resolve” Treadmill

An ITIL-defined Incident can be broken down into four key areas, each of which offers an opportunity to reduce resource costs, business impact time, and the related consequences of dealing with a service- or application-impacting Incident:

  1. MTTDetect — Accelerating detection
  2. MTTAct — Finding appropriate stakeholders
  3. MTTDiagnose — Accelerating diagnosis
  4. MTTRemediate — Accelerating the time to resolve or fix the issue, when and if possible
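If your ticketing system records a timestamp at each handoff, the four phases above can be measured directly. The sketch below assumes five hypothetical timestamps per incident (`started`, `detected`, `acknowledged`, `diagnosed`, `resolved`). These field names and the sample incidents are illustrative and do not come from any particular tool.

```python
from datetime import datetime

# Each phase is the gap between two (hypothetical) incident timestamps.
PHASES = [
    ("MTTDetect", "started", "detected"),
    ("MTTAct", "detected", "acknowledged"),
    ("MTTDiagnose", "acknowledged", "diagnosed"),
    ("MTTRemediate", "diagnosed", "resolved"),
]


def phase_minutes(incident):
    """Minutes spent in each phase for a single incident."""
    return {
        name: (incident[end] - incident[start]).total_seconds() / 60
        for name, start, end in PHASES
    }


def mean_phase_minutes(incidents):
    """Mean minutes per phase across a list of incidents."""
    per_incident = [phase_minutes(i) for i in incidents]
    return {
        name: sum(p[name] for p in per_incident) / len(per_incident)
        for name, _, _ in PHASES
    }


# Two made-up incidents on the same day.
incidents = [
    {
        "started": datetime(2020, 1, 22, 9, 0),
        "detected": datetime(2020, 1, 22, 9, 20),
        "acknowledged": datetime(2020, 1, 22, 9, 35),
        "diagnosed": datetime(2020, 1, 22, 10, 35),
        "resolved": datetime(2020, 1, 22, 11, 0),
    },
    {
        "started": datetime(2020, 1, 22, 14, 0),
        "detected": datetime(2020, 1, 22, 14, 10),
        "acknowledged": datetime(2020, 1, 22, 14, 15),
        "diagnosed": datetime(2020, 1, 22, 14, 45),
        "resolved": datetime(2020, 1, 22, 15, 0),
    },
]

print(mean_phase_minutes(incidents))
# {'MTTDetect': 15.0, 'MTTAct': 10.0, 'MTTDiagnose': 45.0, 'MTTRemediate': 20.0}
```

Splitting the total this way shows which phase dominates. In this invented data it is diagnosis, not remediation.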

The Fundamental Issues Affecting the Economics of Operations

The factors involved are related to modern IT innovations, and include:

  1. Single faults are rarely the root cause of an incident, and the behaviour that causes incidents changes continually, because continuous application and infrastructure change leads to continuous change in event telemetry.
  2. In the Cloud and DevOps world, most event telemetry has no severity (importance) ranking assigned.
  3. Most event telemetry is noise or not useful, and too many of the incident alerts raised from metrics data do not indicate real incidents. Gartner and Forrester analysts put the share of noise at more than 90%.
  4. Monitoring agents are difficult to set up and maintain at agile scale for applications, servers, networks and more. They can produce so many metrics that it is difficult for Operations to know which metrics to set alert thresholds on, and whether the threshold is correct or not for a given target / source / time of day / periodicity.
  5. Application performance dashboards do not help DevOps and Operations detect incidents more quickly because operators, SREs and software engineers are flooded with alerts.

The Answer? Apply AI to Operations

The application of AI to Operations can allow reductions in time and resources engaged in the maintenance of services, but the AI should be carefully assessed to ensure that it can deliver the optimum economic value.

As discussed earlier, there are four main areas in the ITIL Incident Management guidelines where AIOps can help transform the economics of operations. I’d like to add a fifth one: Filtering. AI can be applied to the monitoring agents too, to automate the filtering of ‘signal’ (important alerts) from noise in both metrics data and log/state/change messages.

 

AI and IT Incident Detection

 

Illustrated below are the capabilities AI needs to offer operations in order to sustain efficiencies as enterprises migrate to agile working practices:

[Figure: AIOps and Agile Work Practices]

The aggregation of event telemetry (whether states or metrics) may help Operations by no longer requiring human intervention to rank or rate the importance of the incoming events.

The AI may automatically filter the <10% of ‘signal’ (important) alerts from the >90% of noise events, eliminating the need to store huge amounts of noise in a data warehouse, and to write brittle filtering rules.

The AI may be able to detect incidents as they are occurring, even when that particular incident behaviour has not previously been experienced, ensuring that the system works continuously.

The AI may be able to drive situation awareness, to ensure all stakeholders can act appropriately and efficiently. The causal stakeholders may launch their diagnostics tools or robotics-automation scripts, while the impacted stakeholders may inform their users about the issue.

Some vendors offer value at the monitoring edge, offering dynamic thresholding of metrics data. However, this basic approach leads to too many alerts, which means operators are blinded by spam and consequently either ignore them or become over-reactive. The former reinforces end-users as the incident detection system and the latter adversely impacts developer and operator productivity.
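For readers unfamiliar with the term, dynamic thresholding in its simplest form looks something like the sketch below: each new sample is compared against a threshold derived from the samples just before it. This is a generic illustration, not any vendor's implementation. The function name, the window size and the multiplier `k` are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def dynamic_threshold_alerts(values, window=5, k=3.0):
    """Return the indices of samples that exceed
    mean + k * stdev of the preceding `window` samples."""
    recent = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(values):
        # Only judge a sample once a full window of history exists.
        if len(recent) == window:
            threshold = mean(recent) + k * pstdev(recent)
            if value > threshold:
                alerts.append(i)
        # Every sample, anomalous or not, joins the rolling baseline.
        recent.append(value)
    return alerts


# Hypothetical CPU utilisation samples (percent) with one spike.
cpu = [50, 52, 49, 51, 50, 51, 95, 50, 52]
print(dynamic_threshold_alerts(cpu))  # [6]
```

One metric with one spike yields one alert, which looks manageable. Run the same check independently across every metric on every host and each alert arrives with no context about the others, which is the flood described above.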

Then there are robotics automation providers, whose systems are trained to spot modelled patterns and execute remediation scripts. Others also add value at the diagnostics stage of the incident workflow.

So Why Deploy Moogsoft’s AIOps Platform First?

It’s all about the economics.

There are two central tenets which underpin the reduction of operations costs in the agile digital age:

  • The ability to continuously ingest event telemetry while it is continuously changing
  • The ability to detect and not miss actionable incidents in real-time while the behavior that causes those incidents is continuously changing

These central tenets are the foundation of the Moogsoft AIOps platform, which lets you slash the three highest costs of incident management through earlier detection, faster action, and more precise diagnostics.

A central concept of the Moogsoft AIOps platform is the Situation: A representation of incidents as they’re evolving. Incidents are pieces of monitored operational data, such as event logs, alerts or metrics, that reflect an anomalous event that merits attention.

Since operators act earlier and the platform’s Situation Room is the system of engagement (rather than a ticket), the operator documents their diagnosis and remediation action. This generates knowledge that helps future operators reduce support-escalation needs and the MTTResolve.

In this way, Moogsoft reduces both the business impact time (often allowing Incidents to be resolved before they become service impacting) and the number of resources engaged in the diagnostics and remediation activities.

 

Continuous Assurance

A breakdown of Moogsoft AI is as follows:

Moogsoft’s patented ingestion AI continually adapts to changing Event data, continually indicating deviating-metrics trends, and evaluating whether a given event message is important or noise. This allows Moogsoft to filter >90% of the irrelevant event telemetry that others force you to store in expensive data warehouses.

Moogsoft’s patented streaming-correlation AI automatically detects incidents across the incoming filtered alert telemetry, enabling Moogsoft’s patented Situation Awareness AI to help the appropriate operations resources to act faster, often before the incident becomes service impacting.
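To give a feel for what correlating a stream of alerts into incidents means, here is a deliberately naive sketch that groups alerts by arrival time alone. It is not Moogsoft's algorithm, which this post does not describe. The function name, the 60-second gap and the sample alerts are all assumptions for illustration.

```python
def group_by_time_gap(alerts, max_gap_seconds=60):
    """Group alerts into clusters: an alert joins the current cluster if it
    arrives within `max_gap_seconds` of the previous alert, otherwise it
    starts a new cluster.

    alerts: list of (timestamp_seconds, source, message) tuples.
    """
    clusters = []
    for alert in sorted(alerts, key=lambda a: a[0]):
        last_timestamp = clusters[-1][-1][0] if clusters else None
        if last_timestamp is not None and alert[0] - last_timestamp <= max_gap_seconds:
            clusters[-1].append(alert)
        else:
            clusters.append([alert])
    return clusters


# Hypothetical alerts from different silos.
alerts = [
    (0, "db01", "disk latency high"),
    (20, "app03", "request timeouts"),
    (50, "lb01", "5xx rate rising"),
    (600, "mail01", "queue growing"),
    (630, "mail01", "delivery delayed"),
]

for cluster in group_by_time_gap(alerts):
    print([source for _, source, _ in cluster])
# ['db01', 'app03', 'lb01']
# ['mail01', 'mail01']
```

Even this crude grouping turns five alerts into two candidate incidents and puts the database, application and load-balancer alerts in front of one operator. A real system would need far more than time proximity to avoid merging unrelated events.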

Moogsoft’s patented social collaboration AI enables faster diagnostics, while Moogsoft’s patented knowledge capture and recycling AI enables even faster diagnostics in the future with fewer resource tier ticket escalations.

When all these AI techniques are combined as they are in Moogsoft’s AIOps platform, the MTTR, resource effort and business impact are all significantly reduced, while the TCO of the monitoring platform is reduced and its value increased.

Talking AIOps Total Cost of Ownership and Value

 

Total Cost of Ownership

 

AIOps solutions have mostly similar licensing costs, so startup costs are broadly the same. However:

  • Predictive Service solutions can offer a high value, but costs rise along with the number of use cases serviced and the amount of collected data.
  • A similar argument can be made about Accelerated Diagnostics solutions, whose value decreases as deployments grow in size.
  • The same applies to Robotics Automation solutions, which are also slow to show their value because the cases for scripted automation take time to learn.

On the other hand, Moogsoft’s continuous assurance provides value quickly because it does not need to learn, and enables continuous change (infrastructure, platform and application deployment).

It also offers value at a departmental level, providing benefits like earlier detection, situation awareness and focused diagnostics. At an enterprise level, its benefits include earlier detection across stacks, situation awareness, focused diagnostics, and knowledge capture and recycling.

It provides these enterprise benefits without changing the existing user workflow and engagement tools (ITSM/notification etc.). Its costs stabilize because FTEs gain the capacity to take on more workload while the business continues to change.

The Added Benefit of Moogsoft’s AIOps Platform: Lowering the TCO and Increasing the Value of Your Existing Monitoring Tools

An additional value proposition of Moogsoft AIOps is that it increases the value of monitoring and APM tools, and lowers their cost of administration.

You can quickly get one agent running and producing metrics data. But it’s very hard to run 2,000-50,000 agents and decide which metrics to set thresholds on, and what threshold is appropriate for each target.

Moogsoft can create value from any metric data source, automatically (and continuously) learning the normal operating behaviour window and its periodicity, and then automatically creating alerts for deviations from that behaviour.
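As a rough illustration of what learning a normal operating window with periodicity can mean, the sketch below builds a separate baseline for each hour of the day and flags values that fall outside it. This is a generic technique shown under stated assumptions, not a description of Moogsoft's internals. The function names, the `k` multiplier and the `min_sigma` floor are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_hourly_baseline(samples):
    """Learn a (mean, stdev) baseline per hour of day.

    samples: iterable of (hour_of_day, value) pairs from historical data.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def is_deviation(baseline, hour, value, k=3.0, min_sigma=1.0):
    """True if `value` lies more than k standard deviations from the
    learned mean for that hour. `min_sigma` keeps the band from
    collapsing to zero on very steady metrics."""
    if hour not in baseline:
        return False  # no history for this hour, so nothing to compare against
    mu, sigma = baseline[hour]
    return abs(value - mu) > k * max(sigma, min_sigma)


# Made-up history: the metric is quiet at 03:00 and busy at 14:00.
history = [(3, 10), (3, 12), (3, 11), (14, 80), (14, 85), (14, 82)]
baseline = learn_hourly_baseline(history)

print(is_deviation(baseline, 14, 80))  # False: normal for mid-afternoon
print(is_deviation(baseline, 3, 80))   # True: far outside the 03:00 window
```

The same reading of 80 is unremarkable at 14:00 and anomalous at 03:00, which a single static threshold cannot express. Re-running `learn_hourly_baseline` on fresh data is one simple way to keep the window current as behaviour changes.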

This means that a monitoring or APM practitioner no longer needs to define which metrics to monitor and set thresholds upon, but can instead consume them all. Moogsoft’s incident detection AI will automatically show, for example, that CPU usage has deviated from normal on all app servers, combined with an alert (from logs) showing that a new buggy codeline was recently deployed, and that the incident is probably caused by it.

Don’t miss the second part of this post, where we’ll focus on applying AIOps to DevOps and modular software architectures, and explain how AIOps helps DevOps teams in multiple ways, including by boosting developer productivity and increasing the CI/CD cycle frequency.

Moogsoft is a pioneer and leading provider of AIOps solutions that help IT teams work faster and smarter. With patented AI analyzing billions of events daily across the world’s most complex IT environments, the Moogsoft AIOps platform helps the world’s top enterprises avoid outages, automate service assurance, and accelerate digital transformation initiatives.

About the author Mike Silvey

An expert in IT operational management and technology commercialization, Mike launched SunNet Manager in the UK for Sun Microsystems before founding an open systems service management business at Micromuse, where he brought several innovative service management tools into the European market (such as Remedy) and established key OEM relationships (Cisco, HP, Intel) that led to successful IPOs for both Micromuse and RiverSoft. Today, Mike is focused on scaling Moogsoft by overseeing strategic business relationships with key partners around the globe.
