The ITOM and ITSM Trough of Disillusionment

Tuesday, October 3, 2017

Disillusionment with AI for ITOps is the result of applying hastily sourced, afterthought technology in an attempt to solve modern network problems

I’ve just read a very interesting blog post, “CIOs: Don’t rush to AI to solve IT operations management problems,” by Dave Vellante of Wikibon and SiliconANGLE, which backs up, head-on, my point that IBM NOI doesn’t work.

Dave tells us that you shouldn’t believe all the hype you hear about the value of AI added to ITOM and ITSM tools. We at Moogsoft agree wholeheartedly.

As we’ve pointed out previously, vendors like IBM that simply tack AI on to their existing brittle tools just don’t offer the value. It’s the wrong approach, people. AI needs to be intrinsic to the realization of desired outcomes, not added on for the sake of jumping on the marketing bandwagon. (Eh, IBM?)

Dave points out that the CIOs they have talked to describe the problem this way: “IT is often criticized for being too slow to react to operations issues, and that has affected a laundry list of things in the business.”

That is because, as Forrester famously reported, 74% of IT issues are reported by the end users before Operations staff are aware of the issue.

Why is that? There are two reasons:

1. There are too many Alerts for operators to assess. In the traditional Enterprise operations and Telco OSS world, Alert overload, or outsourcing of Alert assessment for cost reductions, means that Operators only assess Critical Severity Alerts. Once an issue is critical, it’s probably already happened, hence the end users calling.

Then look at the new world. Take for example Netflix, which produces 21 trillion (!) events per day but queries less than 1% of that data. The point there is they ‘query.’ That is, after they know about an issue, again, after impact has occurred. None of this is proactive.

[Figure: 74% of issues are reported by end users before operations become aware (Forrester)]

2. Operations are organized into silos of support. When a user calls, which they do, the ticket is raised with the application support team(s). That team (if it is a DevOps team, that means all of the team members) investigates the issue. Remember, the customer or business is already impacted. Most of the time, according to Moogsoft customer surveys and our own application data, the application is not the cause of the incident; it is something else in the underlying infrastructure. In fact, Moogsoft’s customer research shows that 86% of the time the application is not the cause of the incident, yet the application teams receive the first (and most of the) tickets. Then those tickets are bounced around the silos of support, often resulting in “All Hands” war rooms which sometimes include the CIO!

As Dave points out, “The impact of these and other issues is increased business risk and cost.” Workflow is the problem here. No early warning. No situation awareness. No collaboration.

Dave then goes on to point out that, “Practitioners in the Wikibon community express a desire to be proactive to address these issues, and vendors are promising that their tools will allow them to be more anticipatory… Customers want to reduce false positives and minimize the number of trivial events they must chase.”

In other words, operations teams need to know about situations that will have the potential to impact service — or are already impacting the business or customers, as those incidents are evolving — in order to act earlier, be able to focus the appropriate resources on service restoration (or impact avoidance), and ensure that the stakeholders who may be impacted by the incident are situation-aware in real time. All this must occur while reducing the incoming data down from noise to signal.
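As a rough illustration of what reducing noise to signal can mean at its simplest, the sketch below collapses repeated alerts into one row per source and check. It is a minimal example under stated assumptions, not Moogsoft's algorithm: the `Alert` fields and the `deduplicate` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical alert shape, chosen for illustration only.
    timestamp: float  # seconds since epoch
    source: str       # host or service that raised the alert
    check: str        # what was measured, e.g. "disk_latency"
    severity: str     # e.g. "minor", "major", "critical"


def deduplicate(alerts):
    """Collapse repeats of the same (source, check) pair into one summary row."""
    summary = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.source, alert.check)
        if key not in summary:
            summary[key] = {
                "first_seen": alert.timestamp,
                "last_seen": alert.timestamp,
                "count": 1,
                "severity": alert.severity,
            }
        else:
            row = summary[key]
            row["last_seen"] = alert.timestamp
            row["count"] += 1
            row["severity"] = alert.severity  # keep the most recent severity
    return summary
```

An operator would then look at a handful of summary rows, each with a repeat count and the latest severity, rather than at every raw event. Real noise reduction goes well beyond exact-key deduplication, which is the point of the argument here.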

The goals of digital transformation

This is exactly how we at Moogsoft look at the world. When Phil and I had the idea for Moogsoft AIOps, we talked to C-level executives of some of the largest corporations to understand their pain and their goals for their Digital Transformation.

Imagine how surprised Phil and I were back in 2011, before we set up Moogsoft, to find that, although there had been continual innovation in infrastructure (elastic compute, cloud) and application agility (DevOps), the ITOM and ITSM industry simply had not, and has not, remained relevant.

ServiceNow did a great job delivering a more agile architecture than BMC Remedy, reducing the cost of administering the processes. However, that did not do anything to help transform processes and make them more efficient. ServiceNow, BMC, HP Service Manager, et al. have reinforced the silos of support and the linearity of process (Alert, Ticket, Escalation, Escalation).

At Moogsoft, we explicitly set out to solve three core problems faced by CIOs who have SIAM today and are engaged in or planning to engage in a Digital Transformation:

  1. Early detection of issues across dynamically changing applications and infrastructures
  2. Fewer tickets, and situation awareness across the impacted stakeholders
  3. Reduced MTTR through contextualizing an incident, collaboration, and the dynamic re-cycling of human knowledge

In doing so, we realized that only machine learning and AI — and more specifically, only the right machine learning and AI applied to the appropriate problems — would be effective in delivering a solution to the problems facing the modern CIO.

Moogsoft Algorithms and Processes

Dave goes on to point out that most issues that impact applications are the consequence of poorly applied changes. At Moogsoft, although we agree that this is certainly an issue in traditional infrastructures, in modern elastic infrastructures (public and private cloud, etc.), a change is typically not the single root cause. In fact, in modern infrastructures, our tolerance of failure is such that single faults rarely, if ever, cause service or application impact.

The issue in modern infrastructures is when multiple issues occur in separate silos. Unknown to operations teams, these apparently isolated faults in resilient architectures (which include self-healing micro-changes in cloud/elastic IT fabric) across silos can lead to performance or capacity degradations which, if left unchecked, lead to application or service impact… then the customers call!
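To make the cross-silo point concrete, here is a minimal sketch that groups alerts arriving close together in time and flags a group only when it spans more than one silo. This is a simple time-gap clustering chosen for illustration, not a description of how Moogsoft AIOps works. The tuple layout and the `max_gap` and `min_silos` parameters are assumptions.

```python
def find_situations(alerts, max_gap=120.0, min_silos=2):
    """Group alerts by time proximity and keep groups that span several silos.

    alerts: iterable of (timestamp_seconds, silo, message) tuples (assumed shape).
    max_gap: start a new group when the gap to the previous alert exceeds this.
    min_silos: minimum number of distinct silos for a group to be reported.
    """
    clusters, current = [], []
    for alert in sorted(alerts, key=lambda a: a[0]):
        if current and alert[0] - current[-1][0] > max_gap:
            clusters.append(current)
            current = []
        current.append(alert)
    if current:
        clusters.append(current)
    # A cluster is only interesting if faults from different silos coincide.
    return [c for c in clusters if len({silo for _, silo, _ in c}) >= min_silos]
```

With this approach, three minor alerts from network, storage, and database teams within a couple of minutes would surface as one candidate situation, while a lone alert an hour later would not. Each silo looking only at its own alert would see nothing worth escalating.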

Dave and the team at Wikibon also discuss prearranging service level agreements in order to meet and exceed expectations.

Again, this is a very practical approach for traditional infrastructures. However, in the case of modern infrastructures and DevOps practices, when it comes to operations and service management, the configuration of applications and infrastructure can change almost minute by minute, and correspondingly, the behavior that causes incidents and service interruptions changes constantly.

This means that even the concept of pre-defining service levels through modeling is impossible. In any case, proposing a service level below the expected 99.999% availability is a non-starter. You would be ridiculed: “What, less than Microsoft Azure or Amazon AWS offers?”
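To put the 99.999% figure in perspective, a quick calculation shows how little downtime it leaves. The sketch below is plain arithmetic, assuming a 365-day year. The helper name `allowed_downtime_minutes` is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

At 99.999%, the budget works out to roughly 5.26 minutes per year, which is less time than it typically takes for a ticket to bounce between two support silos.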

In reality, only an IT operations system underpinned by the appropriate machine learning and AI techniques can offer a viable solution to enable service levels to be met. Yes, something like Moogsoft AIOps.

Moogsoft AIOps Process

Dave and the team rightly propose that IT departments need to “start acting like a cloud provider,” take a pragmatic approach, and design operations and support around the need to sustain constant change and the need to have multiple parties participating in the diagnostics, restore/remediation, and situation awareness processes.

IT needs to learn, but either there is not enough data to learn from, or what is learned is out of date by the time the learning has been ‘normalized’ (leaving models under- or overfitted).

That’s where Moogsoft comes in. We invested in primary informatics research to solve the key issues:

  • Identify signal from noise
  • Detect anomalies (features or Incidents)
  • Create dynamic stakeholder teams to restore and remediate incidents, and offer impact awareness

We started with a blank canvas and the need to deliver the key outcomes for IT Operations and IT Service Management: early detection, proactive action, and situation-aware collaborative remediation. We recognized that there is a plethora of existing systems and processes in place that may be risky to remove or change, or may be owned by third parties, and so need to be integrated with in order to realize the value and make them proactive.

[Figure: Moogsoft integrated into existing processes to make them proactive and more efficient]

Even then we were not content. We then made sure that we proved our solution in some of the largest, most dynamic applications and cloud infrastructures out across the Fortune 2000. Take this comparison of the “before” (IBM Netcool) and “after” (Moogsoft):

[Figure: Before and after Moogsoft at a cloud-scale internet media business]

So, Dave is right: Think carefully before you add AI to your existing tool sets.

Choose a solution that was designed from the ground up to deliver business value outcomes, and uses the IP from 37 in-house-invented, patented machine learning and AI technologies to deliver those outcomes.

Join the Herd and experience joy with Moogsoft!


Amol Patil

Hi Mike, As usual, fantastic article. It is enriching to read you and Moogsoft papers in general. One scenario you haven’t mentioned is about the unknown triggers, passing them on to the human L1/L2 support and updating the knowledge base, re-learning it before ML improves its prediction %; for next time to trigger the automation engine in case of similar fault occurrence.

Mike Silvey
Spot on, Amol. Honestly, my inclination is to write whole novels each time I want to talk about something; however, I’ve been coached by my marketeers to try to focus as closely as possible on one use case per blog post… groan 😉 I’m planning another blog where I talk about the different types of Machine Learning techniques (Unsupervised, Semi-Supervised, Supervised (Reinforced and Trained)) and how one single technique is not suitable for all use cases or data types. Then I’ll explain which techniques we utilize to perform the proactive reduction and workflow, and which techniques to optimize the ongoing workflow.…