So, this weekend we had a family event. One family member texted two of us separately asking us to collect them from the railway station. We both arrived at the station to pick up the one person.
It was mildly frustrating (and they chose the sporty car over my rather beaten up, and much cooler, old Land Rover Defender).
However, it got me thinking about the current approach to ITOM and ITSM processes, and specifically about how the tools that enforce the ITIL Incident guidelines are very wasteful with our time.
I’m not just having a go at ServiceNow here. We can say the same about BMC Remedy, HP Service Manager, CA Technologies, and a whole gang of other tools we’ve grown to know and love.
So how does this relate to the title of this post? Simple: Agility is a goal.
Achieving agility in IT means being able to change in line and in time with business requirements and demands. When it comes to infrastructure and applications, we’ve moved from planned change to on-demand change. Importantly, to foster agile transformation, we’re breaking our applications into microservices, which in reality multiplies the number of discrete applications and the frequency of change. Where infrastructure is concerned, we’ve moved to a microchanges architecture, enabling the infrastructure to avert bottlenecks.
Agility: Then & Now
This transformation in the fundamental architecture of our applications delivery fabric has changed the way incidents are caused.
Consigned to the annals of history is the concept of a single root cause of an incident. Whereas in the past a single network fault might have disrupted an application or service, single faults now rarely cause any kind of impact (unless a software deployment includes significant bugs). Likewise, as we move to software-defined infrastructures, single changes rarely have any impact on our application behavior.
Today, incidents that impact our applications and end-user services arise from a combination of causes: possibly a fault in compute combined with a fault in storage, or a series of microchanges to the compute and network fabric coincident with a database fault.
The consequence of these multi-causal faults is some kind of performance or capacity degradation, which, if we do not act upon it early enough, will lead to some kind of application or service disruption.
And as the relationship between our IT and applications technology domains has dissolved, so has the relationship between silos of support across those technology domains.
Sadly, ServiceNow and the other ITSM tools have not adapted to enable the agile digital migration.
Support operations is organized into technology silos. Expertise is organized in layers. There is a lack of situation awareness across silos. ServiceNow cannot support Agile Operations.
- Have you ever wasted time investigating a ‘non-fault’?
- You know, those pesky issues that your monitoring raises as alerts, you investigate, only to find that the issue is not within your technology domain, the root cause is with someone else.
- Ever wasted your nighttime investigating a ‘non-fault’?
- Of course, being woken during the night to investigate is the worst experience for a DevOps or support person…and especially when you find that the issue you have been trying to diagnose was caused by another silo.
- Is more than one person investigating the same ‘non-fault’ on different appservers?
- You realize that you are all wasting your time and your talents.
The Right Tool for the Job
ServiceNow was never designed to improve the efficiencies of incident management processes for you and your staff (ServiceNow, however, has done to Remedy what Salesforce.com did to Oracle). But in relation to incident management, all the ticketing tool does is to help you cover your arse.
The ITSM tool is used to demonstrate, through its reporting capabilities, that support has not missed any incidents, has actioned all incidents, and has met any SLAs relating to those incidents. The problem, though, is that applying the classic ITIL linear Event->Alert->Ticket process to modern, highly resilient and complex application infrastructures makes it increasingly hard to meet SLAs.
Since tools like ServiceNow, which reinforce silos and block situation awareness, do not help reduce the MTTR, some operations teams have resorted to interesting approaches to meeting SLA averages: automatically opening and automatically closing a huge number of tickets in order to bring the aggregate MTTR within the SLA commitments!
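To see why that trick works on paper, here is a rough arithmetic sketch. All of the numbers, the SLA target, and the function names are made up for illustration; none of them come from a real ticketing system or report.

```python
from statistics import mean

# Hypothetical resolution times, in minutes, for four real incidents.
real_incident_minutes = [240, 300, 180, 360]

# Hypothetical SLA commitment on the aggregate (mean) time to resolve.
SLA_TARGET_MINUTES = 60


def aggregate_mttr(resolution_minutes):
    """Mean time to resolve across all tickets, in minutes."""
    return mean(resolution_minutes)


def padded_mttr(real_minutes, padding_count, padding_minutes=1):
    """Aggregate MTTR after adding tickets that are opened and closed automatically."""
    padding = [padding_minutes] * padding_count
    return aggregate_mttr(list(real_minutes) + padding)


honest = aggregate_mttr(real_incident_minutes)
padded = padded_mttr(real_incident_minutes, padding_count=20)

print(f"MTTR over real incidents only: {honest:.1f} min")
print(f"MTTR with 20 auto-closed tickets: {padded:.1f} min")
print(f"Within SLA after padding: {padded <= SLA_TARGET_MINUTES}")
```

The real incidents took exactly as long to fix in both cases. Only the average moved, which is why an aggregate MTTR figure on its own says little about what customers experienced.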
Just look at the stats. Several years ago Forrester reported that 74% of IT incidents were detected by the end-users before support and operations were aware of the issue.
Consider the IT Operations Management and IT Service Management working practices that ServiceNow processes force us into:
- Events are received and converted into alerts that are subsequently converted into tickets.
- Calls/Tweets/emails from customers are received, which are turned into tickets.
- A disproportionate number of those tickets are assigned to application support teams.
We end up with a disproportionate number of actionable tickets for application support, and a lot of wasted time.
The problem is that it is difficult for call center operators and Level 1 operations to relate the calls from the customers to events and alerts from the applications and infrastructure, which duplicates and multiplies tickets. Often multiple application support resources will be investigating the symptoms of the same issue on several appservers, unaware of their peers’ activities, wasting resources and time, while the real causes are somewhere in the infrastructure. The added problem is that the resolved tickets rarely contain any useful diagnostic advice.
Bottom line: ServiceNow forces us into highly resource-intensive working practices which increase MTTR, increase the number of tickets, and ensure that we remain reactive to application and service disruptions.
We need to stop doing what we are doing and take a fresh, situation-aware approach.
Moogsoft AIOps uses machine learning techniques to surface event signal from noise, then infer incidents from the event signal, then drive dynamic teaming and collaborative workflow of operations people.
Ultimately, Moogsoft AIOps provides earlier warning to Incident stakeholders (causal and collateral), reduces the number of actionable tickets by making the appropriate parties situation aware, and reduces the MTTR by providing incident context and knowledge recycling.
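The general idea of inferring one incident from many related events, rather than raising one ticket per alert, can be sketched in a few lines. This is not Moogsoft's algorithm. It is a deliberately naive stand-in that groups alerts purely by time proximity, and the `Alert` class, the `group_by_time_window` function, the five-minute window, and the sample data are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point (hypothetical)
    source: str       # technology domain or host that raised the alert
    message: str


def group_by_time_window(alerts, window_seconds=300):
    """Group alerts so that each one joins the current group if it arrives
    within window_seconds of the previous alert; otherwise start a new group."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


alerts = [
    Alert(0, "storage", "latency high"),
    Alert(60, "appserver-1", "slow responses"),
    Alert(90, "appserver-2", "slow responses"),
    Alert(4000, "network", "link flap"),
]

groups = group_by_time_window(alerts)

# Each group is one candidate incident; its sources suggest who should be told.
teams = [sorted({a.source for a in g}) for g in groups]

for number, (group, involved) in enumerate(zip(groups, teams), start=1):
    print(f"Candidate incident {number}: {len(group)} alerts, involving {involved}")
```

Even this crude grouping puts the storage alert and the two appserver alerts in front of the same people at the same time, instead of three tickets landing in three queues. A real system would need far more than timestamps to decide what is related.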
Moogsoft enables agile working practices. We’ve got your back!
About the author Mike Silvey
An expert in IT operational management and technology commercialization, Mike launched SunNet Manager in the UK for Sun Microsystems before founding an open systems service management business at Micromuse, where he brought several innovative service management tools into the European market (such as Remedy) and established key OEM relationships (Cisco, HP, Intel) that led to successful IPOs for both Micromuse and RiverSoft. Today, Mike is focused on scaling Moogsoft by overseeing strategic business relationships with key partners around the globe.