I left my very comfortable job as a pre-sales engineer at the company that I’d idolized as a student and then as a software engineer: Sun Microsystems. There I had been fortunate, along with other folks in Europe like Mark Peach and Nasya Bennacer, to join a tiny, crazy little SPARC clone builder. Together we built a new Service Assurance business around a product from a new startup in the Bay Area called Remedy Corporation.
Remedy was co-founded by Dave Mahler, the innovator behind HP OpenView Network Node Manager, and Larry Garlick, the former VP Networking at Sun and inventor of SunNet Manager, the world’s first SNMP management system. Our team brought trouble ticketing to IT client service in Europe with the Remedy Action Request System, now in v9.1 and still supported by BMC Remedy.
This was before ITIL. Before the ITSM category.
Over the nearly 30 years of this solution’s existence, many competitive ITSM ticketing products have come and gone. New products have appeared along the way. Some have even grown to market dominance, like ServiceNow.
ServiceNow’s innovations over Remedy were twofold. First, they actually have a database schema—who would’ve thought!? Second, they are SaaS. In all other respects however, ServiceNow essentially recreated Remedy in the cloud.
Today, as we arrive at 2020, neither solution has improved total cost of ownership. Neither ServiceNow nor BMC Remedy enables cost reduction of incident workflow operations, which is especially egregious given how much customers spend on configuration consulting …but I digress.
No New Innovation in Trouble Ticketing Systems
Here’s the rub: there has been no new innovation in trouble ticketing technologies over the past 30 years.
Controversial statement? Perhaps. But that statement is perfectly justifiable. To wit:
- Tickets are ‘reactive’ documents
- Tickets are singular, relating to a single source (before reacting, just keep reading…)
- Tickets enforce linear workflow
- Tickets do not represent context
- Incident tickets are singular, relating to a single issue
Ultimately the use of ticketing systems for ITSM does not save money. In fact, it increases the total cost of IT Operations & Support (O&S). It also increases mean-time-to-repair—that is, the business impact on your customers. Loss of reputation can be hard to come back from.
Let’s not blame the original designers of the trouble ticketing system. Remedy, Peregrine, and ServiceNow all designed products that were right for that era of IT. Just as we did at Micromuse with Netcool at that time.
The problem is that trouble ticketing systems add no value today, in the current IT era. Like Neanderthals, ticket systems never evolved, so their trajectory must be extinction.
Trouble Ticketing Was Built in the Client-Server Era
The O&S world that legacy trouble ticketing systems were invented to help organize no longer exists.
Stating the obvious, it’s not 1991 anymore. Back then:
- Single faults caused service impact.
- The transition from modular compute to client-server meant an explosion in network devices, compute, storage, etc.
- Support people were deeply technical (and deeply flawed) heroes who, if you dared approach them at all, would use choice language to deem your problem not worthy of disrupting their time… and “it’s probably a network issue” regardless!
O&S for client-server open systems was VERY unorganized and lacked accountability. IBM mainframe users came from a very organized and accountable support world. In migrating to PCs, they were greeted with support anarchy and, truth be told, a lot more service interruptions back in those days.
So something had to be done about it. What better than a trouble ticketing system that would run on client-server systems and support client-server PC users?
The trouble ticketing system brought process governance and accountability to client-server computing. It allowed for SLAs to be agreed upon and reported on. It underpinned open systems computing, allowing it to go mainstream.
And there were loads of them. Most of them were swallowed up. Remedy was acquired by BMC. Peregrine became HPE ServiceDesk. Clarify went to Nortel, and Vantive to PeopleSoft. Many still exist today, or have only recently been retired.
Limitations of Legacy Trouble Ticketing Systems
Don’t worry. There is a point to this blog.
To return to my original premise… Trouble Ticketing is Dead!
Let’s put a finer point on it. Use of legacy trouble ticketing systems like BMC Remedy and ServiceNow for O&S of modern compute and modularized software is D-E-A-D.
We’re on the cusp of the year 2020. Show me someone, anyone, who has replaced BMC Remedy with ServiceNow Trouble Ticketing and ended up with O&S workflow that is more efficient.
The reality is, although ServiceNow may initially demonstrate a reduction in administration, configuration and maintenance costs, it has NEVER been shown to reduce O&S costs.
Companies that have swapped BMC Remedy for ServiceNow have not really benefited from huge savings. Why? Because the cost of inefficient O&S resourcing, plus the cost of business impact, far outweighs the cost of administering a ServiceNow Trouble Ticketing system. Once ServiceNow has been heavily customized, it actually becomes a TAX on the business, without delivering O&S value… but I digress again.
Some may push back on this argument saying, “Hold on a minute. Trouble ticketing systems are necessary.” They process the support tickets initiated by end users when their service or application is impacted. They fulfill simple service requests like password changes.
Arguably there is a case for the latter, but not for the former. Standard ticketing systems like ServiceNow are totally inappropriate for reporting service interruptions because that increases the workload of the wrong O&S people. Namely, the applications support team!
Anatomy of an IT Incident Ticket
How can I claim that ServiceNow has never shown a reduction in O&S costs? It’s illustrative to review the landscape of modern IT incidents.
- Single faults rarely cause business impact. With this being the case, a system that maps a single fault to a single ticket offers very little value.
- IT incidents are mostly caused by two or more coincident behavior changes. If not caught early, these anomalies lead to some form of service disruption. Trouble ticketing systems offer next to no value because:
  - Trouble tickets are not designed to support multiple “ticket topics”, so they can’t impart cause and context to the responder.
  - Tickets are designed for a single owner/assignee, so they cannot offer situation awareness across a stack of stakeholders, both causal and collateral.
- Tickets are raised because:
  - An alert has exceeded a particular threshold. In reality, many alerts will exceed their thresholds. A breach may mean nothing when set against the periodic behavior of the attribute in question, but it still leads to many tickets, more cost, and absolutely no value. The ticket becomes a petri dish to grow resource costs.
  - One or more end users have reported an issue with the behavior of an application. Typically, this leads to multiple tickets across multiple application teams, yet all relate to the same issue. No value. Just a petri dish growing the cost of spiraling resources and SLA penalties.
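To make the threshold point concrete, here is a minimal sketch in Python. It is not taken from any ticketing or monitoring product: the function names, the CPU figures, and the three-sigma tolerance are all assumptions chosen purely for illustration. It compares a fixed threshold, which raises one ticket per breach, with a check against the usual value for that hour of the day.

```python
from statistics import mean, pstdev

def static_threshold_tickets(samples, threshold):
    """One 'ticket' per sample above a fixed threshold (the 1:1 model)."""
    return [(hour, value) for hour, value in samples if value > threshold]

def periodic_baseline_tickets(samples, history, tolerance=3.0):
    """Flag only samples that deviate from the usual value for that hour.

    history maps hour-of-day -> list of past readings (hypothetical data).
    """
    flagged = []
    for hour, value in samples:
        past = history[hour % 24]
        centre, spread = mean(past), pstdev(past)
        # Guard against a near-zero spread making every reading look anomalous.
        if abs(value - centre) > tolerance * max(spread, 1.0):
            flagged.append((hour, value))
    return flagged

# Made-up CPU readings: busy every working day from 09:00 to 17:00.
history = {h: ([88, 90, 86] if 9 <= h <= 17 else [30, 32, 28]) for h in range(24)}
today = [(h, 89 if 9 <= h <= 17 else 31) for h in range(24)]
today[3] = (3, 75)  # an unusual spike at 03:00 that stays under the threshold

print(len(static_threshold_tickets(today, 80)))   # 9 tickets for normal daytime load
print(periodic_baseline_tickets(today, history))  # [(3, 75)]
```

In this toy data set the fixed threshold generates nine tickets for perfectly ordinary business-hours load and misses the one reading that is actually out of character.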
Our surveys show that up to 86% of the time, the application is not the cause of incidents that occur on monolithic software platforms. However, tickets get routed to application support teams first, most of the time. They then waste a significant amount of diagnostics time proving that their particular application is not the cause, only to escalate the ticket on to support teams in other technology domains.
The High Cost of Trouble Tickets
How can ServiceNow actually reduce IT Operations costs in the modern IT world when it’s a system that…
…Provides no context on the issues causing an incident;
…Enforces a direct 1:1 relationship between an application/device and its apparent causal item, such as an alert or end user report;
…Supports a responder workflow which is both linear and non-collaborative.
Simple answer: it cannot.
This relationship between a reporting item (e.g. an alert or user), the target device or application, and case management workflow based on single ownership results in higher costs across three dimensions:
- The cost of missed SLAs or ruined reputation, where the end user is the reporter of a given issue.
- The cost of “phantom diagnostics”, where multiple application teams waste valuable resources chasing the same issue across multiple tickets.
- The cost of business impact(s), where diagnostics and remediation resources are wasted as tickets are escalated to other technology support domains and higher tiers of expertise.
DevOps & Trouble Ticketing
Up to this point, we’ve been dissecting trouble ticketing as it functions under legacy ITSM, in enterprises that strive to be ITIL compliant.
How about within more agile, continuously iterative DevOps environments? DevOps doesn’t have a need for trouble ticketing, right?
In point of fact, DevOps already has trouble ticketing. They call it PagerDuty or Slack or Jira or WhatsApp. Enterprises playing at DevOps within select teams often use Microsoft Teams or Skype.
As your organization makes the transformation from monolithic to modular software, consider this. DevOps support is not only the Wild West, with more heroes who don’t communicate well with non-technical people. DevOps teams also endure the same finger pointing as application support teams when something goes amiss.
The modularization of modern software means that an issue caused by buggy deployment of new code leads to API consumers being impacted downstream. ALL DevOps teams react the same to an issue perceived to be with their service. They waste time on diagnostics proving it isn’t theirs, wasting developer productivity in the process.
DevOps teams suffer all the same issues as their legacy peers and predecessors:
- Too many pages/tickets/notifications
- A lack of situation awareness between DevOps teams and cloud/infrastructure services
- End users as the incident detection system
- Tickets passed from team to team until the guilty one takes responsibility/ownership
Trouble Ticketing Sorely Lacks Knowledge Capture
In the modern IT environment of elastic data centers, hybrid clouds, virtualized networks, and modularized software, the ticketing system carries yet another MAJOR cost that it does nothing to address.
The life of a systems operator is often a sad one. It’s not much better for an IT support rep. Always acting outside promised service levels: demoralizing. Never having time to document what you fix: demoralizing. Wasting time diagnosing a non-actionable issue: demoralizing.
There is no joy in support, especially at the L1 and L2 tiers. It’s all pressure and no thanks. Consequently, churn is high. One very large IT services provider quoted me their average churn time for L1s: 4 weeks.
Add to churn the inability to capture diagnostics, resolution actions, and operator insights during the incident management workflow. This is a further nail in the coffin for the value of ServiceNow and other ticketing systems. This makes it harder still to deploy robotics automation tools like Resolve Systems, Wipro Holmes, and IPsoft.
As first responder, an L1 is often close to exceeding, or has already exceeded, service levels. This is the norm, not the exception. If they spend time on diagnostics and can’t deduce the cause or resolve the issue, then they need to escalate the ticket. There are other tickets waiting. There’s no time to document which diagnostic activities have been tried and what the results were. The situation is pushed to the next tier of support or an adjacent technology silo.
Often the L2 will perform all the same tests that the L1 has previously… then some more. The diagnostics time just doubled. Even if the L2 actually resolves the issue, there is no guarantee that they will have time to document the fix either. When service levels are already exceeded, that knowledge is never captured.
Knowledge Capture is never there for lower levels of support to glean. Knowledge Capture doesn’t exist for automation tools to access. It’s just lost, often walking out the door with employee churn.
Bringing It All Together: the Economic Value of AIOps
The problem to be solved is reducing everything that contributes to slow MTTR. It is not being solved by unified monitoring, Big Data warehouses, and ITSM tools like ServiceNow. By automating the entire event-to-resolution workflow using AIOps, all the shortcomings of traditional trouble ticketing dissolve: time to detect, causality and context, situation awareness, diagnosis and resolution time, knowledge capture and recycling, workflow processes, and more.
Moogsoft demonstrates massive economic value for enterprises that struggle with trouble ticketing:
- Reducing the number of actionable tickets, enabling the same number of O&S staff to handle apps and services of increasing complexity
- Detecting actionable issues earlier, enabling action and resolution well within SLA times
- Detecting anomalous behavior, acting on and resolving it before service to customers and the business itself is impacted
- Offering situation awareness with necessary context for all issue stakeholders, saving valuable time in diagnostics and in crisis/business continuity processes
- Capturing diagnostics and resolution activities, when cases need to be escalated to secondary and tertiary expertise, thereby reducing MTTR
- Generating a plethora of predictive insights and knowledge for future cases, which further reduce escalations
Moogsoft uses unique patented AI innovations to detect evolving issues as Situations, i.e. clusters of alerts.
A Situation is a representation of one or more incidents as they’re evolving. Incidents are pieces of monitored operational data (such as event logs, alerts or metrics) that reflect an anomalous event that merits attention. These incidents may or may not be impacting service delivery at the moment Moogsoft surfaces them. In other words, incidents aren’t always indicative of outages.
Each incident in a Situation has a stakeholder/owner that can act on it in some way: diagnosing the issue, fixing it, or notifying others about it, such as end users. Stakeholders can be either negatively impacted by an incident, or identified as the cause of it. A Situation is remediated in the Moogsoft AIOps platform’s Situation Room, where stakeholders from different teams can collectively analyze it and collaborate on addressing the issue.
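For readers who think in code, the idea of grouping related alerts can be sketched roughly as follows. To be clear, this is not Moogsoft's patented approach; it is a deliberately naive stand-in that groups alerts purely by arrival time. The `Alert` and `Situation` classes, the `cluster_by_time` function, and the 120-second gap are all hypothetical names and values invented for this illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    timestamp: float  # seconds since some arbitrary start
    source: str       # component that raised the alert
    message: str

@dataclass
class Situation:
    alerts: list = field(default_factory=list)

    @property
    def sources(self):
        """Distinct components involved, in alphabetical order."""
        return sorted({a.source for a in self.alerts})

def cluster_by_time(alerts, max_gap=120.0):
    """Naive grouping: an alert joins the current cluster if it arrives
    within max_gap seconds of the previous alert, else it starts a new one."""
    situations = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if situations and alert.timestamp - situations[-1].alerts[-1].timestamp <= max_gap:
            situations[-1].alerts.append(alert)
        else:
            situations.append(Situation(alerts=[alert]))
    return situations

alerts = [
    Alert(0, "db", "replication lag"),
    Alert(45, "api", "timeouts rising"),
    Alert(100, "web", "5xx errors"),
    Alert(900, "backup", "job overran"),
]
for situation in cluster_by_time(alerts):
    print(situation.sources)
# ['api', 'db', 'web']
# ['backup']
```

Even this crude grouping turns four would-be tickets into two units of work, and the first one already shows three teams that they are looking at the same event.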
Our patented Situation Room allows Moogsoft to transform IT Operations economics by driving a more efficient incident workflow. Each Situation’s clustered alerts enable the relevant issue(s) to be “socialized” to the appropriate stakeholders with situational awareness, i.e. context to accelerate diagnosis. Teams can then collaborate to take actions appropriate to their domain as either part of the problem (causal) or the impact (collateral). All can document what actions have been taken and report back, automatically creating knowledge artifacts that relate to the Situation.
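As a rough illustration of that workflow, the sketch below models a single shared record that several teams write to, instead of one ticket per team. It is not the Situation Room's actual data model or API; `SharedIncidentRecord`, its methods, and the sample teams are hypothetical and exist only to show how logging actions in one place yields a reusable knowledge artifact.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    team: str
    role: str  # "causal" or "collateral"
    note: str

@dataclass
class SharedIncidentRecord:
    summary: str
    stakeholders: dict = field(default_factory=dict)  # team -> role
    actions: list = field(default_factory=list)

    def add_stakeholder(self, team, role):
        if role not in ("causal", "collateral"):
            raise ValueError("role must be 'causal' or 'collateral'")
        self.stakeholders[team] = role

    def log_action(self, team, note):
        # Only registered stakeholders can log; unknown teams raise KeyError.
        self.actions.append(Action(team, self.stakeholders[team], note))

    def knowledge_artifact(self):
        """Plain-text history that the next responder can read."""
        lines = [self.summary]
        lines += [f"[{a.role}] {a.team}: {a.note}" for a in self.actions]
        return "\n".join(lines)

record = SharedIncidentRecord("Checkout latency spike")
record.add_stakeholder("database", "causal")
record.add_stakeholder("checkout-app", "collateral")
record.log_action("database", "Rolled back index change")
record.log_action("checkout-app", "Confirmed latency back to normal")
print(record.knowledge_artifact())
```

The point of the design is that documentation is a by-product of doing the work in a shared place, rather than an extra step that gets skipped when service levels are already blown.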
Here are just some examples of the economic value of Moogsoft realized by customers:
- 99% reduction in noise event telemetry
- >4 hours earlier detection of actionable issues
- 75% reduction in L1 activities
- 90% reduction in L1-L2/SRE escalations 3 months after adopting Situation Room workflows
If ServiceNow is the workflow that your users are accustomed to, you can layer in Moogsoft to make your ServiceNow more efficient. But after that, you’ll be able to do what many of our customers have done: switch off the Trouble Ticketing workflow and adopt our Situation Room workflow instead! Some of our customers refer to what Moogsoft enables as “Collaborative Ticketing”.
Welcome to Moogsoft. Welcome to resource-efficient, collaboration-enabling, business impact-averting, event-to-resolution workflow.
Trouble Ticketing is dead. Long live Collaborative Ticketing!
About the author Mike Silvey
An expert in IT operational management and technology commercialization, Mike launched SunNet Manager in the UK for Sun Microsystems before founding an open systems service management business at Micromuse, where he brought several innovative service management tools into the European market (such as Remedy) and established key OEM relationships (Cisco, HP, Intel) that led to successful IPOs for both Micromuse and RiverSoft. Today, Mike is focused on scaling Moogsoft by overseeing strategic business relationships with key partners around the globe.