What is (Still) Bothering IT Monitoring Professionals in 2018

I spent last week on the road with our good Moogsoft partners, Cherwell and amasol. Cherwell is of course an interesting ITSM vendor, making waves in what has been a fairly static market for the last few years, while amasol is a systems integrator based in Munich. The pattern that I saw emerge in a week of talking to practitioners and vendors is that there is a lot of good stuff going on, but a lot of it is happening in isolation, and value is getting lost in the cracks.

Starting a New Tradition with Cherwell

This is the first year for the Cherwell EMEA Conference, and it was a great event. Everyone at the show was looking for the next big thing that would give them an edge and help them keep up with users’ requirements and expectations. One of the biggest frustrations that came up in almost every conversation I had was that there are simply too many tickets to deal with.

The cause: monitoring tools are inherently very chatty, telling operators about everything that is going on and leaving it up to the IT Ops team to figure out what matters. The problem with this approach is that it creates far too many tickets for people to be able to keep up with, drowning operations staff and risking burning them out. The problem is then compounded because it takes overloaded specialists too long to detect and diagnose problems – so users are prevented from doing their own jobs by IT issues, and then they start calling up the service desk and raising even more tickets.

Continuing a Tradition with amasol

The amasol event was more varied, with a mix of IT practitioners and vendors from a range of domains. This is the annual AAWF user forum, where amasol encourages its customers to talk to each other, as well as to the software vendors it partners with. In addition to Moogsoft, other vendors in attendance included Microsoft and Dynatrace. However, the best parts were the presentations by IT Operations professionals, candidly discussing their successes and failures, and what they had learned from each. The common problem that most people brought up was that while they had invested in tools in various areas, and were more or less satisfied with their choices, they still struggled to get an overall picture of what was going on across all of the teams and technology domains.

How to Make Existing Tools Perform Better

In my workshops and keynotes over the course of the week, I tried to show how AIOps can improve both sides of this situation: the “sea of red” that makes it hard to identify what is actually important, and the duplication of effort that comes with segmented views of the environment. By placing an algorithmic layer in between the monitoring tools and the ITSM system, Moogsoft is able to make the experience better for everyone.

Moogsoft AIOps analyzes events from monitoring tools and uses mathematical models and machine learning to identify in real time which events are even worth a second look. This avoids overwhelming operators with irrelevant noise, ensuring that alerts that make it as far as a busy IT Ops professional’s attention are actually worth their time.

This algorithmic noise reduction already goes a long way toward cutting an overwhelming volume of tickets, but there is still the risk of duplicated effort when different teams are looking at different aspects of the same problem. Moogsoft AIOps is able to use algorithms to correlate events in real time across different data sources and technology domains, in order to build a complete picture of the actual problem that users are going to care about.
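To make the idea of cross-source correlation concrete, here is a minimal sketch in Python. It is not Moogsoft's algorithm, which this post does not describe; it is a deliberately simple stand-in that groups events hitting the same service within a time window, whichever tool reported them. The names Event, Incident, correlate and window_seconds, and the tool and service names in the usage note, are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    source: str       # monitoring tool that emitted the event (hypothetical)
    service: str      # affected service, used here as the correlation key
    timestamp: float  # seconds since an arbitrary start point
    message: str


@dataclass
class Incident:
    service: str
    events: list = field(default_factory=list)

    @property
    def last_seen(self):
        # Timestamp of the most recent event attached to this incident.
        return self.events[-1].timestamp


def correlate(events, window_seconds=300.0):
    """Group events for the same service that arrive close together in time.

    Assumption: a fixed time window plus a shared service name is enough to
    call two events related. Real products use far richer signals.
    """
    open_incidents = {}  # service -> most recent Incident for that service
    incidents = []
    for event in sorted(events, key=lambda e: e.timestamp):
        current = open_incidents.get(event.service)
        if current is not None and event.timestamp - current.last_seen <= window_seconds:
            # Close enough to the previous event: same incident, no new ticket.
            current.events.append(event)
        else:
            # Too far apart, or first event for this service: new incident.
            current = Incident(service=event.service, events=[event])
            open_incidents[event.service] = current
            incidents.append(current)
    return incidents
```

With this sketch, an "apm" event and an "infra" event for a "checkout" service one minute apart would end up in a single incident, so two teams would see one shared problem rather than two unrelated tickets.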

The great benefit of this approach is that it does not require a massive overhaul of existing systems that are working today. Moogsoft AIOps can integrate with products that are already in place, unlocking additional value from investments that were made in the past. Another benefit is that the running costs of those existing solutions are also reduced. If you don’t need to spend time configuring your monitoring agents to filter events, you can get more done. You also detect issues earlier, because you didn’t discard the “informational” event that was the early indication of something beginning to go wrong.

In the same way, if tickets coming into Cherwell are actionable, there is much less need to qualify and triage each one quite as hard. Incidents can instead be investigated immediately, further accelerating resolution times.