Almost every time I have a conversation with executives about IT operations, the topic of “self-healing” or automation comes up. I get that this is where companies want to go, and I’ve seen some good work around robotics, auto-onboarding, auto-provisioning, and basic run-book automations such as restarting servers. But I think true self-healing IT is a vastly more complicated conversation, one that is more about deriving intelligence from your data than buying a bunch of automation platforms.
With more than 20 years of experience in sales, I can immediately tell when executives have been sold fantastic stories and visions based on PowerPoint presentations. Buzzwords like AI and automation are thrown around like confetti, and many of those stories rest on incredibly oversimplified ideas of how these things work.
Back to the data question for a moment. Allow me to share a story about one of Moogsoft’s newest customers and a competitive proof of concept they ran, pitting us against one of the big ticketing vendors that had just introduced a new IT operations module.
I knew my job was going to be tough going in. My competition was an incumbent with mindshare, existing contracts, and probably financial incentives around new modules. The potential client was a global Tier 1 financial enterprise that had operational silos and dozens of monitoring tools. The company had done something pretty cool, and aggregated their monitoring data into a single end-to-end bus for us to take a feed from, enabling a massive test of our capabilities in real-time scenarios.
Each vendor was asked to “ingest” live production data (some 600,000 events per day), and the goals of the assessment were event correlation and reduction of the time to detect and resolve incidents. The Moogsoft team showed up with our machine-learning platform, and our competition showed up with a rules-based system and a slick topology-based visualization.
To borrow a phrase from the movie The Untouchables, our competition brought a knife to a gunfight.
Human Intelligence Guiding Artificial Intelligence
At Moogsoft, we know that success is based on the effective understanding of our customers’ environments and their data. This led me to make numerous calls to different operations managers to talk about their data, pain, and current processes.
During this process, a key executive let me know that event correlation would diminish in value for them because they were automating the resolution of issues at the alert level. They had already achieved automation against 20% of their alerts. While I congratulated him on that, a thousand questions started running through my head. If they were getting tens of thousands of alerts (deduplicated and filtered events) daily, how many run books would they have to write? How would they keep up with the rate of change? How were they going to automate things that weren’t the obvious result of a single noncritical causal alert?
Luckily, their head of automation gave me time to talk, and we had a frank and open conversation around these questions. We both agreed that the next 10% of automation was going to be exponentially harder than the initial gains. It was a big data problem: the detection and diagnosis of complex recurring events needed to be part of the automation workflow.
What’s more, they needed the capability to trigger an external action when those patterns were identified. We shook hands on an informal agreement that we at Moogsoft would attempt to show progress against that problem in our trial deployment — and agreed that if we could show this working, we would be aligned with their automation strategy.
And it worked. We were intelligently correlating the 600,000 events we ingested per day into about 400 actionable incidents. That was a big improvement over the 4,000 tickets they were creating through manually introduced rules. Once the operations teams started triaging issues with Moogsoft, our neural net started learning and detecting complex recurring incidents. A eureka moment came one day when we saw a repeating issue that was taking down their trading application about three times a week. Our algorithms saw the pattern, the cause, and what the teams were doing to fix it.
The next step was obvious: We configured Moogsoft so that the next time it saw one of those trigger patterns, it would fire off a RESTful call to the API of an orchestration platform, which would then launch a remediation. Our platform did just that, and within weeks we had a new customer.
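For readers who want to picture that hand-off, here is a minimal sketch of the idea in Python. It is not Moogsoft’s configuration or any real orchestration platform’s API: the endpoint URL, the pattern signatures, the workflow names, and the payload fields are all hypothetical and chosen only for illustration.

```python
import json
import urllib.request

# Hypothetical endpoint: the article does not name the orchestration platform or its API.
ORCHESTRATOR_URL = "https://orchestrator.example.com/api/remediations"

# Hypothetical mapping from a recurring-incident signature to a remediation workflow.
KNOWN_PATTERNS = {"trading-app-outage": "restart-trading-service"}


def build_remediation_request(incident, url=ORCHESTRATOR_URL):
    """Return a POST request for a known recurring incident, or None otherwise."""
    workflow = KNOWN_PATTERNS.get(incident.get("signature"))
    if workflow is None:
        return None  # Unrecognized pattern: leave it to the operations team.
    payload = {"incident_id": incident["id"], "workflow": workflow}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_remediation(incident, timeout=10):
    """Send the request if the incident matches a known pattern; return the HTTP status."""
    request = build_remediation_request(incident)
    if request is None:
        return None
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Building the request separately from sending it keeps the decision of whether a pattern is known easy to check without touching the network. An incident with an unrecognized signature produces no call at all, which mirrors the point that only detected, recurring patterns should launch a remediation.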
Big Issues Need Powerful Tools
Is this full self-healing? Probably not. Is it automation? Absolutely, and it’s a powerful tool in dealing with massive, complex data to identify and resolve big issues in IT operations. What I love about my job is demonstrating the power of artificial intelligence and seeing how it’s helping people on a level beyond simply reducing and prioritizing all the IT noise that’s common in enterprises. For me, it’s usually about helping people on a human level to make their work easier.