As Artificial Intelligence and Machine Learning continue to attract attention and investment across the world’s more forward-thinking businesses, it’s normal for the people depending on the technology to be skeptical about the effectiveness and trustworthiness of the underlying algorithms. AI is, after all, automating and assisting with a lot of the work that was previously in the hands of humans – but with the promise of much better speed, precision, and scale.
In Healthcare, for example, AI is already being used for various life-changing use cases, including detecting early signs of cancer from blood samples or predicting heart disease. Medical professionals naturally want a clear understanding of how and why the AI is providing certain answers to avoid false diagnoses.
In the world of IT Operations, similarly, Artificial Intelligence for IT Operations (AIOps) is now being leveraged to analyze data within the world’s most complex IT infrastructures to keep the lights on across mission-critical business services. IT Ops and DevOps teams from more risk-averse businesses, now handing off significant workloads to AIOps platforms, also want a clear understanding of AI decision-making before entirely depending on these platforms.
But as AI becomes more data-driven, intelligent, and change-tolerant, visibility into the underlying decision-making processes becomes harder to achieve. So how can IT Ops and DevOps teams that need a smarter way to run operations learn to trust technology that they don’t fully understand?
Trusting Data-Driven AI
Running a Proof-of-Value (POV) for AIOps platforms is an effective way to test how a product really works in your own environment and how compelling the results are. (For more guidance on how to evaluate AIOps Platforms, check out this AIOps Buyers Guide)
But once you get to AIOps Platforms that leverage true AI & ML science, which can interpret data without being explicitly told what to look for (as opposed to the more common rules-based and pattern-matching approaches), you’ll find that end-user visibility into the exact decision-making processes decreases. This is partly due to the inherent sophistication of Data-Driven AI and partly because vendors protect their IP.
Real-world results on the effectiveness of AIOps are overwhelmingly positive (just ask Gartner), but there are still risk-averse businesses that hesitate to trust AI. The more sophisticated algorithms don’t provide the same visibility as the more basic approaches, whose rigid, human-defined logic clearly explains the decision-making process to end-users.
On the one hand, this is understandable considering that anyone with a scientific background is trained to question anything they don’t understand. On the other hand, the hesitation is odd, as everyone and their grandmothers today already use and benefit from true AI and ML on a daily basis, whether they realize it or not.
A few examples of true AI & ML in our daily lives include:
Transportation: Uber uses AI to make ride-sharing more efficient, identify fake accounts, suggest optimal pickup and drop-off points, and even predict UberEATS wait times. (More: https://eng.uber.com/machine-learning/)
Spam Detection: Through the use of an artificial neural network, Gmail successfully filters 99.9% of spam. Gmail’s spam filters continuously learn from a variety of signals, like the text within the message and message metadata (where it’s sent from, who sent it, etc.).
Shopping Recommendations: With hundreds of millions of customers and products, Amazon uses AI to generate highly personalized and accurate product recommendations for shoppers.
In all of these examples, the logic behind the AI isn’t available to the end-users, but it’s used and trusted on a daily basis. Why? Because Data-Driven AI is capable of identifying signals and patterns within massive volumes of data when humans just can’t. The AI in these use cases provides sustainable economic value for these businesses and their customers. In other words, it works!
Much like Amazon shopping recommendations, AIOps platforms can help IT Ops and DevOps teams by recommending which Alerts to focus on and even what resolution steps to take.
Understanding Data-Driven AI
Contrary to popular belief, the AI used in more advanced AIOps platforms can and should be understood by the people using the technology. There just needs to be a distinction between understanding how AI works conceptually vs. understanding each exact step that was taken to produce an output.
Every person interacting with AI should understand how it’s working conceptually. This practice helps us understand the best use cases for each technology and when the outputs should be treated with trust vs. caution. For example, there is usually a tradeoff with AI techniques between precision (how many of the flagged items are actually relevant) and recall (how many of the relevant items get flagged), and users should understand which one a given technique favors so that they treat the outputs properly.
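To make the precision vs. recall tradeoff concrete, here is a minimal Python sketch. The alert IDs and the two "techniques" are made up for illustration; they are assumptions, not output from any real AIOps platform.

```python
def precision_recall(flagged, actionable):
    """Return (precision, recall) for flagged alert IDs vs. truly actionable ones."""
    flagged, actionable = set(flagged), set(actionable)
    true_positives = len(flagged & actionable)
    # Precision: of everything we flagged, how much was worth flagging?
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: of everything worth flagging, how much did we catch?
    recall = true_positives / len(actionable) if actionable else 0.0
    return precision, recall


# Hypothetical ground truth: the alerts a human would have wanted to see.
actionable = {"a1", "a2", "a3", "a4"}

# A cautious technique flags little, but everything it flags is relevant.
cautious = {"a1", "a2"}
# A broad technique catches everything relevant, along with extra noise.
broad = {"a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8"}

print(precision_recall(cautious, actionable))  # (1.0, 0.5)
print(precision_recall(broad, actionable))     # (0.5, 1.0)
```

Outputs from the cautious technique can be acted on with confidence, but it will miss things. The broad one misses little, but its outputs need a second look.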
On the other hand, understanding each computational step that an AI technique takes to generate an output is not only becoming more and more difficult as AI becomes more sophisticated (and protected), but it’s also not relevant if the tested outputs are strong – i.e. better, cheaper, and faster than your current state.
When assessing products that leverage Data-Driven AI, people need to focus on whether or not they are achieving sustainable economic value from AI, and not on how the AI generated a specific output.
Otherwise, you will miss out on the benefits altogether.
Rules vs. Data-Driven AI
AI & ML are thrown around interchangeably today, and quite frankly, incorrectly, especially in the AIOps space.
If you look at the AIOps landscape today, most vendors who claim to leverage AI really perform basic data analysis using techniques like Rules, Behavioral Models, and Regex Patterns. These techniques depend on rigid, human-built logic to accomplish Alert Suppression, Alert Grouping, and Incident Detection. The key point is that the logic here isn’t defined or changed by the data. This is the approach taken by vendors like BigPanda, PagerDuty, and ServiceNow.
These techniques are simple, easy to use, and provide complete visibility into how they function. For smaller and simpler IT environments with infrequent change, AIOps platforms that take this approach are a great option! Even in large and complex environments, these techniques are useful for catching ‘known’ alerts and patterns that you anticipate.
On the downside, they are intolerant to change and require 100% matches across alert attributes to identify and act on specific alerts. Even worse, each model relies upon the precise alignment of event arrival order, construction, and the logic applied to get the desired result. As those change, the results don’t just degrade; they go from 100 to 0 percent. If you work in a large IT environment, your data probably isn’t consistent and clean enough for this to be effective. The consequence is that you need to continuously build and adjust rules and correlation patterns to accommodate your growing IT infrastructure, while still running the risk of missing hidden signals and patterns due to rigid, inflexible logic.
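A small Python sketch shows how brittle an exact-match rule can be. The rule and the alert text below are hypothetical and are not taken from any vendor’s product.

```python
import re

# Hypothetical human-written rule: it must match the alert text exactly.
DISK_RULE = re.compile(r"CRITICAL: disk usage on (?P<host>\S+) above 90%")


def rule_matches(alert_text):
    """Return True only if the whole alert text fits the rule."""
    return DISK_RULE.fullmatch(alert_text) is not None


original = "CRITICAL: disk usage on db-01 above 90%"
# Same problem, but the monitoring tool reworded its message after an upgrade.
reworded = "CRITICAL: disk utilization on db-01 above 90%"

print(rule_matches(original))  # True
print(rule_matches(reworded))  # False
```

One changed word takes the rule from catching the alert every time to never catching it, until someone notices and rewrites the pattern.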
These techniques are undoubtedly valuable, but modern IT environments with serious complexity require more advanced approaches as well.
AIOps platforms with True AI have sophisticated, proprietary algorithms that have been tailored through research to solve specific use cases. The vendors behind the platforms don’t reveal their secret sauce because there’s intrinsic value in how their algorithms transform data to solve big problems, and they don’t wish for other vendors to reproduce their work. This is certainly the case at Moogsoft.
Data-Driven AI techniques are absolutely crucial for managing increasingly large and complex IT infrastructures without continuously adding more bodies to interpret data. These techniques allow you to extract real-time insight from massive data volumes across rapidly changing environments, without relying on humans to redefine and maintain their logic.
A majority of production outages across enterprise IT today are ‘Black Swan’ outages, meaning that they haven’t occurred before and existing models and predictive approaches can’t catch them (Source: Hundreds of conversations with Enterprise IT folks). Unsupervised approaches that make sense of data without being explicitly told what to look for are capable of identifying the ‘unknown unknowns’, and bringing them to the attention of humans before they become a customer-impacting issue. While these underlying computational steps aren’t fully visible to the end-users, these AI techniques are providing crucial insights that you can’t otherwise obtain. Additionally, they are flexible, change-tolerant, and require little to no maintenance.
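As a rough illustration of grouping alerts without predefined patterns, the toy Python sketch below clusters alerts by word overlap (Jaccard similarity). It is an assumption-laden stand-in chosen for brevity and is not the algorithm used by Moogsoft or any other vendor; the alert text and the 0.5 threshold are invented.

```python
def tokens(text):
    """Split an alert into a set of lowercase words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_alerts(alerts, threshold=0.5):
    """Greedily put each alert in the first group whose first alert is similar enough."""
    groups = []
    for alert in alerts:
        for group in groups:
            if jaccard(tokens(alert), tokens(group[0])) >= threshold:
                group.append(alert)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([alert])
    return groups


alerts = [
    "CRITICAL: disk usage on db-01 above 90%",
    "CRITICAL: disk utilization on db-01 above 90%",
    "WARNING: login latency high on web-03",
    "WARNING: login latency elevated on web-03",
]

for group in group_alerts(alerts):
    print(group)
```

Nobody told the code what a disk alert or a latency alert looks like, yet the reworded messages still end up in the same two groups. Real platforms use far more sophisticated techniques, but the principle of letting the data define the groupings is the same.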
What’s more Transparent: Data-Driven AI or Human Interpretation?
What’s ironic about the concern around the transparency of AI and ML is that Data-Driven AI is actually much more transparent than human intelligence. If you think about it, the human mind is a complete Black-Box – no one understands how it works.
In modern IT Operations, humans are manually analyzing alerts, manually removing noise, manually identifying relationships across tools and silos, manually opening Trouble Tickets, and manually escalating those Tickets across teams for resolution. Regardless of how smart the team is or how well-defined knowledge articles and protocols are, there is almost always an element of guessing and human intuition behind each decision.
Furthermore, human decisions are poorly documented. I’ve spoken with many Post-Mortem/RCA teams who explained to me that the way in which any particular Incident unfolded, how it got resolved, and why it got resolved a certain way is essentially a Black-Box!
With AI techniques, there is always a direct connection to the data. Unlike human intuition, AI can be interrogated through testing, reporting, calibration, etc. Contrary to common belief, AI techniques will improve transparency across IT Operations. Even better, it’s now possible to manipulate AI within AIOps platforms through human feedback.
The truth is that both rules-based and AI techniques are crucial for managing enterprise-scale IT environments. The key is to invest in an AIOps platform that leverages both.
While the rules-based approaches are more useful and straightforward for catching known patterns, AI approaches have the secret sauce that is critical for capturing the things that you don’t know, and with minimal effort. Together these approaches will massively reduce your workload (i.e., Alerts & Tickets), provide early warning of potential Incidents, and provide context around Incidents so that you can resolve them faster.
Not understanding the exact decision-making process behind AI can be scary to people who are used to older approaches, but IT Ops and DevOps teams need to focus on whether or not those AI techniques are bringing sustainable economic value to their business.
About the author
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.