Nassim Nicholas Taleb is a renowned scholar who has spent his career focusing on problems of randomness, probability, and uncertainty. In Taleb’s book “The Black Swan – the Impact of the Highly Improbable,” he defines a “black swan” as a highly improbable event with three principal characteristics: it is unpredictable; it carries a massive impact; and, after the fact, we concoct an explanation that makes it appear less random, and more predictable, than it was. According to Taleb, the success of Google was a black swan, and so was 9/11.
Critical IT outages today are typically the result of “black swan” occurrences. The power failure that led to the major Delta Air Lines outage in August, which grounded about 2,000 flights, was a black swan. The October 21, 2016 DDoS attack that impacted Twitter, Reddit, Spotify, SoundCloud and countless other sites and services was also a black swan.
Everyday incidents and failures are anticipated, so organizations that care about their service quality are prepared for them. They have invested in tools that alert the appropriate teams when something goes wrong. The issue is: how do those organizations prepare for something they have never seen before? How will the existing rules and models that monitoring tools rely on work when presented with unruly and unmodeled problems?
This issue is reflected in the following quote from the IT analyst firm EMA:
“87% of organizations have more than 5 monitoring tools, but only detect 27% of incidents”.
How can Black Swan Occurrences be Prevented?
Wouldn’t it be nice if there was a way to prevent unpredictable events from occurring in your IT environment? Yes, it would. Unfortunately, modern IT environments and service delivery models are unpredictable by nature. The increased adoption of things like virtualization, containerization, cloud, and agile development only contributes to that unpredictability.
In short, black swan occurrences are inevitable. The solution is not to figure out how to prevent them, but instead to prepare yourself to address them so rapidly and efficiently that you avoid impact to your business services and keep your customers happy.
Beware of User-Driven Incident Detection
We get it: when someone’s job is to identify and address IT faults and errors, it’s exciting to be the hero. Today’s Ops and DevOps teams are highly skilled in a variety of tools and proprietary query languages, and they’ve customized their own dashboards to reflect the most crucial KPIs. The problem is that user-driven detection doesn’t work for black swan occurrences.
These dashboards are only as effective as the care with which they were built, and only for the conditions that held at the moment they were built. In other words, these KPIs and correlation searches were configured in anticipation of certain issues that could be modeled and predicted with an estimable degree of probability, but it is an unanticipated issue that will eventually bring down your service.
The metrics being collected and presented could be an indication, or a symptom, that something is wrong, but they aren’t actionable. If anything, they let the operator know that they need to start looking for something actionable. Humans should be spending their time troubleshooting an issue once it has been detected and contextualized.
Black Swan Occurrences Require Automated Proactive Insight
When a black swan incident does occur, there are two things that an organization needs in order to prevent impact. The first is comprehensive monitoring instrumentation. This will provide visibility across the entire physical and virtual stack so that any unusual feature is represented via an event. This is crucial. However, modern large-scale environments will be flooded with millions of these events each day, with false positives and duplicates that obfuscate visibility.
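To make the duplicate problem concrete, here is a minimal Python sketch of one common way to thin out an event stream before any analysis: collapse repeats of the same alert that arrive within a short time window. The `Event` fields, the `(source, check)` key, and the 60-second window are all assumptions chosen for illustration; they are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A hypothetical monitoring event; field names are illustrative only."""
    timestamp: float  # seconds since some reference point
    source: str       # host or service that emitted the event
    check: str        # name of the check that fired
    message: str


def deduplicate(events, window_seconds=60.0):
    """Keep the first event per (source, check) key and drop repeats that
    arrive less than window_seconds after the last kept event for that key."""
    last_kept = {}  # key -> timestamp of the last event we kept
    kept = []
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.source, event.check)
        previous = last_kept.get(key)
        if previous is None or event.timestamp - previous >= window_seconds:
            kept.append(event)
            last_kept[key] = event.timestamp
    return kept


if __name__ == "__main__":
    stream = [
        Event(0.0, "db1", "cpu", "CPU high"),
        Event(10.0, "db1", "cpu", "CPU high"),    # repeat inside the window
        Event(5.0, "web1", "disk", "Disk 91% full"),
        Event(70.0, "db1", "cpu", "CPU high"),    # outside the window, kept
    ]
    for event in deduplicate(stream):
        print(event.timestamp, event.source, event.check)
```

A filter like this only removes literal repeats. It does nothing about false positives or about relating events from different parts of the stack, which is where the second requirement comes in.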
The second requirement is applying unsupervised machine learning with an IT Operations lens, in real-time. Unlike preconfigured queries, rules, models, etc., unsupervised machine learning is the only possible way to detect the unknown unknowns, a.k.a. the black swans. These algorithms can automatically analyze billions of events in real-time to understand unusual “features” and identify complex relationships across production stacks.
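As a toy illustration of detection without preconfigured rules, the Python sketch below flags unusual values in a series of per-minute event counts using the modified z-score (median and median absolute deviation). No threshold is set per metric and no labeled examples are needed; the data itself defines what "normal" looks like. This is a deliberately simple stand-in, not the algorithm of any specific product, and the function name, the sample counts, and the conventional 3.5 cutoff are assumptions for the example.

```python
from statistics import median


def robust_outliers(values, threshold=3.5):
    """Return the indexes of values whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. It is less distorted by extreme values than
    a mean-and-standard-deviation score.
    """
    if not values:
        return []
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread at all: anything that differs from the median is unusual.
        return [i for i, v in enumerate(values) if v != med]
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


if __name__ == "__main__":
    # Hypothetical events-per-minute counts with one sudden burst.
    counts = [100, 102, 98, 101, 99, 103, 97, 100, 950, 101]
    print(robust_outliers(counts))
```

Production systems work on far richer features than a single count, and they also have to cluster related anomalies into one incident, but the principle is the same: the baseline is learned from the data, not written down in advance.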
As opposed to user-driven detection, these algorithms actually tell you what to look at, in real time and with context. With this technology in place, teams can be notified of a black swan occurrence in just seconds, and have the narrative required to effectively conduct their user-driven analysis to resolve the issue before service is impacted.
Be Proactive by Leveraging Algorithms
With the abundance of freak incidents impacting some of the world’s most highly used services this year, it’s clear that traditional assumptions about incident management no longer hold true. Black swan incidents are the biggest threat to service quality and can be effectively mitigated by applying unsupervised machine learning. Tools like Moogsoft offer these algorithms to provide actionable insights as well as an automated workflow, allowing operators to understand and address these unpredictable incidents before they have any impact.
About the author Sahil Khanna
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.