In ITOps, False Positives can lead to alert fatigue and wasted time and effort, but maybe the occasional false positive isn't a bad thing.
In Operations, “False Positive” is almost always a pejorative term. False positives are things to be avoided at all costs. They conjure up thoughts of unnecessary pages, false alarms, already busy people wasting time, and alert fatigue.
I’m going to offer a slightly unorthodox view, in which we learn not to fear the false positive, we manage them, and embrace the good they can bring. But first, how did we get here?
False-positives, and the desire to eliminate them, are one of the primary reasons people have tried (in many cases unsuccessfully) to achieve perfect Root Cause Analysis (RCA). But as everyone grudgingly admits, there is probably no such thing as perfect RCA, and even if there were, it would likely be cost-prohibitive, as it’s a great example of the law of diminishing returns.
Even in areas where RCA has achieved some success, such as network availability, it’s really just suppression of symptoms, rather than identification of the true cause.
Why are the returns diminishing? It’s largely due to the exponential growth of complexity and data, coupled with the rate of change in our environments. Most RCA techniques are based on some form of model, and modeling is struggling to keep pace with our constantly changing, microservice-based, hyper-virtualized world.
So as the symptom suppression solutions begin to crumble, the specter of false-positives once again comes back to haunt us.
To be honest, even the “good old days” of static environments weren’t that great. The elimination of false positives, in favor of their apparent corollary, the actionable alert, led to aggressive over-filtering. Useful data that would have provided additional insight was rejected, causing all sorts of operational issues through the inevitable lack of visibility.
False Positives, a Definition of Terms
The first thing we need to do is gain a better understanding of what a false positive might be.
The most obvious come from thresholds. If a threshold is set too low, it will trigger an alarm much earlier. Not only does that make it possible for an alarm to fire under normal operating conditions, but you will almost certainly get far more alarms.
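To make that concrete, here is a minimal sketch in Python. The CPU utilisation samples and both threshold values are made up for illustration, and `first_alarm_and_count` is a hypothetical helper, not part of any monitoring product.

```python
# Hypothetical CPU utilisation samples (percent), one per minute.
samples = [42, 55, 61, 58, 73, 79, 84, 91, 96, 88]


def first_alarm_and_count(values, threshold):
    """Return (index of the first sample above the threshold, number of samples above it)."""
    breaches = [i for i, v in enumerate(values) if v > threshold]
    first = breaches[0] if breaches else None
    return first, len(breaches)


# A conservative threshold versus a deliberately lower one (both values are assumptions).
high = first_alarm_and_count(samples, 90)
low = first_alarm_and_count(samples, 70)

print("threshold 90:", high)  # first alarm at minute 7, 2 alarms in total
print("threshold 70:", low)   # first alarm at minute 4, 6 alarms in total
```

With this data, the lower threshold fires three minutes earlier but produces three times as many alarms. That is the trade-off in miniature: earlier warning, more noise.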
The same applies to “downstream” suppression. If devices or services are unavailable due to the failure of a dependency, they are suppressed in favor of the “root cause”, the dependency. But does that leave you sufficient context? In the case of redundant networks, is the service available, or just suppressed as symptomatic?
Clearly there are better ways to approach this. For example, we would advocate clustering over suppression, but that’s a topic for a different discussion (check out Moogsoft’s work on Graph Topology clustering).
Precision & Recall
Speaking of clustering, no discussion of false-positives would be complete without an understanding of precision and recall.
When grouping (correlating) things together, recall can be thought of as a system’s ability to comprehensively identify ALL members of a group. Precision, by contrast, is a measure of accuracy: the avoidance of including things that don’t belong, which are a form of false positive.
For example, I’m looking out of my window watching vehicles drive by, and have decided to count the white cars that pass by over five minutes. If I correctly count every single white car, I can boast my recall is 100%. However, if half of my count is made up of white trucks, then my precision is only 50%. The trucks might be considered false positives.
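The arithmetic behind the white-car example can be sketched in a few lines of Python, using the standard definitions of precision and recall. The specific counts (10 white cars, 10 white trucks) are assumed for illustration; only the 100% and 50% figures come from the example itself.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    counted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / counted if counted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Assumed numbers: 10 white cars drove by and I counted all of them (no misses),
# but I also counted 10 white trucks by mistake.
precision, recall = precision_recall(
    true_positives=10,   # white cars correctly counted
    false_positives=10,  # white trucks wrongly counted as cars
    false_negatives=0,   # white cars I missed
)

print(f"precision = {precision:.0%}, recall = {recall:.0%}")
# precision = 50%, recall = 100%
```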
The question then becomes, what’s more important? To not accidentally count trucks, or to be confident you’ve counted every single white car?
The trade-off between precision and recall is well studied. In medicine, for example, a surgical oncologist’s goal is to remove the entirety of a tumor, but avoid unnecessary removal of adjacent tissue. The precise location of the tumor will determine whether the surgeon prioritizes precision over recall, or vice versa.
If you apply this concept to the clustering of operational alerts and events (such as a Moogsoft AIOps Situation), then we would always advocate recall over precision. It’s far better to occasionally group a superfluous alert into a situation, than miss an important one, and lose context. The “relaxed” precision situation will be more comprehensive, and contain all the data required to more rapidly resolve the situation.
Your Friend, the “False” Positive
What about the false positives due to thresholds? In the context of a clustered group of alerts, the “false” positive really can be your friend.
By shifting from an alert-driven approach to a Situation focus, you are decreasing the potential for false positives. If your call to action is no longer a discrete threshold violation, but the situation in context, it becomes more useful than distracting.
Consider grouping the threshold violation with other indicators of issues, such as service degradation, customer complaints, and log messages. Now the “false positive” in isolation is ignored, but in context, it can become an early indicator.
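As a rough illustration of that idea, the sketch below groups alerts for the same service that arrive close together in time, and only treats a group as worth attention when more than one kind of signal agrees. This is a deliberately naive stand-in, not Moogsoft's clustering algorithm; the `Alert` fields, the 10-minute window, and the "two kinds of signal" rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    kind: str    # e.g. "threshold", "log_error", "customer_complaint"
    minute: int  # minutes since an arbitrary start time


def group_by_service_and_window(alerts, window=10):
    """Naive grouping: same service, each alert within `window` minutes of the previous one."""
    groups = []
    for alert in sorted(alerts, key=lambda a: (a.service, a.minute)):
        last = groups[-1] if groups else None
        if (last and last[-1].service == alert.service
                and alert.minute - last[-1].minute <= window):
            last.append(alert)
        else:
            groups.append([alert])
    return groups


def is_actionable(group):
    """Treat a group as worth attention only if at least two different kinds of signal agree."""
    return len({a.kind for a in group}) >= 2


alerts = [
    Alert("checkout", "threshold", 3),
    Alert("checkout", "log_error", 6),
    Alert("checkout", "customer_complaint", 11),
    Alert("search", "threshold", 40),  # isolated threshold breach
]

groups = group_by_service_and_window(alerts)
actionable = [[a.kind for a in g] for g in groups if is_actionable(g)]
print(actionable)
# [['threshold', 'log_error', 'customer_complaint']]
```

The lone threshold breach on the hypothetical `search` service is grouped on its own and never pages anyone, while the same kind of breach on `checkout` becomes the earliest piece of evidence in a group that does deserve a look.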
Go ahead and lower that threshold a little. It won’t create more issues, but it might notify you earlier.
It’s at this point that some of the more astute people I talk to challenge me: If you start reacting too early, aren’t you running the risk of responding to false positives again? While that might seem like a risk with many of today’s accepted practices, if you’re gaining the benefits of noise reduction that situation-based management gives you, that becomes a bargain of a trade-off.
One final thought: Even if there is the occasional false positive (the keyword is occasional, and of course we’re monitoring and reporting how often they occur), is it really a bad thing?
If your child coughs, you don’t automatically assume it’s the flu and start treating them. You do, however, keep a closer eye on them for a while.