I finally got a chance to sit down and collect my thoughts about the recent Monitorama 2015 event that took place in Portland. It was a great couple of days, with a host of fascinating and often entertaining talks by folks deeply immersed in the day-to-day struggles of monitoring rapidly evolving infrastructures.
I was given the opportunity at Monitorama 2015 to present a three-and-a-half-minute lightning talk on the concept of Real-Time, Collaborative Situational Management, explaining how it improves service availability in a DevOps environment, saving time and money. I tried to deliver this topic with a light and humorous tone, which seemed to go over well with attendees, judging by the laughs and cheers.
My Key Take-Aways
Over the course of Monitorama, attendees saw a number of talks outlining how open source monitoring solutions have been deployed, and in a number of cases, how they’ve been developed. One thing I noticed was a big focus on the mechanics of monitoring: what to monitor, clever ways to reduce footprint, ways to handle massive scale, ease of deployment, what transport mechanism to use, etc. Inés Sombra in particular captured everyone’s attention with her insight and experiences from Fastly.
What was less discussed, however, was what to do with all the data once it had been captured. Dashboards were fairly well represented, being the de facto final resting place for most instrumented data at small scale. Yet the other popular discussion topic – and a more pertinent area of interest (at least for me) – was alerts.
Here at Moogsoft, we are voracious consumers of alerts, and as such, have fairly strong opinions on what an alert is, isn’t, and what it could be. As I sat and listened to the monitoring exploits and recommended best practices, I wondered how alerts have evolved over the years, and what exactly constitutes a “high quality” alert today.
In some cases, we’re still struggling with the basics, striving for a stateful alert that can tell you when a problem condition has been resolved without an expensive validation test, or worse, inconsistent heartbeats that force the implementation of timers. Fortunately, with increased adoption of modern interfaces and expressive message formats such as JSON, the technical reasons for poor quality are dying out.
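To make the stateful idea concrete, here is a minimal sketch of what such alerts might look like in JSON. The field names (`alert_id`, `state`, and so on) are illustrative assumptions, not any particular tool’s schema; the point is that a “clear” event carrying the same identifier as its “raise” lets a consumer resolve the condition without polling or re-validating it.

```python
import json

# Hypothetical stateful alert payloads: a "raise" event and its matching
# "clear" event share an alert_id, so a consumer can close out the problem
# condition without running its own validation check.
raise_event = json.dumps({
    "alert_id": "disk-usage-web01",
    "state": "raised",
    "severity": "critical",
    "metric": "disk_used_pct",
    "value": 97.2,
})

clear_event = json.dumps({
    "alert_id": "disk-usage-web01",
    "state": "cleared",
})

def apply_event(open_alerts, payload):
    """Track currently open alerts, keyed by alert_id."""
    event = json.loads(payload)
    if event["state"] == "raised":
        open_alerts[event["alert_id"]] = event
    elif event["state"] == "cleared":
        open_alerts.pop(event["alert_id"], None)
    return open_alerts

alerts = {}
apply_event(alerts, raise_event)   # problem condition raised
apply_event(alerts, clear_event)   # same condition resolved; no timers needed
```

Contrast this with a source that only ever emits “raised” messages: the consumer is left guessing when the problem went away, which is exactly what forces the heartbeat-and-timer workarounds mentioned above.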
Furthermore, for a variety of reasons (many understandable and good!), alerts coming out of tools, software, and infrastructure today are less structured and less consistent than they were 10-20 years ago, forcing us to use machine learning to make sense of it all. Without an algorithmic, data-driven approach, it’s nearly impossible to separate the signal from the noise, making it difficult to achieve situational awareness early enough to see an anomaly unfolding in real time.
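As a toy illustration of what “algorithmic, data-driven” can mean at its simplest, the sketch below greedily groups free-text alert messages by token overlap (Jaccard similarity). This is deliberately naive and is not Moogsoft’s algorithm or any product’s; the similarity threshold is an arbitrary assumption. It just shows how even unstructured messages can be reduced from a stream of individual alerts to a handful of candidate situations.

```python
def tokens(text):
    """Split a free-text alert message into a set of lowercase tokens."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity between two token sets: |A∩B| / |A∪B|."""
    return len(a & b) / len(a | b)

def cluster_alerts(messages, threshold=0.5):
    """Greedy single-pass clustering: put each message into the first
    cluster whose seed message is similar enough, else start a new one.
    (Illustrative only; the 0.5 threshold is an arbitrary choice.)"""
    clusters = []
    for msg in messages:
        t = tokens(msg)
        for cluster in clusters:
            if jaccard(t, tokens(cluster[0])) >= threshold:
                cluster.append(msg)
                break
        else:
            clusters.append([msg])
    return clusters

alerts = [
    "web01 disk full on /var",
    "web02 disk full on /var",
    "db01 replication lag high",
]
# The two disk alerts share most tokens and land in one cluster;
# the replication alert starts its own.
```

Real-world approaches are far more sophisticated, but the principle is the same: let the data, not hand-written rules, decide which alerts belong together.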
I’ve also noticed that one of the bigger obsessions recently has been “scale,” i.e. how many billions of events per second can be evaluated and stored. But again, without an automated, data-driven approach to separating signal from noise, this will only result in (other than addressing the issue of reliability) even more alerts to process.
Anomaly detection was yet another area of focus – now, this is getting interesting. When it comes to the “thresholding” of time-series metrics, the state of the art hasn’t really changed in decades, and we’re still struggling with the limitations of static threshold values. Anomaly detection is still in its infancy, but the acknowledgement that we can use algorithms to pre-process data and improve alert quality is great news for the industry.
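A small sketch makes the limitation of static thresholds tangible. On a metric whose baseline slowly drifts upward, a fixed limit eventually fires on every sample, while even a crude trailing z-score check fires only on the genuine spike. This is illustrative pseudocode of the idea, not any tool’s implementation; the window size and z cutoff are arbitrary assumed values.

```python
import statistics

def static_threshold(values, limit):
    """Classic approach: flag every sample above a fixed limit."""
    return [i for i, v in enumerate(values) if v > limit]

def zscore_anomalies(values, window=5, z=3.0):
    """Flag samples that deviate from a trailing window's mean by more
    than z standard deviations. (Illustrative only; window=5 and z=3.0
    are arbitrary choices, and real detectors are far more robust.)"""
    flagged = []
    for i in range(window, len(values)):
        hist = values[i - window:i]
        mean = statistics.mean(hist)
        stdev = statistics.pstdev(hist) or 1e-9  # guard against zero stdev
        if abs(values[i] - mean) / stdev > z:
            flagged.append(i)
    return flagged

# A baseline drifting steadily upward, with one genuine spike at index 20.
series = [10 + 2 * i for i in range(25)]
series[20] = 95

static_threshold(series, 50)  # fires on the spike AND every drifted sample after it
zscore_anomalies(series)      # fires only on the spike
```

The static threshold has no answer for the drift short of someone manually re-tuning the limit, which is precisely why algorithmic pre-processing of metrics is such a welcome development.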
About the author: Richard Whitehead