Why is it that most IT Operations teams still miss production incidents? It’s 2017 and most enterprises have best-in-breed tools like AppDynamics, New Relic, Dynatrace, Solarwinds, Splunk, Nagios, Riverbed and so on. They have smartphones, sleek laptops, sexy 27” displays, and one-click web conferencing. They even get alerts warning them before something bad happens. Yet most operators still miss most production incidents.
Well, after about two years of working with enterprise monitoring teams at Moogsoft, I’ve developed some insight into this, and can summarize the key reasons why teams still struggle to detect production incidents. You’ll be surprised at some of these findings — I know I was initially. It seems our industry still has a long way to go before operations teams can proactively detect incidents.
Here are the top seven reasons:
1. Your Monitoring Tools are Not Deployed in Production
Yep, you read that correctly. Just because you bought licenses a long time ago doesn’t automatically mean they’re actively deployed or used in production. Many customers thought their monitoring was deployed, but after an audit realized only a small percentage of their production stack was being monitored. Worse still, many hadn’t logged into their monitoring tools for weeks or even months, only to realize those tools had stopped working altogether. Monitoring tool infrastructure and agents can periodically fail, and if no one administers them, they just stay down. The last thing you need is for your monitoring tools to be down when your app or infrastructure is down. This happens more frequently than you would imagine.
Answer these questions: Which tools are currently active and deployed in production to monitor your…
- End user experience (transactions, browser & mobile devices)?
- Application run-times (JVM/CLR/node/engine/containers)?
- Database/NoSQL (Oracle, SQL, MySQL, MongoDB, Cassandra)?
- OS (Win/Linux/Unix)?
- Network devices (firewall/switches/routers)?
- Infrastructure devices (servers, VMs, machines)?
- Storage devices (SAN/S3)?
- Cloud services & APIs (AWS/Azure/Google)?
Any of the above can impact your production application and services. No monitoring = no detection.
2. Alerting isn’t Enabled on Your Monitoring Tools
I’m serious. You’d be shocked at how many organizations don’t turn on alerting. The level of monitoring maturity is still really low because most organizations install the tool and only use default out-of-the-box capabilities and settings. We’ve been trained to use monitoring dashboards, charts and queries because we think those things tell us all the answers. We’ve also been trained to ignore alerts because in the past alerting has done nothing but annoy us — “meh, bollocks to alerts.” Most customers we meet at Moogsoft initially have only 15–20% of their alerts enabled at any one time across all toolsets.
Again, how many of your production monitoring tools have alerting enabled? What do they tell you? Where do those alerts go? Who receives those alerts?
3. Alerting is Enabled but it’s Too Noisy
One key reason your alerting might not be enabled is because in the past it generated too much noise for your teams and operators. An easy way to get rid of noise is to simply disable alerts from specific tools. One insurance customer told me that one of their tools was generating 65,000 alerts over a 48-hour period, and to reduce noise they simply shut it down.
Noise-to-signal ratios are a huge problem in the enterprise. Moogsoft has customers that are generating up to a billion alerts a day — how do teams deal with that volume?
3a. Your Baselines aren’t Accurate (Static vs. Dynamic)
One reason for the generation of too many alerts is that alert or anomaly thresholds are too sensitive. Nearly every monitoring tool has static threshold capabilities — these are pure evil and will create more alert crap than anything, especially if you use default settings. Why? Because every application and environment is different. One size or value does not fit all. At the very least, you need to invest some time in alert configuration, thresholds and baselining. When was the last time you reviewed your alert thresholds and configuration?
Only a handful of monitoring tools have dynamic baselining and thresholding capabilities. This is where the monitoring tool itself learns what is normal or abnormal for a given time-series metric (e.g. CPU, response time, throughput). It creates a virtual boundary whereby values that exceed those ranges result in an anomaly being detected and an alert being thrown.
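To make the static-vs-dynamic distinction concrete, here is a minimal sketch of dynamic baselining. It is purely illustrative — real products learn seasonal (time-of-day, day-of-week) baselines per metric, while this uses a simple rolling mean and standard deviation; the class and parameter names are my own, not from any vendor.

```python
from collections import deque
import statistics

class DynamicBaseline:
    """Learns a rolling 'normal' band for one metric and flags values outside it.

    Hypothetical sketch: a production tool would learn seasonal baselines;
    a rolling window is enough to show the idea.
    """

    def __init__(self, window=60, sigmas=3.0):
        self.values = deque(maxlen=window)   # most recent samples
        self.sigmas = sigmas                 # width of the normal band

    def observe(self, value):
        """Return True if `value` is an anomaly relative to recent history."""
        anomaly = False
        if len(self.values) >= 10:           # wait for some history first
            mean = statistics.fmean(self.values)
            stdev = statistics.pstdev(self.values) or 1e-9
            anomaly = abs(value - mean) > self.sigmas * stdev
        self.values.append(value)
        return anomaly
```

Note the contrast with a static threshold (`cpu > 90`): the band here adapts to whatever is normal for this particular metric, so a value that is fine for one server can still be flagged on another.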
Did you know that your monitoring tool had this capability? If so, is it enabled today in production? Again, many features you buy might not be enabled by default. I’ve worked at several monitoring vendors that marketed game-changing capabilities only to find that very few were actually activated in production.
3b. Your Alerts Lack Context & Insight (Default Settings)
Simply put, do you understand what your alerts are telling you? Do your alerts make you or your team act (e.g. freak the f**k out and do something ASAP)?
As crazy as this sounds, some monitoring tools are really bad at alerting — both from a content and formatting perspective. They use weird, crazy terminology that only a rocket scientist would understand.
It’s worth taking the time to enrich your alerts, or to get them to tell you things that you might understand or care about. Again, don’t rely on the default alerting configurations. You can easily improve the intelligence of your alerts by spending a few minutes to customize the alert config inside each of your tools. Some are easier than others, but most are fairly intuitive and extensible.
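A sketch of what that enrichment can look like in practice. All the field names, the `SERVICE_OWNERS` lookup table, and the email addresses below are hypothetical — the point is simply turning a terse, tool-specific alert into something an operator can act on.

```python
# Hypothetical enrichment table: maps a host to its business service and owner.
SERVICE_OWNERS = {"payments-db-01": ("payments", "dba-oncall@example.com")}

def enrich(raw_alert: dict) -> dict:
    """Add context (service, owner, readable summary) to a raw tool alert."""
    host = raw_alert.get("host", "unknown")
    service, owner = SERVICE_OWNERS.get(host, ("unknown", "noc@example.com"))
    return {
        **raw_alert,
        "service": service,   # business service affected
        "owner": owner,       # who should be paged
        "summary": f"{raw_alert.get('metric', '?')} breached on {host} "
                   f"(service: {service})",
    }
```

An alert that says “disk_used_pct breached on payments-db-01 (service: payments)” gets acted on; a bare SNMP varbind does not.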
4. You Use Email to Manage Alerts
Yes, this is not a joke. Many large organizations still use email as their primary alert console. It’s convenient and easy to set up, but it’s probably the worst way for teams to detect production incidents. The level of alert duplication is insane, as is the effort to maintain, filter, correlate, and analyze alerts. You might be able to get away with email in a small environment where server and alert volumes are in their hundreds — anything bigger just becomes unmanageable.
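Even the most basic step beyond an inbox — deduplication — illustrates the gap. This sketch (illustrative field names, not any product’s schema) collapses repeated alerts into one entry per host/check with a count, which is exactly the aggregation an email folder cannot do for you:

```python
from collections import Counter

def dedupe(alerts):
    """Collapse repeated alerts into one entry per (host, check) with a count."""
    counts = Counter((a["host"], a["check"]) for a in alerts)
    return [{"host": h, "check": c, "count": n} for (h, c), n in counts.items()]
```

Five identical disk alerts become one line with `count: 5` — in email, they become five unread messages burying the one that matters.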
5. You Use a Legacy Event Manager or MoM (IBM, CA, HP, BMC, EMC)
Enterprises that don’t use email normally have a legacy event manager, or manager of managers (MoM) to aggregate, analyze, and correlate their alerts. Tools like IBM Netcool, CA Spectrum, HP Network Node Manager, BMC Event Manager/True Sight, or EMC Smarts. The vast majority of these tools were built in the last century — you remember the ‘90s?
These tools were built to handle hundreds to thousands of alerts, primarily from the network or infrastructure tiers — a relatively easy way to handle and manage SNMP traps. Fast forward the clock to 2017, and these tools struggle to scale in the real world for a few reasons.
They were never built to handle:
- Millions to billions of events per day (scale);
- Events from next-generation tools like AppDynamics, New Relic and Dynatrace (platform support);
- Events from highly dynamic infrastructure — agile/microservices/cloud environments (future-proof).
The user interfaces for these tools also look like early versions of Minesweeper from Windows 3.1. Detecting production incidents using technology that was built 15-20 years ago is tough.
In addition, many enterprises have teams of people (normally 2-4 people) to create and maintain rules so that these legacy event managers can analyze and correlate alert data from today’s tools and environments. This is a hugely expensive, time-consuming and ineffective way to manage modern IT environments.
At Moogsoft, we work with enterprises that need a more agile, modern and scalable approach to managing alert volumes and detecting incidents. We are replacing static thresholds, filters and rules with dynamic algorithms (AIOps) that can analyze and correlate alerts in milliseconds.
6. Alerting is Enabled but You’ve Filtered Out Too Much
This is somewhat different to having your alerts disabled. Many enterprise monitoring teams have all their alerts enabled in production, but they’ve spent years creating filtering rules to help them suppress and ignore all the alerts that they think don’t really matter. For example, let’s ignore all CPU alerts from Nagios because those things just create noise and rarely indicate that anything is wrong.
This is a problem because it’s easy to accidentally filter out alerts that can come back to bite your ass in the future when you least expect it. Many production incidents are unique in terms of alert patterns, alert frequency, timing, content, and structure. Simply ignoring an alert because one operator thought it was the right thing to do at the time is a sure way of reducing your ability to detect future incidents.
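To show how a blanket suppression rule bites you, here are two hypothetical filter predicates (the field names and values are illustrative). The broad rule — the “ignore all CPU alerts from Nagios” example above — silently drops a critical alert; the scoped version suppresses only the known-noisy low-severity chatter:

```python
def too_broad(alert):
    # "Ignore all CPU alerts from Nagios" -- drops everything, good and bad.
    return alert["source"] == "nagios" and "cpu" in alert["check"]

def scoped(alert):
    # Safer: suppress only low-severity CPU chatter, keep critical signals.
    return (alert["source"] == "nagios" and "cpu" in alert["check"]
            and alert["severity"] == "warning")

# A real incident that the broad rule would throw away.
incident = {"source": "nagios", "check": "cpu_load", "severity": "critical"}
```

If you must filter, filter narrowly — and revisit the rules regularly, because yesterday’s noise can be tomorrow’s incident.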
7. Your Alerts Aren’t Managed in Real Time
Every second counts in production monitoring. If you can have a two- or three-minute heads-up that something bad is about to happen, that might be enough to save your bacon. It’s worth checking the time-aggregation and processing period of your current event/alert repository/console. It’s all well and good ingesting events/alerts from different monitoring tools and event sources, but if that alert data takes 15 minutes to be indexed and analyzed by your tool/console/operator, then you’re pretty much instantly screwed.
You can’t be proactive and detect production incidents when you get notified 15 minutes after the fact. That’s enough time for your users to complain on Twitter, and for your CIO/CTO to pick up the phone and call you. Alerts these days must be processed, analyzed, correlated, and notified in real time. You can’t play the waiting game when it comes to detecting incidents.
Tell Us Your Story
I’d be interested to get your feedback on what you’ve observed at your team/organization. How effective are your teams at detecting incidents in production? What are the challenges you’ve faced or solved in the past?
About the author: Steve Burton