Last week, AppDynamics held their 2nd annual user conference (AppSphere). One key takeaway is that the company’s momentum and growth were very apparent. The event was full of energy and excitement, in part from AppDynamics’ large pre-IPO financing, their new 4.2 release, and the 1500+ IT Operations leaders in attendance from top organizations around the world.
To make the most of this opportunity, Moogsoft decided to conduct a survey to better understand the current state of monitoring from the attendees.
We asked the following ten questions:
On average, organizations have between 5 and 10 different monitoring tools in their production environment, with 50% of organizations having 10 or more tools at their disposal. That’s over 10 different perspectives of performance across applications, networks and infrastructure.
This prompts the next question:
According to our results, over a third of organizations haven’t even attempted event correlation, meaning it’s currently done manually by level 1 and level 2 operators. What is perhaps surprising is that over a third of organizations actually built their own event correlation solution. The remaining third of organizations currently use dinosaur MoMs (Managers of Managers) like IBM Netcool, CA Spectrum, HP OpenView and BMC Event Manager (aka TrueSight).
We now know customers’ approaches, but how well is the event correlation/management actually working?
When asked about the sheer volume of events/alerts that are generated, over 50% said that it was unmanageable or ‘total madness’. 35% said that it was slightly overwhelming and just 10% said that it was manageable. So despite many organizations building their own event correlation or using a dinosaur MoM, they are no better off when it comes to being on top of the event volumes they face every day.
When it comes to detecting and managing IT incidents, time is of the essence. Our next questions focused on Mean-Time-To-Detect (MTTD) and Mean-Time-To-Restore (MTTR).
Over 50% of customers reported that it takes them over 15 minutes to ‘detect’ a severity-1 incident. ‘Detect’ meaning IT is aware that a severity-1 incident exists. 15 minutes might not sound like a lot, but when your business is software-defined, 15 minutes can be worth several hundred thousand dollars. For example, $4.4b of online revenue works out to $8,371 per minute, which is $125,570 of revenue at risk over a 15-minute period.
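For readers who want to plug in their own revenue figure, the arithmetic behind that example is a simple annual-revenue-to-per-minute conversion (the $4.4b figure and 15-minute window are the example’s inputs, not data from the survey):

```python
# Revenue-at-risk arithmetic from the example above.
ANNUAL_REVENUE = 4.4e9            # example: $4.4b of annual online revenue
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a standard year

revenue_per_minute = ANNUAL_REVENUE / MINUTES_PER_YEAR
risk_15_min = revenue_per_minute * 15  # exposure during a 15-minute outage

print(f"${int(revenue_per_minute):,} per minute")   # $8,371 per minute
print(f"${int(risk_15_min):,} over 15 minutes")     # $125,570 over 15 minutes
```

Swap in your own annual figure to see what a single slow detection cycle costs your business.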
Organizations should be aiming to proactively detect a severity-1 incident in under 5 minutes for obvious reasons.
With the level of complexity and deep-rooted interdependencies that exist in a modern IT environment, incidents are also difficult to resolve. For example, the root cause of an impacted application may or may not lie within the application itself. This is clearly reflected in the results of the following survey question.
When asked how long it takes to ‘resolve’ a severity-1 incident, a whopping 50% of organizations reported more than 30 minutes. 38% reported more than 15 minutes. These responses were quite surprising and paint a bleak picture of the challenges that lie ahead for IT operations each day. When you combine the average time-to-detect of 15 minutes with the above responses, most organizations are looking at anywhere between 30 to 60 minutes of business impact per severity-1 incident.
One question we always ask prospects and customers is “How many of your incidents are repeat or duplicate incidents that have happened in the past?”
Not surprisingly, over 90% said that more than 25% of their total incidents are repeat incidents.
This indicates that these customers have multiple toolsets with multiple perspectives and a lack of correlation. It also indicates that there’s a huge lack of knowledge capture, reuse and automation going on in IT. If so many incidents have been observed before, what excuse is there for not detecting and learning from them in the future?
Now, how many people are typically involved with IT support? In this survey, 53% of organizations reported level 1 & 2 support teams of 10-50 people.
Given that the average fully loaded cost of support staff is around $100K, this means that organizations are spending anywhere between $1m and $5m each year on support. When you factor in the rate of scale and change for the business and IT, these numbers could easily double over the next three years.
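As a quick back-of-the-envelope check of those figures, the spend is just team size multiplied by fully loaded cost (the $100K figure is the article’s estimate, not a survey result):

```python
# Annual support spend: headcount x fully loaded cost per person.
COST_PER_HEAD = 100_000  # estimated fully loaded cost of one support FTE

# The 10-50 person range reported by the surveyed organizations.
for team_size in (10, 50):
    annual_spend = team_size * COST_PER_HEAD
    print(f"{team_size} staff -> ${annual_spend:,} per year")
# 10 staff -> $1,000,000 per year
# 50 staff -> $5,000,000 per year
```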
As the previous survey questions indicate, there are a lot of problems related to incident management with the customers we surveyed. So what are their biggest challenges that are keeping them from optimal performance and availability?
The top answer by a large margin was “the lack of a single pane of glass view”. The next top answer was the lack of collaboration.
This makes a lot of sense given the responses we’ve seen previously in this survey. Most organizations have 10+ toolsets, find event volumes unmanageable and have slow detection and resolution times for severity-1 incidents.
Next, we wanted to understand what these customers wanted next from their monitoring vendors. The top answer was predictive analytics to proactively warn of slowdowns or outages. The next top answer was better alerting with less noise and fewer false positives.
Lastly, customers were asked about which deployment options they prefer for their monitoring platforms.
~70% of customers reported a preference for on-premise deployment while ~30% preferred SaaS. A mere 4% reported a preference for private cloud deployments.
The percentage of SaaS deployments will undoubtedly grow over time as more organizations migrate and deploy to the cloud. What’s interesting is that most organizations still prefer on-premise in 2015, which is somewhat surprising given the influx of SaaS-only monitoring vendors out there.
About the author Sahil Khanna
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.