SREcon18 Monitoring Survey

Kelsey Hanger | Tuesday April 10 2018

Moogsoft surveyed attendees at SREcon18 about the IT monitoring challenges they face & the tools they’re using to solve them.

SREcon18 Americas happened in Santa Clara, CA at the end of March, and it was teeming with SREs. (Shocker, I know.) A whopping 52.1% of those we surveyed had “SRE” in their title…but beyond the obvious (Site Reliability Engineering), what does SRE really mean?

If you don’t know, take a look at my coworker Sahil Khanna’s blog on the subject, “DevOps and SRE: Comparing Apples to Oranges?” But for those of you who only have 30 seconds, here’s a quick summary:

  • DevOps is a method to simplify the partnership between dev and ops
  • SREs care about availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
  • Think of SRE as a subset of DevOps, with a stronger focus on software development

Now on to the Monitoring Survey results… with an SRE twist!

Key Findings

  • The top three monitoring tools were Splunk, Nagios, and Datadog.
  • The top three monitoring challenges were alert noise, event correlation across tools, and monitoring coverage.
  • The most commonly cited average alert volume per month was in the hundreds.
  • The most commonly cited average number of P1 / SEV-1 incidents was 0 – 2 per month.
  • On a scale of 1 – 10 — 1 being the most reactive, and 10 being the most proactive — most respondents scored their companies at 7.5.

Top Monitoring Challenges

When asked what their top monitoring challenges were, over 70% of those surveyed admitted that they’re still struggling with alert noise. Since 73% of respondents also said they are using five or fewer monitoring tools, we have to wonder if these tools are just too noisy, and/or if they’re not providing actionable insight.

One attendee chose “Panicked Screaming” as the Notification tool of choice for their SRE team. Does this ring true for your team?

Panicked Screaming was the Notification tool of choice at SREcon18

As in most other Monitoring Surveys we’ve conducted, alert noise was top of mind for SREcon18 attendees. That’s nothing new. But since availability is a primary concern for SREs, these folks seem to be good at their jobs: almost 50% said that they only experience 0 – 2 P1s (wide-scale, business-impacting problems) every month.

These SREs are also an intuitive bunch. When asked to rate their company on a scale of 1-10 (1 being the most reactive, 10 being the most proactive), almost 50% said their company is extremely proactive (7/8) when it comes to incident management.

But what about the size of these SREs’ environments? Well, roughly 30% have 0-1,500 servers, and roughly another 30% have more than 30,000 servers. So it was a mixed bag.

The most exciting addition to the Event Manager club: Grafana! 1.4% of attendees at SREcon said that they’re using the open source tool as their Event Manager. Sure, 1.4% is not statistically significant, but we’ve never seen the tool listed on any previous Monitoring Survey.

Datadog beat out AppDynamics for the top APM spot! AppD actually came in third place this time around, with a measly 9.8% of attendees saying they use it. AppD dominated the past four monitoring surveys we’ve conducted (Elastic{On}, Atlassian, VMworld, Cisco Live), so I wonder how they feel about being dethroned by a purple dog.

SolarWinds continues to be the top NPM tool.

Nagios is still dominating the infrastructure monitoring tooling ecosystem, with over 50% of the respondents saying that they use it.

Splunk takes the top spot again with 55% of SREcon18 folks saying that it’s their log management tool.

SolarWinds’ Pingdom was the top synthetic monitoring tool in this survey, but only 36.4% of attendees said that they use this digital experience monitoring (DEM) solution.

Almost 50% of those we surveyed said they use PagerDuty. It’s also nice to see OpsGenie make the cut; it’s been a while.

Jira was the top ticketing tool at SREcon18. We’ve seen this with SREs before…why is it that they seem to prefer Jira over ServiceNow?

Slack = King of Comms

AWS continues to kill it. Will Azure ever be #1?

SREcon18 Conclusion

I loved it when Dave Rensin, Director of Google Customer Reliability Engineering, said during his session, “Building Successful SRE in Large Enterprises One Year Later,” that Site Reliability Engineering works in companies of all shapes and sizes. He also said you don’t have to look like the Googles or Netflixes of the world to make SRE work.

So go forth, large, slow enterprises, and get your SRE on!

Moogsoft AIOps helps modern IT Operations and DevOps teams become smarter, faster, and more effective by providing technological supplementation that automates mundane tasks, enables scalability, and frees up human beings to do what they do best — ideate, create, and innovate. Start your free trial today by clicking here.

Kelsey Hanger

About the Author

Kelsey Hanger is a Product Marketing Manager at Moogsoft. When she isn’t writing blogs about AIOps or conducting Monitoring Surveys, she loves finding unique eats in and around SF and traveling to the parts unknown, whether that be a speakeasy in Oakland or the ruins of Monte Albán in Oaxaca, México. Feel free to tweet her @KelsHanger or connect with her on LinkedIn.
