SREcon18 Americas happened in Santa Clara, CA at the end of March, and it was teeming with SREs. (Shocker, I know.) A whopping 52.1% of those we surveyed had “SRE” in their title…but beyond the obvious (Site Reliability Engineering), what does SRE really mean?
If you don’t know, take a look at my coworker Sahil Khanna’s blog on the subject, “DevOps and SRE: Comparing Apples to Oranges?” But for those of you who only have 30 seconds, here’s a quick summary:
- DevOps is a method to simplify the partnership between dev and ops
- SREs care about availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning
- Think of SRE as a subset of DevOps, with a stronger focus on software development
Now on to the Monitoring Survey results… with an SRE twist!
One attendee chose “Panicked Screaming” as the Notification tool of choice for his SRE team. Does this ring true for you?
Key Findings
- The top three monitoring tools were Splunk, Nagios, and Datadog.
- The top three monitoring challenges were alert noise, event correlation across tools, and monitoring coverage.
- The monthly alert volume most commonly cited by survey respondents was in the hundreds.
- The most commonly cited number of P1 / SEV-1 incidents was 0 – 2 per month.
- On a scale of 1 – 10 (1 being the most reactive, 10 the most proactive), respondents most commonly scored their companies at 7 or 8.
Top Monitoring Challenges
When asked what their top monitoring challenges were, over 70% of those surveyed admitted that they’re still struggling with alert noise. Since 73% of respondents also said they use five or fewer monitoring tools, we have to wonder whether these tools are just too noisy, whether they fail to provide actionable insight, or both.
As in most other Monitoring Surveys we’ve conducted, alert noise was top of mind for SREcon18 attendees. That’s nothing new. But since availability is a primary concern for SREs, these folks seem to be good at their jobs: almost 50% said that they experience only 0 – 2 P1s (wide-scale, business-impacting problems) every month.
These SREs are also an intuitive bunch. When asked to rate their company on a scale of 1 – 10 (1 being the most reactive, 10 the most proactive), almost 50% said their company is extremely proactive (7 or 8) when it comes to incident management.
But what about the size of these SREs’ environments? Well, roughly 30% have 0 – 1,500 servers, and another 30% or so have more than 30,000 servers. So it was a mixed bag.
SREcon18 Monitoring Survey
The most exciting addition to the Event Manager club: Grafana! 1.4% of attendees at SREcon said that they’re using the open source tool as their Event Manager. Sure, 1.4% is not statistically significant, but we’ve never seen the tool listed on any previous Monitoring Survey.
Datadog beat out AppDynamics for the top APM spot! AppD actually came in third place this time around, with a measly 9.8% of attendees saying they use it. AppD dominated the past four monitoring surveys we’ve conducted (Elastic{On}, Atlassian, VMworld & Cisco Live), so I wonder how they feel being dethroned by a purple dog.
- SolarWinds continues to be the top NPM tool.
- Nagios is still dominating the infrastructure monitoring tooling ecosystem, with over 50% of the respondents saying that they use it.
- Splunk takes the top spot again with 55% of SREcon18 folks saying that it’s their log management tool.
- SolarWinds’ Pingdom was the top synthetic monitoring tool in this survey, but only 36.4% of attendees said that they use this digital experience monitoring (DEM) solution.
- Almost 50% of those we surveyed said they use PagerDuty. It’s also nice to see OpsGenie make the cut — it’s been a while.
- Jira was the top ticketing tool at SREcon18. We’ve seen this with SREs before…why is it that they seem to prefer Jira over ServiceNow?
- Slack = King of Comms
- AWS continues to kill it. Will Azure ever be #1?
SREcon18 Conclusion
I loved it when Dave Rensin, Director of Google Customer Reliability Engineering, said during his session, “Building Successful SRE in Large Enterprises — One Year Later,” that Site Reliability Engineering works in companies of all shapes and sizes. He also said you don’t have to look like the Googles or Netflixes of the world to make SRE work.
So go forth, large, slow enterprises, and get your SRE on!