Hail to the Site Reliability Engineer, or SRE. She or he is responsible for making sure their company’s web presence is up at all times and performing at its peak. The SRE’s work is best summarized as an obsession with three questions: Does it work? Does it work well? Could it work better?
The role is relatively new, originated by web-scalers like Facebook and Google, but the SRE title is becoming more widely embraced. The terminology can vary: Facebook now calls it “Production Engineering.” Airbnb’s SREs consider themselves “Developer Happiness Engineers.” Call it whatever you like; what SREs do is incredibly important and their teams are growing. LinkedIn’s SRE team, for example, consists of about 100 people.
Putting the SRE into Perspective at SREcon15
SREcon15, the user conference dedicated to the SRE, was held in Santa Clara, CA this week. Moogsoft was an active participant and sponsor of the sold-out event, helping to educate the 350+ attendees on how next-generation tools like Incident.MOOG can be used to automate incident detection and remediation across complex web operations.
I spent most of my time attending conference presentations, and as a result, I started to recognize a few common themes. These themes generally translate to best practices advocated by technical leaders of “operations at scale,” and they’re equally relevant to IT Ops and DevOps-centric teams of any size looking to improve their operational processes and service quality. In short, here are the top three takeaways I gleaned from the presentations at SREcon15:
(1) Tackle your most pressing problems first and relentlessly
It starts with being introspective and prioritizing your most pressing problems. Anticipate your company’s needs, and then get in front of the curve, not behind it. “Solve the problem that needs to be solved,” quipped Fernanda Weiden, a Production Engineer @ Facebook. For web-scalers, a top-priority challenge is operational scale, and they attack it relentlessly, often coming up with new, out-of-the-box approaches to get ahead of the enormous growth in load hitting their web sites.
The “engineering” of IT for operational scale is a common challenge that I hear about from Moogsoft customers and the prospects that I work with. If you manage IT for a large company, or you work for a smaller company that’s growing very quickly, achieving scale is strategic. A key element of operational scale is your “operational intelligence” and the monitoring architecture you design and deploy. A common three-layer framework consists of:
(a) A bottom layer that “records” it all, based on instrumenting everything;
(b) A middle layer to turn the massive resulting data stream into exception-based, alert streams;
(c) A top layer that ingests all event feeds into a Manager of Managers (MoM), a tool that can automate real-time incident detection using machine learning and incident remediation using social collaboration.
Remember, when it comes to scale, not all of your operational monitoring tools are created equal. There are tools that are effectively toys, and then there are those designed for scale from the get-go. Incident.MOOG is architected for scale, supporting some of the largest Internet, financial services, and mobile operator companies in the world.
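To make the three layers concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the names (Sample, Alert, to_alerts, merge_feeds), the metric names, and the fixed thresholds are all hypothetical, and a real Manager of Managers would do far more than group alerts by host.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Sample:
    """Layer (a): one raw measurement recorded from an instrumented host."""
    host: str
    metric: str
    value: float


@dataclass(frozen=True)
class Alert:
    """Layer (b): an exception raised when a sample breaches its threshold."""
    host: str
    metric: str
    value: float
    threshold: float


def to_alerts(samples: Iterable[Sample], thresholds: Dict[str, float]) -> List[Alert]:
    """Layer (b): keep only samples that exceed the threshold for their metric."""
    alerts = []
    for s in samples:
        limit = thresholds.get(s.metric)
        # Metrics without a configured threshold never raise alerts.
        if limit is not None and s.value > limit:
            alerts.append(Alert(s.host, s.metric, s.value, limit))
    return alerts


def merge_feeds(*feeds: Iterable[Alert]) -> Dict[str, List[Alert]]:
    """Layer (c): combine alert feeds from several tools into one view per host."""
    by_host: Dict[str, List[Alert]] = {}
    for feed in feeds:
        for alert in feed:
            by_host.setdefault(alert.host, []).append(alert)
    return by_host


if __name__ == "__main__":
    samples = [
        Sample("web-1", "cpu_pct", 97.0),
        Sample("web-1", "latency_ms", 120.0),
        Sample("web-2", "cpu_pct", 35.0),
    ]
    cpu_alerts = to_alerts(samples, {"cpu_pct": 90.0, "latency_ms": 500.0})
    disk_alerts = [Alert("web-1", "disk_pct", 99.0, 95.0)]  # feed from another tool
    print(merge_feeds(cpu_alerts, disk_alerts))
```

The point of the sketch is the shape of the data flow: everything is recorded at the bottom, only exceptions move up, and the top layer sees every feed in one place.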
(2) Get more proactive in improving operations
Minimize how often you’re in fire-fighting mode by investing when you’re not fire-fighting, so that you can do things better in the future. Engineer your production operations. Everyone acknowledges that this is often easier said than done, but conference presenters offered two memorable pieces of advice: (a) simplify everywhere you can, e.g. by eliminating technical debt, removing cruft, and reducing to the bare minimum wherever possible, then (b) commit to following through and implementing the actions that come out of any quality improvement processes, e.g. weekly incident post-mortem meetings. Expect things to go wrong in production (they always do) and put things in place so that you’re simply ready for when “it” hits the fan, minimizing any negative impacts.
Presenters often referred back to the value of deploying a sound operational intelligence framework, like the three-layer one described earlier, as the basis for proactively understanding what is happening and being ready for anything. Situational awareness is paramount. It comes from a real-time, holistic view of what’s going on across the entire IT environment, along with the automated reduction of that view to the “exception” information that everyone can look at when something goes awry. Of course, the earlier the detection of anomalous behavior, the better. “Quantize thyself,” stated Sue Lueder @ Google, in speaking about how to proactively approach incident analysis. “Collect your data from everywhere, drill down by asking five times ‘why’ to understand the root cause of outages, and then commit to implementing the improvement actions.”
Incident.MOOG’s Situation Room UI, together with the archive of response and remediation activity kept for each Situation (incident) as it’s closed, is a fantastic data source for what Sue is talking about. Various Moogsoft customers have also analyzed this rich information source to assess and act on incident trends, making significant improvements in reducing future outages as a result. This is but one of the many ways taking a more proactive approach helps to improve production operations performance.
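As a rough illustration of that kind of trend analysis, the Python sketch below counts the most frequent root causes in a list of closed incidents. The record layout (a dict with a "root_cause" key) and the function name top_root_causes are assumptions made for this example; they do not describe Incident.MOOG’s archive format or API.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_root_causes(incidents: List[Dict[str, str]], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n most frequent root causes among closed incidents.

    Incidents with a missing or empty "root_cause" field are skipped.
    """
    counts = Counter(i["root_cause"] for i in incidents if i.get("root_cause"))
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical post-mortem records, one per closed incident.
    closed = [
        {"id": "1", "root_cause": "config_change"},
        {"id": "2", "root_cause": "disk_full"},
        {"id": "3", "root_cause": "config_change"},
        {"id": "4", "root_cause": ""},
        {"id": "5", "root_cause": "config_change"},
        {"id": "6", "root_cause": "disk_full"},
    ]
    print(top_root_causes(closed, n=2))
```

Even a tally this simple shows where follow-through on improvement actions would prevent the most repeat outages.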
(3) Simplify by automating everything you can
In IT, we are perpetually forced to do more with less. The appetite for service quality only gets greater. The only way around this is through automation. Automation reduces manual errors. Automation gets things done faster. Automation frees IT to be more proactive. Automation is working smarter. “You should always be trying to automate yourself out of a job,” stated Xianping Qu, an SRE who implemented a comprehensive monitoring architecture at Baidu.
Incident detection and remediation are ripe for automation all across IT shops: the tools are available and the results are striking. Incident.MOOG does just this. Technologies like machine learning automate the “noise reduction” of event streams and cluster related alerts into single Situations, all without depending on labor-intensive rules and models that have to be maintained whenever something changes. Furthermore, technologies like social collaboration automate the notification, bringing together the right stakeholders so they can easily share information and resolve incidents faster. Finally, robust tool APIs are essential to integrate with other tools, allowing actions to be automatically initiated based on real-time analytics. As an example, Incident.MOOG provides full APIs using open web interfaces at multiple layers of the platform.
In his keynote presentation, Pedro Canahuati (Production Engineering Director @ Facebook) recited the well-known quote from Deepak Chopra, “All great changes are preceded by chaos.” Implementing these best practices takes personal gumption to overcome the likely cultural, process, and technological resistance to change in your organization. But the beneficial outcomes for your business (and, don’t forget, for you too) are simply enormous, so keep that front and center as you persevere. What are you waiting for?
Underneath my executive demeanor, I’m a geek at heart, and I really enjoyed SREcon15, almost as much as if I were hanging out in the clubhouse with the players of my favorite major-league baseball team. Many of the attendees and presenters at SREcon15 are the all-stars of my field.
I started my career (using Facebook parlance) as a Production Engineer for a Global 50 conglomerate, so I know firsthand what it’s like to have a job that challenges all of your planning and troubleshooting skills, and where you never know what you’re going to encounter when you come to work each day. SREs are the unsung heroes who make the most popular web sites across the global Internet run so well, despite the enormous complexities and need for scale.
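To show what “clustering related alerts into a single Situation” means at the simplest level, here is a toy Python sketch. It is not Moogsoft’s algorithm: it uses a hand-written rule (same service, events no more than 60 seconds apart), which is exactly the kind of fixed rule that machine-learning approaches aim to replace. The Event type, the cluster_events function, and the window size are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some reference point
    service: str
    message: str


def cluster_events(events: List[Event], window_seconds: float = 60.0) -> List[List[Event]]:
    """Group events by service, starting a new cluster when the gap since the
    previous event for that service exceeds window_seconds."""
    clusters: List[List[Event]] = []
    latest: Dict[str, int] = {}  # service -> index of its most recent cluster
    for event in sorted(events, key=lambda e: e.timestamp):
        idx = latest.get(event.service)
        if idx is not None and event.timestamp - clusters[idx][-1].timestamp <= window_seconds:
            clusters[idx].append(event)
        else:
            clusters.append([event])
            latest[event.service] = len(clusters) - 1
    return clusters


if __name__ == "__main__":
    stream = [
        Event(0.0, "db", "replication lag high"),
        Event(30.0, "db", "slow queries"),
        Event(40.0, "web", "5xx rate rising"),
        Event(200.0, "db", "connection pool exhausted"),
    ]
    for cluster in cluster_events(stream):
        print([e.message for e in cluster])
```

Here four raw events collapse into three groups, so an operator sees one item for the first two database events instead of two separate alerts.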
Hail to the SRE!
And hail to USENIX and SREcon15 program co-chairs (Sabrina Farmer @ Google, Andrew Fong @ Dropbox, Fernanda Weiden @ Facebook) for putting on such a relevant and insightful conference.
Want to learn more about Moogsoft’s products? Contact us at firstname.lastname@example.org for more information, and be sure to connect with us on LinkedIn, Twitter, Facebook, and Instagram to stay up-to-date with Moogsoft. You can also sign up for our monthly newsletter, Across Silos.