Insights from SREcon 2016: Best practices & candid discussions of incident management.
SREcon 2016 saw companies from around the globe convene to discuss the issues their site reliability engineers struggle with as their businesses continue to scale, including the quasi-existential question of “what makes a great SRE team anyhow?” It seems that many companies cobble together crack teams of software engineers and operations staff to form their site reliability engineering functions. But regardless of how these teams are structured, they are all working to automate processes that have historically depended on human labor. These processes tend to revolve around performance, availability, efficiency, monitoring, incident management, latency, and reliability.
Speakers from top global companies presented best practices and candidly discussed some of the limitations of their approaches. Two panels struck me as especially fascinating (I had just wrapped up a blog post on the evolution of root cause analysis), featuring two of the most monstrously successful companies of the modern day: Google and Facebook. Here are some of the key insights into how they approach IT incident management.
Facebook Ponders the Question “What IT Alerts Should Humans be Paying Attention To?”
Brian Smith, production engineer at Facebook, started us off by presenting a working definition of the criteria Facebook uses to determine if an IT event should ever reach the human eye – this process is called SAR, which stands for Signal, Actionability, and Relevancy.
- Signal – Was this a false positive? Yes? There must not be enough signal!
- Actionability – When I get this alert, can I do something about it right this second?
- Relevancy – When I get this alert, did another one fire off that tells me the same thing or overlaps? Yes? Delete one of the alarms.
Smith suggests that applying the SAR method and keeping only one alert per stack improves both actionability and relevance. He explained that this was the approach Facebook used to eliminate 97% of its alerts, thereby reducing the daily noise and improving overall operational efficiency.
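To make the SAR criteria concrete, here is a minimal Python sketch of how such a filter might look. It is not Facebook's implementation: the `Alert` fields, the `sar_filter` function, and the 10% false-positive cutoff are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    stack: str                  # hypothetical: the service stack the alert belongs to
    false_positive_rate: float  # hypothetical: share of past firings that were false alarms
    actionable: bool            # can the on-call engineer act on it right this second?


def sar_filter(alerts, max_false_positive_rate=0.1):
    """Apply Signal, Actionability, and Relevancy, keeping one alert per stack."""
    kept = {}
    for alert in alerts:
        if alert.false_positive_rate > max_false_positive_rate:
            continue  # Signal: too many false positives
        if not alert.actionable:
            continue  # Actionability: nothing to do about it right now
        best = kept.get(alert.stack)
        # Relevancy: of the overlapping alerts on a stack, keep the least noisy one
        if best is None or alert.false_positive_rate < best.false_positive_rate:
            kept[alert.stack] = alert
    return sorted(kept.values(), key=lambda a: a.stack)


alerts = [
    Alert("disk_full", "storage", 0.02, True),
    Alert("disk_latency", "storage", 0.05, True),
    Alert("cpu_spike", "web", 0.60, True),
    Alert("cert_expiry_90d", "web", 0.00, False),
    Alert("http_5xx", "web", 0.01, True),
]
for alert in sar_filter(alerts):
    print(alert.stack, alert.name)
```

Picking the least noisy alert per stack is only one possible tie-breaker; a team could just as reasonably keep the alert closest to the user-facing symptom.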
Google Asks “What Metrics Matter the Most in IT Incident Management?”
Sue Lueder, Program Manager at Google, instructed her team to adopt a tagging system in their post-mortem analysis to help pinpoint five key fields they believed were most important for optimizing IT Incident Management:
- Start time
- End time
Google uses this system alongside a severity scale that includes near-misses and cascading failures to determine the threshold for future alerts, continuously asking its teams whether “this is something you’re willing to live with if it happens again.”
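A tagged post-mortem record along these lines could be sketched in Python as follows. Only the start and end times come from the talk as reported here; the `severity` labels, the `tags` set, and the `acceptable_if_repeated` flag are hypothetical names chosen for illustration, not Google's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class PostmortemRecord:
    start_time: datetime
    end_time: datetime
    severity: str = "unrated"             # hypothetical label, e.g. "near-miss" or "cascading"
    tags: set = field(default_factory=set)
    acceptable_if_repeated: bool = False  # "willing to live with it if it happens again?"

    @property
    def duration(self) -> timedelta:
        return self.end_time - self.start_time


def needing_alert_review(records):
    """Return the incidents the team is not willing to live with again."""
    return [r for r in records if not r.acceptable_if_repeated]


record = PostmortemRecord(
    start_time=datetime(2016, 4, 7, 10, 0),
    end_time=datetime(2016, 4, 7, 10, 45),
    severity="near-miss",
    tags={"storage"},
)
print(record.duration, len(needing_alert_review([record])))
```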
Is the Facebook and Google IT Incident Management Approach Right for Your Business?
From post-mortem tagging to defining actionable alerts, these two tech industry behemoths (“2 of the 4 horsemen of the digital apocalypse” according to L2’s Scott Galloway) have undertaken huge efforts to perfect their incident management routines, and both have successfully evolved small-scale incident management within their organizations.
But not every company is Facebook or Google. For the rest of us, techniques such as throw-away commodity solutions, over-staffing operators, or creating massively parallelized data centers are simply not feasible options.
Even if you followed these processes, you would still fall short of real-time detection of net-new problems and elimination of phantom alerts. The answer to scaling operations is to rely on computers to run what these companies have humans managing today. Letting machines handle ongoing analysis while still relying on humans for problem solving and innovation can only lead to better results for these businesses.
Facebook’s methodology could work well for small production environments and for incidents tied to a single root cause. Unfortunately, that is rarely the case for modern businesses, so it would be extremely risky to throw away all but one alert per stack, as event storms tend to have multiple causes. (This is further supported by the Forrester study noting that 74% of IT incidents are reported not by IT but by other parties, including end users. That’s not a good look.)
Instead, consider a solution that correlates the many alerts across your infrastructure by picking out anomalies and patterns in their data, and that does so using insights from problems you’ve experienced in the past. This could be the key to improving your overall service quality, because contextualizing data and understanding the story behind those metrics allows for a stronger and more timely response.
The effectiveness of Google’s system could also be improved by adding a real-time analytics solution. Any time an operator resolved an issue, all of the key metrics they needed would be stored in real time and catalogued against that specific “situation” (defined here as a group of correlated, or “clustered,” events). Teams could then generate their five-key-field analytics in the moment instead of having to go back, inspect, and tag everything in a post-mortem process, which is expensive (especially without tools that dynamically capture forensic activity).
In addition to these key fields, we think it would be hugely beneficial to record diagnostic steps and the actions that actually resolved the issue, so that event clusters (“situations”) can be compared for likeness. That would lower MTTD (mean time to detect) and speed up resolution by using historic data to guide future responses.
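As a rough illustration of what grouping events into a “situation” can mean, the Python sketch below clusters events purely by how close together they occur in time. This is a deliberately simple stand-in, not the correlation method of any particular product; the `cluster_by_time` name and the 60-second gap are assumptions.

```python
def cluster_by_time(events, max_gap_seconds=60):
    """Group (timestamp_seconds, description) events into situations.

    An event joins the current situation when it arrives within
    max_gap_seconds of the previous event; otherwise it starts a new one.
    """
    situations = []
    for event in sorted(events, key=lambda e: e[0]):
        if situations and event[0] - situations[-1][-1][0] <= max_gap_seconds:
            situations[-1].append(event)
        else:
            situations.append([event])
    return situations


events = [
    (0, "db connection timeout"),
    (30, "api 5xx rate high"),
    (200, "disk usage above 90%"),
    (80, "queue backlog growing"),
]
for number, situation in enumerate(cluster_by_time(events), start=1):
    print(number, [description for _, description in situation])
```

A production system would likely correlate on more than time, such as shared hosts, services, or text similarity between event messages.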
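One hedged way to picture comparing situations against history is a set-overlap score. The Python sketch below uses Jaccard similarity over event signatures and returns the resolving action recorded for the closest past situation. The function names, the signature strings, and the recorded actions are all hypothetical, and this is an illustrative technique rather than a description of how any vendor does it.

```python
def jaccard(a, b):
    """Jaccard similarity: shared items divided by total distinct items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar_situation(current_events, history):
    """history is a list of (event_signatures, resolving_action) pairs.

    Returns (score, action) for the closest past situation, or (0.0, None)
    when nothing overlaps.
    """
    best_score, best_action = 0.0, None
    for signatures, action in history:
        score = jaccard(current_events, signatures)
        if score > best_score:
            best_score, best_action = score, action
    return best_score, best_action


history = [
    ({"db_conn_timeout", "api_5xx", "queue_backlog"}, "restart connection pool"),
    ({"disk_full", "log_write_error"}, "rotate logs"),
]
score, action = most_similar_situation({"db_conn_timeout", "api_5xx"}, history)
print(round(score, 2), action)
```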
At Moogsoft, we firmly believe that the future of event data analysis lies in crunching data in real time, as it occurs. While we will always need intelligent human governance, a real-time incident management solution such as Moogsoft AIOps, which uses machine learning techniques to dramatically reduce the number of events passed along to operators and to put them in context, will necessarily carry through to effective remediation and directly impact cost reduction.
Or, in other words, companies that use an automatic, ever-adapting incident management model prop the door open to saving operational costs and freeing humans up to do what humans do best: innovate.