On Tuesday morning of last week, Salesforce.com experienced a massive outage during a site switch of their NA14 (North America 14) instance.
This event quickly became a viral meme and one of the top trending tweets across the internet. Even my grandmother had heard about the Salesforce incident.
Despite the slip-up (which I will come back to), I must applaud Salesforce on their transparency and responsiveness. Salesforce provided constant visibility and status updates through their Salesforce.com Trust site, while most large organizations make an effort to hide poor service. Additionally, service stakeholders, including CEO Marc Benioff, were incredibly responsive to customers.
It really shows that Salesforce.com doesn't just talk about a customer-first culture; they live and breathe it, which is why they are likely doing everything they can to make sure an equivalent of #NA14 doesn't happen again.
Perhaps some politicians could learn something from this: be honest! Nah, that will never happen.
But how did such a massive outage occur at a top-notch company like Salesforce? It's well known that they have invested in a collection of monitoring tools to track performance trends across their applications and servers in real time, so how did #NA14 slip through the cracks?
A Single Root Cause Is a Thing of the Past
When I first heard some of the forensics behind the Salesforce.com issue, I was reminded of some of the first data we ever processed through Incident.MOOG during a proof of concept engagement. It almost exactly mirrored the scenario suffered by Salesforce.com.
The customer, forensically assessing their monitoring data after the issue had occurred, attributed blame (root-cause) for the application outage to their Oracle Database.
And here are the issues:
1. Often, the first time you hear about any issue is when you receive the call from the customer.
2. Due to complexity and a lack of situational awareness across the stack, from network through datacenter to applications, we're heavily focused on time series charts, using time series deviations as indications of issues.
The first problem with #2 is that a time series deviation does not represent or indicate the impact of an issue.
The bigger problem with #2, though, is that a time series deviation is neither the single fault (root cause) nor, in fact, a cause at all. Instead, it is the symptom of a combination of faults that coincide in time, resulting in performance or capacity degradations that leave applications unresponsive, or in other catastrophic chains of issues.
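To make that concrete, here is a toy sketch (my own illustration, not any vendor's algorithm) of the classic time-series alarm: flag points that deviate from the mean by more than a few standard deviations. The metric name and numbers are invented.

```python
import statistics

def deviations(series, threshold=2.0):
    """Return indices of points more than `threshold` standard deviations
    from the mean. A flagged point tells you only *that* something
    deviated, not *why*: the deviation is a symptom, not a cause."""
    mu = statistics.mean(series)
    sigma = statistics.stdev(series)
    return [i for i, x in enumerate(series) if abs(x - mu) > threshold * sigma]

# Invented response times (ms). The spike at the end is the symptom;
# by the time it trips the alarm, the underlying faults (network,
# storage, database) have already compounded.
latency = [102, 98, 101, 99, 103, 100, 97, 104, 100, 900]
print(deviations(latency))  # → [9]: the spike's index, nothing about its cause
```

Note that the alarm fires only once the compound failure has already hit the user-facing metric, which is exactly the "too late" problem described above.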
The scenario that Moog detected (where, if you remember from above, the customer attributed blame to the Oracle database) went something like this:
- there was a load balancer failure [Event];
- there was a network link failure [Event];
- the /[root] filesystem on a database server was filling (raised only as Clear-severity events!), clearly the work of a rogue process somewhere on that machine;
- the database ran out of space, so the application was switched to an alternate database server; however, without the load balancer and without network capacity, the demand from multiple applications caused that database server to grind to a halt and, ultimately, core dump;
- users were down.
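The chain above is exactly the kind of input that temporal clustering works on. Below is a deliberately naive Python sketch (my own illustration, not Moogsoft's actual correlation algorithm) that groups events occurring close together in time into one candidate cluster; the timestamps and event texts are invented to mirror the scenario.

```python
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: float   # seconds from an arbitrary start
    source: str        # which silo's monitoring raised it
    description: str

def cluster_events(events, window=300.0):
    """Group events whose timestamps fall within `window` seconds of the
    previous event in the same cluster. Real correlation engines use far
    richer signals (topology, text similarity), but time alone already
    joins these dots."""
    clusters = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if clusters and ev.timestamp - clusters[-1][-1].timestamp <= window:
            clusters[-1].append(ev)
        else:
            clusters.append([ev])
    return clusters

# The cascade above, reduced to timestamped events (invented times):
events = [
    Event(0,   "network",  "load balancer failure"),
    Event(120, "network",  "link failure"),
    Event(200, "storage",  "/ filesystem filling on db server"),
    Event(260, "database", "out of space; failover triggered"),
    Event(310, "database", "core dump under load"),
]

clusters = cluster_events(events)
print(len(clusters))  # → 1: five silo-local events, one combined incident
```

Each team's dashboard sees only its own row of this list; clustered together, the five events tell one story.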
The performance dashboards in use by the customer clearly showed the database performance issues and the application performance issues, however they lacked the “event” awareness.
Moog detected the events that caused the issue, clustered those events together, and highlighted the Situation. After a couple of hours without resolution, the database core dumped and, consequently, brought the application down.
In other words, Moog detected the unknown unknown without needing to be trained on "what a failure looks like." That matters because, in modern infrastructures, no one has seen every failure scenario!
The important thing to note here is that all the information was available, but it was owned and operated by many different silos.
- The network team has its data; they presume the network is highly tolerant of failures, so if a single link fails, it goes into a queue to be addressed in the normal course of work.
- The storage team doesn't look at or assess Clear-severity events. (In fact, this is true across most companies!) More and more, event messages are badly categorized by severity, especially log messages.
- The database team is unaware of its "customers." It is simply a service, and so was unaware both of the capacity requirements that could hit a single database server after a catastrophic network event, and of any issues impacting that server's capacity.
- The application team is wholly lacking in information from other infrastructure support teams, so they’re reliant on their APM tools. These tools are typically showing trends and not events. Even when an event is triggered, most of the time it is a consequence of some upstream infrastructure issue, so the app team is inundated by phantom issues.
This isn't so different from what occurred at Salesforce.com, where a combination of "things" happened: reading around the edges a little, one or more faults triggered a fail-over, unfortunately without the capacity to sustain the required customer load, which led to subsequent failures.
Organizations like Salesforce manage incidents and outages by assessing events in a linear process. Their IT operations silos (network, application, storage, etc.) each rely on a unique and disparate set of tools, work across layers of expertise separated by ticketing tools, and respond to incidents and outages reactively.
If you have a disparate set of tools operated by siloed teams located across different geographies, with divided responsibilities and no proactive, collaborative relationship between them, how can you expect to join the dots, put all the pieces of the puzzle together, and understand a complex IT incident in a highly distributed environment?
Moog would have alerted Salesforce.com to an issue that needed actioning many hours, if not days, before the capacity and performance degraded to the point where the #NA14 incident became a meme.
- Operations can no longer afford to work in isolated islands of technology;
- there is a need for a single pane of glass across the operations technology silos;
- without situational awareness and collaboration, operations’ productivity suffers;
- without the ability to detect “unknown unknowns”—i.e. detect without training or models—operations will experience a lot of downtime;
- time series deviations are not the way to detect anomalies!
It's the ultimate operations and support management challenge, and exactly what we founded Moogsoft to solve. Moog:
- detects actionable issues much earlier;
- delegates action to the appropriate teams as clusters of alerts, enabling operators across silos to…
- diagnose the causality and impact quickly, in order to remediate the causal issues, (often) before services or customer facing applications are impacted.
By leveraging an approach like Moog's, businesses like Salesforce won't have to deal with the SLA breaches, customer dissatisfaction, and reputation damage that result from a serious outage like #NA14.
The beauty of what we've created here at Moogsoft is that you can leave your existing monitoring and service management tools in place; Moog will Situation-enable them. Even with Moog running alongside the existing tools, the return on investment, in the form of gains in productivity, change agility, and service quality, far outweighs the costs.
Spend less, do more, as we like to say! (Far more understandable than that Vorsprung durch Technik that Audi used to use as their slogan.)
So folks—get Moog and join the dots! Start with a Free Trial!