What is it about the legacy systems used by Delta, along with other major airlines, that has led to such major outages? Is there an easy solution?
Malfunctions occur — it’s an inevitable fact of operating any complex system. But does that justify interrupting the lives of thousands of valued customers?
On Monday of this week, Delta Air Lines experienced a 12-hour outage that resulted in the cancellation of more than 2,000 flights and $5 to $10 million in lost revenue for the airline. More importantly, the indirect consequences of a tarnished brand and lost customer loyalty could pose a real problem for the airline going forward.
This news brings less of a shock than it would have just a few years ago, given last month’s major outage at competitor Southwest Airlines, the latest in a string of airline network outages over the last couple of years.
Managing malfunctions is part of nearly every job. If a pilot is able to consistently navigate through technological malfunctions that could potentially put people’s lives at risk during a flight, why can’t operations teams do the same for airline transactional systems?
The answer is — at least in part — that pilots have access to top-of-the-line technology on their dashboards that intelligently presents real-time performance data in a way that maximizes the signal-to-noise ratio, while major airline IT ops teams are stuck with legacy systems (in this case IBM Netcool) that can no longer cope.
Why Did the Delta Outage Occur?
While customer-facing applications from leading airlines have improved dramatically over the past several years, the transactional systems on the back-end are still composed of, and managed by, legacy technologies.
The Delta outage was apparently caused by a system failure in an Atlanta data center, forcing Delta to restart the entire system over a 12-hour period.
UPDATE 8/11/2016: According to Delta, they’re still investigating a localized, internal power outage that caused their main system to go down. Once the power was restored, their legacy backup systems failed to reboot.
As the Wall Street Journal points out, “Delta’s problems raise questions about whether today’s carriers, larger and busier than ever thanks to a recent wave of mergers, are too reliant on systems dating back to the 1990s. For CIOs, the day’s events underscore the continuing relevance of another IT artifact: The CIO’s role in keeping the lights on.”
When these systems were built, large organizations owned just hundreds of servers. Development teams were making just tens of deployments per year, meaning that environments were essentially static compared to today. Performance and availability metrics were at a manageable volume. Fast forward 10-15 years, and all of that has changed. Change and scale are constant, and systems built for 1990s technology standards cannot possibly cope with today’s software-defined and highly virtualized world.
As Gary Leff, a specialist on airline-loyalty programs, explained to the WSJ in another article, “It is a miracle that the systems work so well most times, given that they are legacy systems grafted onto other legacy systems, meaning airlines can’t possibly be fully prepared for every circumstance that could cause a problem.”
Leff is absolutely correct in pointing out that airlines with legacy systems can’t be fully prepared for every circumstance that could cause a problem. However, is he implying that airlines with modern systems can?
The truth is, these airlines and the CIOs responsible for keeping the lights on need to completely change their mindset.
Airlines Must Adopt a Data-Driven Approach to IT Operations Management
It is impossible to anticipate every potential circumstance that could impact service. This is the reality of the software-defined business, where changes occur on a sub-second basis. If you rely on legacy systems that use manually built rules to tell you when something noteworthy occurs, or when multiple events are related, you are vulnerable to an outage every time your infrastructure changes and those rules stop being relevant.
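To make the point about stale rules concrete, here is a minimal Python sketch. It is purely illustrative: the host names, the severity scale, and the `rule_matches` helper are all invented for this example and do not describe how Netcool or any airline's tooling is actually configured.

```python
# Hypothetical hand-written rule: raise an alert only for hosts that someone
# listed by name when the rule was authored. All names here are made up.
STATIC_RULE_HOSTS = {"db-01", "app-01"}


def rule_matches(event: dict) -> bool:
    """Return True if the hand-maintained rule recognizes this event."""
    return event["host"] in STATIC_RULE_HOSTS and event["severity"] >= 4


# Before an infrastructure change: the failing database is in the rule.
events_before = [{"host": "db-01", "severity": 5}]

# After the change: db-01 was replaced by db-02, but nobody updated the rule.
events_after = [{"host": "db-02", "severity": 5}]

alerts_before = [e for e in events_before if rule_matches(e)]
alerts_after = [e for e in events_after if rule_matches(e)]

print(len(alerts_before), len(alerts_after))
```

The same critical failure produces an alert before the change and silence after it, because the rule encodes a snapshot of the infrastructure rather than its behavior.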
Major airlines need to adopt modern systems that leverage machine learning to analyze patterns and anomalies in real time. Moogsoft, for example, is used by leading enterprises like Cisco and GoDaddy to ingest billions of IT events per day, apply machine-learning algorithms to detect malfunctions in real time, and understand complex relationships so that ops teams can address potential issues before they affect service.
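As a rough illustration of what "data-driven" means here, the sketch below flags a metric value as anomalous when it deviates sharply from its own recent history, with no hand-written rule about which host or value to watch. This is a simple rolling z-score, chosen only for illustration; it is not Moogsoft's algorithm, and the function name, window size, threshold, and sample latencies are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up response times in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 250, 101]
print(find_anomalies(latencies_ms))  # → [8]
```

Because the baseline is learned from the data itself, the check keeps working when servers are added, renamed, or replaced, which is exactly where static rules tend to fall behind.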
In fact, back in the early ’90s, Moogsoft’s founders actually invented the core legacy system (IBM Netcool) currently being used by Delta — but it worked back then! If you speak with the Moogsoft founders, they will tell you that they started Moogsoft to address the challenges of modern IT that have transformed beyond recognition since they invented Netcool.
In order to address issues like the Delta outage this week, a shift in mindset is key. The role of the CIO needs to shift towards becoming the ‘IT Transformer’ — they need to evolve their infrastructure, monitoring, and teams’ operational workflows.
As Delta CIO Rahul Samant says, “Good, safe IT and doing it right — that’s table stakes.”
With modern technology available today that can ensure ‘good, safe IT,’ there’s no excuse for continued outages. While interfering with 15-year-old technology that’s running across your entire infrastructure is understandably concerning, complacency is no longer an option if you want to stay competitive.
About the author
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.