Millions of American football fans like to get more involved with the sport by playing in a fantasy football league. Fantasy football is a virtual betting game where participants manage their own team and compete for points based on the actual performance of players in NFL games.
In the words of my Moogsoft colleague Kevin, “I play fantasy football to have a real stake in the game.” Whether your fantasy football investment is emotional, monetary, or both, it puts you in the shoes of an NFL general manager, and fans who run fantasy teams tend to take that responsibility very seriously. According to the Fantasy Sports Trade Association, over 57 million people in the US and Canada participated in 2015. That number is likely even larger this year.
The most widely used site for fantasy football is ESPN. According to CNN Money, the site had about 7.1 million unique users last year. Last month, ESPN reported that 1.7 million teams were drafted in less than two days during a televised preseason football event.
On Sunday, September 11th, the ESPN Fantasy Sport Platform experienced a major outage that left fantasy players unable to access the application for hours. When trying to view their league on either the web or the mobile app, users were directed to a page with the message, “There was an error trying to reach Fantasy Football.”
What makes this outage particularly damaging to ESPN is its timing. The first official Sunday of football season is considered the most important day of the season for fantasy football, because users actively modify their teams to put themselves in a good position for the rest of the season. Based on the events that take place during a game, users may want to cut players, add players, bench players, or even make trades.
In other words, users are dependent on real-time interaction with the fantasy football application.
ESPN Fantasy Football users were outraged by Sunday's technical issues, and their frustration spread quickly across social media. Even Senator Claire McCaskill (D-MO) was upset, and took to Twitter to voice her frustration.
Worse yet, this isn’t the first time ESPN Fantasy Football has experienced an issue; the application went down during the first week of the 2014 NFL season.
How Did the Outage Occur?
On Sunday afternoon, ESPN shared the following statement:
“ESPN Fantasy is restored and we will continue to monitor. We identified a backend data access issue and resolved as quickly as possible. The issue did not impact data for teams, leagues or rosters. We sincerely apologize to all ESPN fantasy users.”
Based on our experience evaluating similar situations, it seems likely that the ESPN outage was the result of a scaling issue. These issues often occur when the database is misconfigured, which can easily create a bottleneck and eventually impact the application. This type of issue is rather common, as it’s challenging to accurately test for the impact of a rapid shift in scale before you’re in production.
Is there a way they could have avoided this issue?
Machine Learning for Real-Time Service Insights
By now, large IT organizations are leveraging best-of-breed monitoring tools to generate telemetry across their production stack. While this telemetry is crucial for operations teams to understand any potential impact to business services, the sheer scale and rate of change in modern IT environments can result in massive volumes of alerts across disparate systems, leaving Ops overwhelmed and reactive in managing incidents.
The traditional approach to this issue is to use 1990s management tools like IBM Netcool or CA Spectrum, which rely on manually built rules and filters to reduce noise and identify relationships. In the ’90s, this approach was magnificent. But today, it’s impossible to anticipate every possible scenario that could impact service. Changes are occurring on a sub-second basis, and the rules being relied upon are probably flawed.
By leveraging machine learning purpose-built for IT Operations, organizations can automate the reduction of alert noise and the correlation of alerts across their applications, network and infrastructure. This approach provides real-time insight into potential service impact so that Ops teams can address incidents before they affect customers. In the case of ESPN, they could have leveraged machine learning to identify early warnings that queries were running longer than normal, and that an incident might occur.
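To make the idea of an early warning on slow queries concrete, here is a minimal sketch in Python. It is not ESPN's or Moogsoft's method; it is a simple rolling z-score check, and the function name `latency_warnings`, the window size, the threshold and the sample latencies are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def latency_warnings(latencies_ms, window=5, threshold=3.0):
    """Return the indexes of samples far above the recent baseline.

    Hypothetical helper: each sample is compared against the mean and
    standard deviation of the previous `window` samples.
    """
    recent = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(latencies_ms):
        if len(recent) == window:
            baseline = mean(recent)
            spread = stdev(recent)
            # Skip the check when the baseline is perfectly flat (spread == 0).
            if spread > 0 and (value - baseline) / spread > threshold:
                flagged.append(i)
        recent.append(value)
    return flagged


# Made-up query latencies in milliseconds; the last two samples spike.
samples = [100, 102, 98, 101, 99, 100, 103, 250, 400]
flagged_indexes = latency_warnings(samples)
```

With these sample values, the two spikes at the end are flagged while the normal jitter before them is not. A production system would need something more robust than a fixed window and threshold, but the principle is the same: learn what "normal" looks like and raise a warning when behavior drifts away from it.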
Leading organizations like GoDaddy and HCL use Moogsoft today to get early warnings of service-impacting incidents through the use of machine learning. Moogsoft is able to ingest massive volumes of alerts from your monitoring ecosystem, massively reduce the noise, and identify previously unknown relationships across your production stack to create clusters of alerts (we call them Situations) that isolate the existence of an incident. Ops teams then have a wealth of Situational context to understand the root cause and resolve the underlying issue.
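As a rough illustration of what clustering alerts means, the sketch below groups alerts that arrive close together in time. This is only a toy example and not Moogsoft's algorithm, which is not described here; the `Alert` class, the `group_by_time` function, the 60-second gap and the sample alerts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since an arbitrary start time
    source: str       # hypothetical host or service name
    message: str


def group_by_time(alerts, max_gap_s=60.0):
    """Group alerts so that consecutive alerts in a group are at most
    `max_gap_s` seconds apart. Illustrative only: real correlation would
    also consider topology, text similarity and other signals."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= max_gap_s:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


# Made-up alerts: three arrive within a minute, one arrives much later.
alerts = [
    Alert(0, "db-01", "slow query"),
    Alert(20, "app-03", "request timeout"),
    Alert(45, "lb-01", "5xx rate high"),
    Alert(600, "db-02", "disk usage 80%"),
]
clusters = group_by_time(alerts)
```

Here the first three alerts land in one cluster and the unrelated disk alert lands in its own, so an operator would look at two groups instead of four separate alerts.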
We live in a time when service outages mean lost business. In order to stay competitive, organizations need to take a modern approach that automates manual and error-prone tasks through the use of machine learning.
About the author Sahil Khanna
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.