Yahoo: From Alert Fatigue to Actionable Operational Insights
Moogsoft helps Yahoo distill millions of alerts every day into the situational insights that matter
January 1, 2023
Yahoo is the parent company of some of the most popular sites on the internet, including Yahoo Sport, Yahoo Finance, and other household names such as TechCrunch, AOL, Tumblr, and Engadget. In September 2021, Apollo Global Management acquired Yahoo (formerly known as Verizon Media Group, itself formerly known as Oath) for $5 billion.
As a company built through acquisitions, Yahoo found the delivery of its hundreds of media services dependent on an extraordinarily complex and highly heterogeneous technology environment, one that consisted of disparate legacy systems, cloud systems, and various types of infrastructure. Yahoo sought ways to streamline.
Those efforts began with a substantial shift to Amazon Web Services, as well as public cloud services from Google and Microsoft. To streamline application development and management, Yahoo also increasingly embraced microservices, continuous delivery, and DevOps. These moves helped teams ship application enhancements more rapidly, test without human intervention, and release their software with increased agility.
Still, throughout the transition, when it came to effective operations management, the IT operations team faced significant challenges they would need help to overcome.
“What Moogsoft offers in terms of its technology goes far beyond what other vendors make available”
The Need to Move from Event Alert Overload to Situational Context
Yahoo’s infrastructure and the applications it supports remained enormously interdependent and complex. With this underlying application infrastructure and loosely coupled microservice-based applications, a breakdown anywhere in the service chain could kick off thousands of alerts and cause multiple application or service failures. The environment is so complex that the operations teams’ traditional operations management toolsets could not consume the vast number of events. Overwhelmed with alerts, the teams were unable to identify the root cause of potential system issues and service interruptions.
Consider this: the Yahoo infrastructure and supporting systems that power the 424 media services generate roughly 2 million alerts a day. The team needed to find the signal in all of that noise and identify the alerts and situations that could have a real impact on application availability and performance.
According to Devan Franchini, production operations software engineer at Yahoo, operations teams would be overwhelmed with alerts, and not be able to see the full context of the events behind the alerts. “Engineers would get an alert and move to resolve the situation. They’d then find a host or some other asset was not available. They’d create an incident ticket, but that failure already had an incident ticket because it was part of a larger outage underway,” Franchini says.
This meant teams wasted an excessive amount of time trying to triage specific symptomatic incidents because they couldn’t see the entire situation clustered into a context that made sense. “People couldn’t see the entire scope of impact,” he adds. There was also a broader business challenge: being able to see operational event context across all the business units, especially for services, such as email, that span those business units.
“You don’t have to worry about having to move to another platform because Moogsoft is constantly growing and improving. That’s also very important to us”
Getting to the Signal Needed to Proactively Detect Events Earlier, and Swiftly Fix Problems
There was only one IT operations platform Yahoo found that could provide everything they needed: Moogsoft. Powered by purpose-built machine learning algorithms, Moogsoft is the pioneering AI platform for DevOps, SRE and ITOps teams. Moogsoft reduces alert noise to the point that these teams can see the actionable situations that need immediate attention and that point to the root cause of underlying problems. Moogsoft achieves this by removing the alerts that don’t matter and then correlating similar alerts into a clustered situation. The platform then provides a root cause suggestion and enables multiple teams to collaborate and remediate incidents more rapidly. “What Moogsoft offers in terms of its technology goes far beyond what other vendors make available,” Franchini says.
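The two steps described above, dropping alerts that add nothing new and then grouping related alerts into a situation, can be illustrated with a minimal sketch. This is not Moogsoft's algorithm, which the case study does not describe. It is a deliberately simple stand-in that deduplicates on an assumed `(service, host, check)` signature and groups alerts for the same service that arrive within an assumed five-minute window. The `Alert` and `Situation` types and every field name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape; real monitoring payloads will differ.
    timestamp: float  # seconds since some reference point
    service: str
    host: str
    check: str


@dataclass
class Situation:
    service: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self) -> float:
        return self.alerts[-1].timestamp


def deduplicate(alerts):
    """Keep only the earliest alert for each (service, host, check) signature."""
    seen = set()
    unique = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        signature = (alert.service, alert.host, alert.check)
        if signature not in seen:
            seen.add(signature)
            unique.append(alert)
    return unique


def correlate(alerts, window_seconds=300.0):
    """Group alerts for one service that arrive within window_seconds
    of the previous alert in that service's current situation."""
    latest = {}      # service -> most recent Situation for that service
    situations = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.service)
        if current is not None and alert.timestamp - current.last_seen <= window_seconds:
            current.alerts.append(alert)
        else:
            current = Situation(service=alert.service, alerts=[alert])
            latest[alert.service] = current
            situations.append(current)
    return situations
```

Calling `correlate(deduplicate(alerts))` on a burst of repeated host alerts yields a short list of situations, each carrying the alerts that belong to it. A production system would correlate on far richer signals (topology, text similarity, learned patterns) than a fixed time window.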
Yahoo used Moogsoft’s direct Datadog REST adapter integration and also integrated Moogsoft with ServiceNow. The operations team receives ServiceNow webhook information and leverages API calls to automate certain functions, such as ticket association.
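The case study does not show how ticket association is implemented. As a rough illustration of the idea, the sketch below attaches a situation to an existing open ticket when one exists and creates a ticket only otherwise, which is what prevents the duplicate incident tickets Franchini describes. It uses no real ServiceNow or Moogsoft API: `open_tickets`, `situation_id`, and the `create_ticket` callback are all hypothetical names standing in for whatever the real integration provides.

```python
def associate_ticket(open_tickets, situation_id, create_ticket):
    """Return (ticket_id, created) for a situation.

    open_tickets  -- dict mapping situation_id -> ticket_id (hypothetical store)
    create_ticket -- callable taking situation_id and returning a new ticket id;
                     in practice this would wrap a call to the ticketing system
    """
    existing = open_tickets.get(situation_id)
    if existing is not None:
        # A ticket is already open for this situation: reuse it.
        return existing, False
    ticket_id = create_ticket(situation_id)
    open_tickets[situation_id] = ticket_id
    return ticket_id, True


def make_fake_ticket_creator(prefix="INC"):
    """Build a stand-in ticket creator that hands out sequential ids."""
    counter = {"n": 0}

    def create(situation_id):
        counter["n"] += 1
        return f"{prefix}{counter['n']:04d}"

    return create
```

With this shape, two engineers reacting to alerts from the same situation end up on the same ticket, because the second call finds the first ticket instead of opening another.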
The Yahoo Operations team deployed Moogsoft within both its production and test environments. “Our test environment has assisted us in fine-tuning our situation clustering logic before on-boarding it to production,” he says.
In the production environment, Moogsoft is helping the operations team monitor all 424 unique business services, as well as Yahoo’s internal infrastructure. Moogsoft ingests 2 million raw events a day from 60,168 individual data sources through six monitoring agents, and distills those 2 million events down to 10,000 alerts and close to 4,000 situations. That is a reduction of more than 99% in the noise hitting the IT operations teams.
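The reduction figure can be checked directly from the numbers quoted above. The short calculation below uses only those figures (2 million events, 10,000 alerts, roughly 4,000 situations). The helper name `reduction_percent` is just an illustrative choice.

```python
def reduction_percent(before, after):
    """Percentage by which `after` is smaller than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


raw_events = 2_000_000   # daily raw events, as quoted
alerts = 10_000          # alerts after noise reduction, as quoted
situations = 4_000       # approximate situations, as quoted

print(f"events -> alerts:     {reduction_percent(raw_events, alerts):.1f}%")      # 99.5%
print(f"events -> situations: {reduction_percent(raw_events, situations):.1f}%")  # 99.8%
```

Measured against alerts, the reduction is 99.5%. Measured against situations, it is about 99.8%. Both are consistent with the roughly 99% figure cited.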
Yahoo’s IT operations team leverages Moogsoft to get a comprehensive view of services that span the entire AOL and Yahoo portfolio of web services. “We’ve done everything we could to make sure that everyone is monitoring the same things. If there is an alert, it is handled through Moogsoft,” Franchini says.
To date, Moogsoft has helped Yahoo avoid several costly outages. “Our operations center engineers have been able to identify and assist in remediation of financially impacting outages, saving us a great deal of money,” he says.
Whether it’s a business service that starts to falter, or an entire data center, Moogsoft helps to ensure that the appropriate teams for the appropriate services are notified and can quickly remedy the situation. “Moogsoft provides us the ability to see how situations will be clustered. How Moogsoft clusters situations is very helpful to us and saves us time. As we start incorporating more AI functionality, it will save us even more time as well as improve the dynamics we have with some of our supporting teams,” he says.
Beyond the Moogsoft platform itself, Franchini and the team appreciate the company behind the technology. “I love the nature of Moogsoft. They are willing to work very closely with us. And the fact that the company is so very flexible and constantly improving, and not stagnant like so many other vendors. We don’t have to worry about having to move to another platform because Moogsoft is constantly growing and improving. That’s also very important to us,” he says.
Internet Media Technology
The IT operations team found their traditional operations management tools lacking: they were unable to effectively identify the root cause of potential system issues and the source of service interruptions.
- Reduced alert noise
- Monitoring across multiple business units
- Moogsoft’s AI-powered correlation surfaces anomalies from both Amazon CloudWatch and on-premises tools and correlates related data into a single incident
- A comprehensive view