Applying AI and ML to its huge and complex IT environment has yielded great benefits for the global managed services provider
As part of its pledged “fanatical” commitment to the success of its 125,000 customers across the globe, Rackspace optimizes its IT operations to the fullest.
That way, this leading MSP ensures the highest quality and reliability of its vast portfolio of IT services across public and private clouds, and dedicated servers.
“We support everything from gaming to e-commerce, telecommunications, medical, aviation -- you name it. Rackspace hosts customers that are the backbone of these industries,” said JP Gonzalez, Principal Engineer at Rackspace.
As part of its continued quest for excellence in IT operations, Rackspace has now turned its attention to AIOps, recognizing the technology’s ability to help it further streamline and automate its increasingly complex IT environment.
“Managing large multi-cloud fleets poses challenges beyond the standard single-tenant scope,” said Gonzalez, who talked about the company’s journey to AIOps during the event “Moogsoft Enterprise 8.0: The Virtual NOC Is Here.”
Here we offer highlights of the event, which you can also watch in its entirety.
IT operations challenges
Gonzalez, whose role bridges the gap between the leadership and the technical “boots on the ground” teams, outlined the IT operations challenges Rackspace faces as a global MSP, and explained why early incident detection, precise cause identification and quick resolution are key.
For starters, its complex SLAs (service level agreements) with customers require immediate notification of issues. In addition, shared service issues often impact more than one customer. And obviously, outages must be avoided at all costs.
Of course, the scale and complexity of its IT environment are considerable. It comprises products from many vendors, covering a wide scope of technologies and platforms. Management becomes challenging with the proliferation of hybrid multi-cloud operations and unique vendor APIs.
Rackspace also must integrate the technology portfolios and CMDBs of companies it acquires. Finally, it must comply with its change management policies, as well as those of its customers.
Bottom line: Rackspace looked to AIOps to improve its IT operations workflows and increase efficiency, for example by boosting system visibility and streamlining ticketing processes.
The path to AIOps
The first step Rackspace took on its path to AIOps was defining what applying AI to its IT operations would mean and what its scope should be, Gonzalez said. Rackspace concluded that it would combine orchestration with supervised machine learning to achieve:
- Better telemetry analysis
- Faster root cause determination
- Self-healing where automation exists
- Cost reduction
- Better customer experiences through faster resolution
Byproducts of these improvements would include better governance, increased employee productivity, stronger support and higher uptime, according to Gonzalez.
The Moogsoft partnership
At this early point in its AIOps journey, Rackspace chose Moogsoft as its partner, leveraging not only Moogsoft’s industry-leading technology, but also its expertise.
“Moogsoft has been fantastic about providing their experiences and their knowledge, and sharing it in a very constructive manner,” he said.
Rackspace outlined its criteria for success, and assessed the opportunity to sharpen its IT operations monitoring tools and processes with AIOps, including alert deduplication, event correlation, API integration, orchestration and automation.
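To make the first item on that list concrete, here is a minimal sketch of alert deduplication in Python. It is not Moogsoft's algorithm or Rackspace's implementation: the `Alert` fields, the choice of `(source, device, check)` as the deduplication key, and the sample data are all assumptions made for illustration.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Alert:
    # All field names are hypothetical; real monitoring payloads differ by tool.
    source: str   # monitoring tool that raised the alert
    device: str   # affected device
    check: str    # what was being measured
    message: str  # free-text description


def deduplicate(alerts):
    """Collapse alerts that share (source, device, check) into one entry with a count."""
    seen = OrderedDict()
    for alert in alerts:
        key = (alert.source, alert.device, alert.check)
        if key in seen:
            seen[key]["count"] += 1  # repeat of an alert we already hold
        else:
            seen[key] = {"alert": alert, "count": 1}  # first occurrence is kept
    return list(seen.values())


sample = [
    Alert("scom", "web-01", "cpu", "CPU above 95%"),
    Alert("scom", "web-01", "cpu", "CPU above 95%"),
    Alert("zenoss", "db-02", "disk", "Disk 90% full"),
    Alert("scom", "web-01", "cpu", "CPU above 97%"),
]

unique = deduplicate(sample)
for entry in unique:
    print(entry["alert"].device, entry["alert"].check, entry["count"])
```

In this sketch, four raw alerts collapse into two entries. Event correlation, the next item on the list, would then group related entries across devices, which is a harder problem than this key-based matching.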
Rackspace worked with Moogsoft to design its data flows and to decide how its users' experience and workflows should look, since both would change with AIOps and thus require a new way of thinking.
“The idea of chasing down every single alert dissipates, and the idea to try to find resolution to root cause problems becomes more valuable,” Gonzalez said.
Thus, Rackspace needed to determine what systems and processes it was going to create, augment or replace.
Rackspace decided to use an internal messaging bus that’s connected to multiple telemetry sources to transfer data to Moogsoft for algorithmic correlation and root cause analysis, and then send the data out to reporting, automation and support systems.
Results and progress
Rackspace has had the Moogsoft AIOps platform in place for about a year, with five monitoring sources and one CMDB linked to it. In a recent 30-day period, 1.7 million alerts were generated from 100,000 monitored devices.
With Moogsoft AIOps, Rackspace reduced alert noise by 99%, ending up with about 15,000 unique incidents, each with its probable root cause identified. About 20% of the incidents were made up of multiple alerts, and 13% required the involvement of multiple teams.
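The 99% figure can be checked against the numbers quoted for the 30-day period. The short Python calculation below uses only those rounded figures, so the results are approximate. The variable names are ours, and the per-device and per-incident averages are derived here rather than reported by Rackspace.

```python
# Figures quoted for the recent 30-day period (all rounded in the source).
alerts = 1_700_000
incidents = 15_000
devices = 100_000

# Share of raw alerts that did not surface as separate incidents.
noise_reduction = 1 - incidents / alerts

# Derived averages, not figures reported by Rackspace.
alerts_per_device = alerts / devices
alerts_per_incident = alerts / incidents

# Approximate incident counts behind the quoted percentages.
multi_alert = round(0.20 * incidents)
multi_team = round(0.13 * incidents)

print(f"noise reduction: {noise_reduction:.1%}")
print(f"alerts per device: {alerts_per_device:.0f}")
print(f"alerts per incident: {alerts_per_incident:.0f}")
print(f"multi-alert incidents: {multi_alert}")
print(f"multi-team incidents: {multi_team}")
```

On these inputs the reduction works out to roughly 99.1%, which matches the quoted 99%. That is an average of 17 alerts per device over the period and about 113 raw alerts for every incident.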
“We’re very appreciative of our relationship with Moogsoft and are looking to expand and encompass more and more of our assets,” Gonzalez said.
Specifically, Rackspace plans to increase the number of monitoring sources to more than 50, the number of CMDBs to six and the number of devices monitored to more than 500,000, generating more than 5 million alerts per month.
“This is where the journey starts,” he said.
Listen to the entire webcast to get all the details about Rackspace’s deployment of Moogsoft AIOps, the benefits it has derived from it, and how it plans to extend it further. The webcast also included a question-and-answer session with the audience. Below is an edited transcript of the questions Gonzalez answered.
What systems do you have feeding events into Moogsoft?
We’re utilizing SCOM (Microsoft System Center Operations Manager), what used to be called Nimbus (now owned by Broadcom), Zenoss, a homegrown monitoring system from Rackspace, EMC’s VNX, a homegrown CMDB and a homegrown notification system.
What technology are you using for the message bus?
We use Kafka on a Google Cloud instance.
Can you elaborate on metrics like noise reduction and MTTR from your AIOps use?
The main challenge is to change the style of your work. We had to change our reporting metrics for our management stats because we’re moving from a transactional state to a project state where we’re resolving issues. It’s no longer a cookie-cutter approach of “let me bounce this work through queues and pass it around.” The time reductions we’ve achieved have translated into the ability to resolve customer issues as opposed to just Band-Aid them.
How does Rackspace plan to handle the 50+ source integration points and control that complexity?
Very carefully. There’s a lot of work with that. Over time, we may consolidate some of the 50+ sources, but the telemetry points come from everything -- from computing network stacks all the way to facility management. In terms of the complexity, there’s a normalization process. As you know, a critical alert for one vendor may be an urgent severity alert for another vendor. At the same time, normalization of the descriptions and of that type of data is a process. So we have built our own internal methods to manage the normalization of the data as we send it off to Moogsoft.
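The normalization Gonzalez describes can be pictured with a small Python sketch. Rackspace's internal methods are not public, so everything here is an assumption: the vendor names, the severity labels, the 0-to-5 scale, and the event field names are all hypothetical. The idea is only that each vendor's severity vocabulary is mapped onto one shared scale, and descriptions are cleaned up, before events are sent on.

```python
# Hypothetical mapping from (vendor, vendor-specific label) to a shared 0-5 scale.
SEVERITY_MAP = {
    ("vendor_a", "critical"): 5,
    ("vendor_b", "urgent"): 5,   # a different label with the same meaning
    ("vendor_a", "warning"): 3,
    ("vendor_b", "minor"): 2,
}
UNKNOWN = 0  # used when a vendor/label pair has not been mapped yet


def normalize(event):
    """Return a copy of the event with a common severity and a tidied description."""
    vendor = event["vendor"].strip().lower()
    label = event["severity"].strip().lower()
    return {
        "vendor": vendor,
        "severity": SEVERITY_MAP.get((vendor, label), UNKNOWN),
        # Collapse runs of whitespace so equivalent descriptions compare equal.
        "description": " ".join(event["description"].split()),
    }


raw_events = [
    {"vendor": "Vendor_A", "severity": "CRITICAL", "description": "Link  down on\tport 3"},
    {"vendor": "vendor_b", "severity": "Urgent", "description": "Link down on port 3"},
    {"vendor": "vendor_b", "severity": "informational", "description": "Heartbeat ok"},
]

normalized = [normalize(e) for e in raw_events]
for event in normalized:
    print(event)
```

In this sketch the first two events end up with the same severity and the same description even though their vendors labeled them differently. The unmapped label falls back to `UNKNOWN` rather than raising an error, so a new source can be onboarded before its full vocabulary is mapped.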
How much of your IT environment are you monitoring with Moogsoft AIOps right now, and what is your end goal?
We have a large part of our network stack, the majority of our compute stack, and a large part of our storage stack integrated into Moogsoft AIOps. We’re working to do everything in our data centers -- the monitoring systems in those data centers first, and as we’re doing that, we’ll start parallel efforts to ingest public cloud monitoring components into the system. Once we have that infrastructure built out and we’re watching the entire infrastructure layer, we’ll start plugging in the application layer for all the services we provide, and then we’ll have full visibility across the hybrid model -- from the application, with security services included, to the infrastructure layer -- and be able to correlate issues.
Watch the full event, which also includes a demo of Moogsoft Enterprise 8.0. This major release features groundbreaking capabilities for remote Ops teams to work effectively from virtual NOCs underpinned by AIOps.