Zeeshan Sabir of Qualcomm presented at the Evolve Conference 2018 and shared their journey of transformation. This journey involved using AIOps to help solve the big data problem for IT Operations.
As Zeeshan described, the journey was one to implement a next generation automation and telemetry system. Their first step was to understand the problem in great detail and specifically what needed to be solved.
The issues identified included:
- Siloed approach to monitoring and operations
- Many disparate tools
- Exponential rise in event load
- No correlation of events
- Drowning in data and struggling to find useful insights
- Expanding MTTD (mean time to detection) and MTTR (mean time to resolution)
- Lack of knowledge sharing
- Siloed automation and diagnostics
- Dynamic infrastructure with fast pace of change
- Cloud migration initiatives
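To make the MTTD and MTTR item in the list above concrete, here is a minimal sketch of how the two averages could be computed from incident timestamps. The field names (`started`, `detected`, `resolved`) and the sample incidents are invented for illustration and do not come from Qualcomm's data. The sketch also assumes MTTR is measured from the start of the impact; some teams measure it from detection instead.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average gap in minutes between two timestamps across all incidents."""
    gaps = [(i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents]
    return mean(gaps)


# Hypothetical incidents, made up for this example.
t0 = datetime(2018, 1, 1, 9, 0)
incidents = [
    {"started": t0, "detected": t0 + timedelta(minutes=10), "resolved": t0 + timedelta(minutes=70)},
    {"started": t0, "detected": t0 + timedelta(minutes=20), "resolved": t0 + timedelta(minutes=50)},
]

mttd = mean_minutes(incidents, "started", "detected")  # mean time to detection
mttr = mean_minutes(incidents, "started", "resolved")  # mean time to resolution
print(f"MTTD: {mttd} min, MTTR: {mttr} min")
```

Tracking both numbers over time shows whether detection or resolution is the part that is expanding.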
With the problem clearly understood and agreed upon, the approach had to be defined. This boiled down to four categories:
- Take an agile incremental approach
- Service context as opposed to silo context
- Machine learning and AI to solve the big data problem
- Simplify operations
Incremental Approach
An agile and iterative approach was the best way forward. As Qualcomm learned, adjustments would be made as they navigated the transformation. This required buy-in from the business application owners as well as the technical stakeholders. The most critical applications would get the initial focus with agreement from all parties. There also had to be agreement on which KPIs and dashboards would be useful to each audience, from the technical teams to the executives making business decisions.
Service Context
When service impacts occurred, each silo had visibility into their own domain. Qualcomm recognized the need to democratize the data to obtain a service context as opposed to a siloed context. The silos were taken down and full data transparency was achieved.
Machine Learning – AIOps to Solve the Big Data Problem
Due to the data explosion, it was recognized that no human would be able to keep up. The old way of writing rules, defining filters, and continuous metric and threshold tuning no longer sufficed. A futuristic product with AIOps was needed to help reduce noise and correlate alerts. The old way resulted in expanded MTTD and MTTR, and it continued to get worse day by day. The transformation to Moogsoft AIOps became a necessity as opposed to a “nice to have”.
Simplify Operations
In the old way, the workflow enabled the siloed approach. Each team had access to their own data, knowledge base information, diagnostic scripts and automations. It was critical to closely integrate the system of engagement (Moogsoft) with the system of record (ServiceNow). This allowed for transparency when incidents were being worked. The Moogsoft Situation Room contains a collaborative workflow capturing notes, diagnostic steps, and the final remediations or resolving steps. All of the existing Chef, Puppet, and Ansible automation scripts could be leveraged from within the Situation Room by the operators. During repeat issues, Moogsoft presents similar Situations to the operator. This prevents longer MTTR by providing all of the notes, diagnostics, and remediations without having to go searching and mining for the data.
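The idea of presenting similar past Situations to the operator can be sketched with a simple set-overlap score. This is only an illustration: it is not Moogsoft's algorithm, and the function names, the record layout (`id`, `alerts`, `resolution`), and the 0.5 threshold are all assumptions made for the example.

```python
def jaccard(a, b):
    """Share of alert types two situations have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_situations(new_alerts, history, threshold=0.5):
    """Return past situations whose alerts overlap enough, best match first."""
    scored = [(jaccard(new_alerts, past["alerts"]), past) for past in history]
    matches = [(score, past) for score, past in scored if score >= threshold]
    return sorted(matches, key=lambda m: m[0], reverse=True)


# Hypothetical history of resolved situations with their recorded fixes.
history = [
    {"id": "S-101", "alerts": {"db-latency", "api-timeouts", "queue-backlog"},
     "resolution": "Restarted connection pool"},
    {"id": "S-102", "alerts": {"disk-full", "log-rotation-failed"},
     "resolution": "Purged old logs"},
]

matches = similar_situations({"db-latency", "api-timeouts"}, history)
for score, past in matches:
    print(past["id"], round(score, 2), past["resolution"])
```

The point of the sketch is the workflow: when a repeat issue arrives, the earlier notes and remediation come back with it, so the operator does not have to go searching for them.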
Does IT Operations Need Big Data Insurance?
As the Qualcomm story illustrates, companies are constantly evaluating how to innovate, reduce costs, and do more with less. In this fast-paced, agile business environment, how much are companies spending on insurance policies for their data and analytics? Is the cost possibly outweighing the benefit?
We buy all types of insurance, including life, car, health, and other specialty programs. The insurance is in place to protect us from financial ruin. We must weigh the cost of the insurance versus the cost of what is being insured to see if it makes sense. The challenge is to find the right amount of insurance at the right cost to prevent a potential future financial ruin. Insurance is something we hope never to use, but it is required for our own protection.
In pondering this topic, it’s easy to wonder if we are heading in the wrong direction with big data, particularly looking through the lens of operational use cases. There are some uses for big data such as business analytics, but do we need to store the volumes that are being generated? The question must be asked: what will we do with all of this data? Can some of it be scaled back? Do we need this for our day to day operations?
Big Data in IT Operations
Does your organization hoard data and bear the high costs of management and storage? There are valid use cases for big data in IT organizations; however, the usefulness of hoarding data for the specific use cases around IT Operations must be questioned.
When service impacts occur, various timers start ticking away. There is the customer satisfaction timer, the cost of operations timer, and potentially a financial penalty timer. The goal in IT Operations is to restore service as quickly as possible, so information is needed in real time. Yet legacy approaches apply a historical and reactive posture to this real-time need.
For many in IT Operations, the answer to service impacts is in the data lake and possibly at the bottom covered in wet green moss. Restoring service this way is time consuming, and it requires deep knowledge and a sense of where to fish in the data lake.
There are some eerily similar patterns between the data lake and the long-term network packet capture methodologies used in network monitoring and operations. In previous roles, I helped network teams implement long-term packet capture strategies to ensure optimal application delivery across the network. The cost to store a couple of days or a month of packets was hefty, and that long-term data was rarely used except for some unique edge cases. In primary day-to-day operations, it was the protocol analysis of the last hour or 15 minutes that proved most useful for solving issues. When network latency increased, it was good to know what was consuming the network resources at that time. Operations needed to know quickly which applications and endpoints were the culprits. This could be found with recent data, and rarely did the operations team dip into the deep vault of stored packet data.
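The point about recent data can be illustrated with a small rolling window that keeps only the last 15 minutes of traffic records and reports the heaviest endpoints. This is a sketch under stated assumptions: the `RecentTraffic` class, its method names, and the sample records are hypothetical, timestamps are plain seconds, and records are assumed to arrive in time order.

```python
from collections import Counter, deque


class RecentTraffic:
    """Keeps only traffic records from the last `window_seconds`."""

    def __init__(self, window_seconds=900):  # 900 s = 15 minutes
        self.window_seconds = window_seconds
        self.records = deque()  # (timestamp, endpoint, num_bytes), oldest first

    def add(self, timestamp, endpoint, num_bytes):
        self.records.append((timestamp, endpoint, num_bytes))
        self._expire(timestamp)

    def _expire(self, now):
        # Drop everything older than the window instead of storing it long term.
        cutoff = now - self.window_seconds
        while self.records and self.records[0][0] < cutoff:
            self.records.popleft()

    def top_talkers(self, n=3):
        totals = Counter()
        for _, endpoint, num_bytes in self.records:
            totals[endpoint] += num_bytes
        return totals.most_common(n)


traffic = RecentTraffic(window_seconds=900)
traffic.add(0, "10.0.0.5", 5000)     # falls out of the window by t=1000
traffic.add(600, "10.0.0.7", 2000)
traffic.add(1000, "10.0.0.9", 8000)
print(traffic.top_talkers())
```

Because old records are discarded as new ones arrive, storage stays bounded while the question operators actually ask (who is consuming the network right now) remains answerable.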
Data privacy also becomes a consideration when capturing and storing vast quantities of disparate data sets. Not only is the data costly to manage, but there are security and compliance vulnerabilities lurking in the lake like a hungry piranha, ready to take a bite.
Returning to our question on the need for big data insurance: do the cost structure and vulnerabilities of the data lake outweigh the benefits to the business for operational use cases?
AIOps – The Right Insurance for IT Operations
Most environments that rely on big data for operations wire the monitoring sources directly into the data lake. This includes streaming of time series metrics, events, logs, and other pieces of data. When service impacts occur, the operators are left to comb through a plethora of data to sort out the false positives, symptoms, and root causes. The data volumes are overwhelming, and they make it hard for any human to find the proverbial needle in the haystack.
The alternative is to rewire the data flows to optimize the information for quick problem resolution. By feeding the underlying monitoring source event data directly into the Moogsoft AIOps platform, real-time analysis can be performed without the need for costly long term data storage and reactive postures. This helps to reduce data storage and management costs, mitigate the risks of data security vulnerabilities, and eliminate the exposure of private data.
As Qualcomm discovered, Moogsoft AIOps provides algorithmic noise reduction, similar alert clustering across domains, and ultimately a collaborative team-based workflow around situations that need to be resolved. This methodology yields a shorter mean time to detection (MTTD), a shorter mean time to resolution (MTTR), and ultimately higher customer satisfaction with your services. The value of Moogsoft AIOps compared to the reactive data lake approach is realized in lower operational costs, higher customer satisfaction, and innovation that stays on track instead of constant firefighting efforts.
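As a rough illustration of noise reduction followed by alert clustering, the sketch below collapses repeated alerts and then groups the remainder by time proximity. It is a deliberately simple stand-in and not the algorithms Moogsoft uses; the alert fields (`time`, `source`, `message`), the function names, and the 120-second gap are assumptions chosen for the example.

```python
def deduplicate(alerts):
    """Collapse repeats of the same (source, message) into one alert with a count."""
    merged = {}
    for alert in alerts:
        key = (alert["source"], alert["message"])
        if key in merged:
            merged[key]["count"] += 1
        else:
            merged[key] = {**alert, "count": 1}
    return list(merged.values())


def cluster_by_time(alerts, gap_seconds=120):
    """Group alerts so that each one is within `gap_seconds` of the previous one."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        if clusters and alert["time"] - clusters[-1][-1]["time"] <= gap_seconds:
            clusters[-1].append(alert)
        else:
            clusters.append([alert])
    return clusters


# Hypothetical raw alerts from different domains (database, web, storage).
raw = [
    {"time": 0, "source": "db01", "message": "high latency"},
    {"time": 5, "source": "db01", "message": "high latency"},
    {"time": 30, "source": "web03", "message": "timeouts"},
    {"time": 900, "source": "nas02", "message": "disk 90% full"},
]

alerts = deduplicate(raw)             # 4 raw alerts become 3
situations = cluster_by_time(alerts)  # 3 alerts become 2 groups
for group in situations:
    print([(a["source"], a["count"]) for a in group])
```

Here the database and web alerts land in one group because they fire close together, which is the kind of cross-domain grouping that lets teams work on one situation together instead of on separate tickets.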
What’s Next?
The Qualcomm story provides a strong blueprint for other customers on how to approach their own transformational journey. It takes a mixture of people, process, and technology to make it all work out and achieve the desired results.
Watch Zeeshan's video to get his perspective on how Qualcomm navigated this transformational journey: https://player.vimeo.com/video/269458768