Big Data Insurance – Is it needed for IT Operations?
Thursday January 10 2019
Is the cost of your Big Data Insurance outweighing the benefits to IT Operations?
We buy all types of insurance, including life, car, health and other specialty programs. The insurance is in place to protect us from financial ruin. We must weigh the cost of the insurance versus the cost of what is being insured to see if it makes sense. The challenge is to find the right amount of insurance at the right cost to prevent a potential future financial ruin. Insurance is something we hope never to use but is required for our own protection.
In the fast-paced, agile modern business environment, companies are constantly evaluating how to innovate, reduce costs and do more with less. How much are companies spending on insurance policies for their data and analytics? Is the cost possibly outweighing the benefit?
As I’ve pondered this topic, I’ve wondered if we are heading in the wrong direction with big data, particularly when looking through the lens of operational use cases. I would agree there are some uses for big data, such as business analytics, but do we need to store the volumes that are being generated? The question must be asked: what will we do with all of this data? Can some of it be scaled back? Do we need this for our day-to-day operations?
Big Data in IT Operations
Does your organization hoard data and bear the high costs of management and storage? There are valid use cases for big data in IT organizations; however, I question the usefulness of hoarding data for the specific use cases around IT Operations.
When service impacts occur, various timers start ticking away. There is the customer satisfaction timer, the cost of operations timer, and potentially a financial penalty timer. The goal in IT Operations is to restore service as quickly as possible. The information is needed in real time, yet legacy approaches try to meet that real-time need with a historical and reactive posture.
For many in IT Operations, the answer to service impacts is in the data lake and possibly at the bottom covered in wet green moss. This way of restoring service is time-consuming and demands deep knowledge and a sense of where to fish in the data lake.
I see some eerily similar patterns between the data lake and the long-term network packet capture methodologies used in network monitoring and operations. In previous roles, I helped network teams implement long-term packet capture strategies to ensure optimal application delivery across the network. The cost to store a couple of days or a month of packets was hefty, and that long-term data was rarely used except for some unique edge cases. In the primary day-to-day operations, it was the protocol analysis of the last hour or 15 minutes that proved to be the most useful data for solving issues. When the network latency increased, it was good to know what was consuming the network resources at that time. Operations needed to know quickly which applications and endpoints were the culprits. This could be found with recent data, and the operations team rarely dipped into the deep vault of stored packet data.
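To make the "last 15 minutes" point concrete, here is a minimal Python sketch of a rolling window that keeps only recent traffic records and reports the top consumers. The class name, the record fields and the 900-second window are my own illustrative assumptions, not part of any particular packet capture product, and the sketch assumes records arrive in timestamp order.

```python
from collections import Counter, deque


class RecentTraffic:
    """Keeps only the most recent window of traffic records (hypothetical structure)."""

    def __init__(self, window_seconds=900):
        self.window_seconds = window_seconds
        # Each record is (timestamp, application, byte_count), oldest first.
        self._records = deque()

    def add(self, timestamp, application, byte_count):
        """Store a new record, then drop anything older than the window."""
        self._records.append((timestamp, application, byte_count))
        cutoff = timestamp - self.window_seconds
        while self._records and self._records[0][0] < cutoff:
            self._records.popleft()

    def top_talkers(self, n=3):
        """Return the n applications that moved the most bytes in the window."""
        totals = Counter()
        for _, application, byte_count in self._records:
            totals[application] += byte_count
        return totals.most_common(n)


traffic = RecentTraffic(window_seconds=900)
traffic.add(0, "backup", 5_000_000)     # ages out once the clock passes 900
traffic.add(600, "video", 2_000_000)
traffic.add(1000, "crm", 1_000_000)
traffic.add(1200, "video", 3_000_000)
top = traffic.top_talkers(2)
print(top)
```

The storage footprint stays bounded by the window rather than growing with every day of capture, which is the trade-off the packet capture story illustrates.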
Data privacy also becomes a consideration when capturing and storing vast quantities of disparate data sets. Not only is the data costly to manage, but there are security and compliance vulnerabilities lurking in the lake like a hungry piranha ready to take a bite.
Let's return to our question on the need for big data insurance. Do the cost structure and vulnerabilities of the data lake perhaps outweigh the benefits to the business around operational use cases?
AIOps – The Right Insurance for IT Operations
Most environments that rely on big data for operations wire the monitoring sources directly into the data lake. This includes streaming of time series metrics, events, logs and other pieces of data. When service impacts occur, the operators are left to comb through a plethora of data to sort out the false positives, symptoms and root causes. The data volumes are overwhelming, and it is a challenge for the human mind to find the proverbial needle in the haystack.
The alternative is to rewire the data flows to optimize the information for quick problem resolution. By feeding the underlying monitoring source event data directly into the Moogsoft AIOps platform, real-time analysis can be performed without the need for costly long-term data storage and reactive postures. This helps to reduce data storage and management costs, mitigate the risks of data security vulnerabilities, and eliminate the exposure of private data.
Moogsoft AIOps provides algorithmic noise reduction and clustering of similar alerts across domains, and ultimately creates a collaborative, team-based workflow around situations that need to be resolved. This methodology yields faster mean time to detection (MTTD), faster mean time to resolution (MTTR) and ultimately higher customer satisfaction with your services. The value of Moogsoft AIOps compared to the reactive data lake approach is realized in lower operational costs, higher customer satisfaction and innovation that stays on track versus constant firefighting efforts.
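As a rough illustration of what noise reduction and alert clustering mean in practice, here is a deliberately simple Python sketch. It is not Moogsoft's algorithm; it only drops repeated alerts and groups the remainder by how close together they arrive. The `Alert` fields, the function names and the 120-second gap are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: int   # seconds since some reference point
    source: str      # host or device that raised the alert
    check: str       # what was being monitored


def deduplicate(alerts):
    """Keep only the first alert seen for each (source, check) pair."""
    seen = set()
    unique = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.source, alert.check)
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def cluster_by_time(alerts, max_gap_seconds=120):
    """Group alerts whose arrival times are within max_gap_seconds of the previous one."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if clusters and alert.timestamp - clusters[-1][-1].timestamp <= max_gap_seconds:
            clusters[-1].append(alert)
        else:
            clusters.append([alert])
    return clusters


raw_alerts = [
    Alert(0, "db01", "latency"),
    Alert(20, "db01", "latency"),        # repeat of the first alert
    Alert(45, "web03", "http_5xx"),
    Alert(60, "lb01", "pool_down"),
    Alert(50, "db01", "latency"),        # another repeat
    Alert(3600, "mail01", "queue_depth"),
]
situations = cluster_by_time(deduplicate(raw_alerts))
for group in situations:
    print([f"{a.source}:{a.check}" for a in group])
```

Even this naive version turns six raw alerts into two groups for an operator to look at. A production platform would use far richer signals than arrival time, but the goal of shrinking the haystack is the same.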
Qualcomm – A Practical Example to Deploying AIOps
Zeeshan Sabir of Qualcomm presented at the Evolve Conference 2018 and shared the company's journey of transformation. This journey involved using AIOps to help solve the big data problem for IT Operations.
As Zeeshan described it, the journey toward a next-generation automation and telemetry system began with understanding the problem in great detail and defining what specifically needed to be solved. The issues identified included:
- Silo approach to monitoring and operations
- Many disparate tools
- Exponential rise in event load
- No correlation of events
- Drowning in data and struggling to find useful insights
- Expanding MTTD (mean time to detection) and MTTR (mean time to resolution)
- Lack of knowledge sharing
- Silo-based automation and diagnostics
- Dynamic infrastructure with fast pace of change
- Cloud migration initiatives
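Since MTTD and MTTR appear on this list, it may help to see how they are typically computed. The Python sketch below uses made-up incident timestamps (in seconds) and assumes both metrics are measured from the moment the incident started; some teams measure MTTR from detection instead, so treat the definitions and field names here as assumptions rather than Qualcomm's actual method.

```python
from statistics import mean

# Hypothetical incident records; times are seconds from the start of each incident.
incidents = [
    {"started": 0, "detected": 300, "resolved": 2100},
    {"started": 0, "detected": 600, "resolved": 4200},
    {"started": 0, "detected": 900, "resolved": 2700},
]


def mean_time_to_detection(records):
    """Average delay between an incident starting and being detected."""
    return mean(r["detected"] - r["started"] for r in records)


def mean_time_to_resolution(records):
    """Average delay between an incident starting and being resolved."""
    return mean(r["resolved"] - r["started"] for r in records)


mttd = mean_time_to_detection(incidents)
mttr = mean_time_to_resolution(incidents)
print(f"MTTD: {mttd / 60:.0f} min, MTTR: {mttr / 60:.0f} min")
```

Tracking these two numbers over time is what makes "expanding MTTD and MTTR" a measurable problem rather than a feeling.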
With the problem clearly understood and agreed upon, the approach had to be defined. This boiled down to four categories:
- Take an agile incremental approach
- Service context as opposed to silo context
- Machine learning and AI to solve the big data problem
- Simplify operations
An agile and iterative approach was the best way forward. As Qualcomm learned, adjustments would be made as they navigated the transformation. This required buy-in from the business application owners as well as the technical stakeholders. The most critical applications would get the initial focus with agreement from all parties. There also had to be agreement on which KPIs and dashboards would be useful to everyone from the technical audience to the executives making business decisions.
When service impacts occurred, each silo had visibility only into its own domain. Qualcomm recognized the need to democratize the data to obtain a service context as opposed to a silo context. The silos were taken down and full data transparency was achieved.
Machine Learning – AIOps to Solve the Big Data Problem
Due to the data explosion, it was recognized that no human would be able to keep up. The old way of writing rules, defining filters and continuously tuning metrics and thresholds no longer sufficed. A futuristic product with AIOps was needed to help reduce noise and correlate alerts. The old way resulted in expanded MTTD and MTTR, and it continued to get worse day by day. The transformation to Moogsoft AIOps became a necessity as opposed to a nice-to-have.
In the old way, the workflow enabled the silo-based approach. Each team had access only to its own data, knowledge base information, diagnostic scripts and automations. It was critical to closely integrate the system of engagement (Moogsoft) with the system of record (ServiceNow). This allowed for transparency when incidents were being worked. The Moogsoft Situation Room contains a collaborative workflow capturing notes, diagnostic steps and the final remediations or resolving steps. All of the existing Chef, Puppet and Ansible automation scripts could be leveraged from within the Situation Room by the operators. During repeat issues, Moogsoft presents similar Situations to the operator. This keeps MTTR down by providing all of the notes, diagnostics and remediations without having to go searching and mining for the data.
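The idea of surfacing similar past Situations can be sketched with a simple set-overlap score. The Python example below uses Jaccard similarity over alert labels; this is my own stand-in technique for illustration, not a description of how Moogsoft matches Situations, and the situation IDs, alert labels and resolution notes are all hypothetical.

```python
def jaccard(a, b):
    """Share of alert labels that two situations have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical history of resolved situations and what fixed them.
past_situations = {
    "SIT-101": {
        "alerts": {"db01:latency", "web03:http_5xx", "lb01:pool_down"},
        "resolution": "Restarted the connection pool on db01",
    },
    "SIT-102": {
        "alerts": {"mail01:queue_depth"},
        "resolution": "Cleared the stuck mail queue",
    },
}


def most_similar(current_alerts, history, threshold=0.5):
    """Return (situation_id, score) for the best match, or None if nothing is close."""
    if not history:
        return None
    scored = [(jaccard(current_alerts, s["alerts"]), sid) for sid, s in history.items()]
    score, sid = max(scored)
    return (sid, score) if score >= threshold else None


current = {"db01:latency", "web03:http_5xx"}
match = most_similar(current, past_situations)
if match:
    situation_id, score = match
    print(situation_id, round(score, 2), past_situations[situation_id]["resolution"])
```

The operator working the new issue sees the earlier notes and the fix straight away instead of mining for them.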
The Qualcomm story provides a strong blueprint for how to approach the transformational journey for other customers. It takes a mixture of people, process and technology to make it all work out and achieve the desired results.
Watch the video by Zeeshan to get his perspective on how Qualcomm navigated this transformational journey. The video is available at the following link: https://player.vimeo.com/video/269458768
Moogsoft is a pioneer and leading provider of AIOps solutions that help IT teams work faster and smarter. With patented AI analyzing billions of events daily across the world’s most complex IT environments, the Moogsoft AIOps platform helps the world’s top enterprises avoid outages, automate service assurance, and accelerate digital transformation initiatives.