The complexity & rapid rate of change in modern networks is forcing IT orgs to reexamine their approach to service assurance.
According to a Forrester report, 74% of IT issues are detected by End Users before IT Operations. That finding coincides with the three biggest transformations in the delivery of IT-underpinned Services.
The factors that have driven customers to become the primary indicator of incidents are complex and combinative; they are the result of Operations Management decisions dating back to the 1990s.
Today, in a majority of cases:
- Operations teams react after the disruption has occurred
- Faults cause unexpected ‘cross-technology-domain’ consequences in modern fault tolerant application platforms. Single faults are rarely the root-cause
- Operations teams responding to faults in their own domain lack situational awareness of their relationship with responders to faults in other, upstream technology domains
The consequences are: delayed time to detect, increased time and effort to diagnose causality, increased time and cost of resolution (MTTR), and correspondingly increased duration and cost of Business Disruption.
The bad news is that while this ‘perfect storm’ of factors remains unresolved, IT operations will remain reactive and inefficient.
Who is to Blame? The Advent of Virtualization Technologies
Since the mid-2000s, the delivery of services underpinned by IT systems has gone through a revolution: virtualization has industrialized rapid change in Networking, Compute, Storage and now, for Managed Service Providers, virtualized network functions.
Compute fails; that is a fact. Whether the fault lies in hardware or software, things fail, and when things fail they cause disruption. Virtualization is the natural solution to the problem of IT Systems’ low tolerance to failure.
Single Root-Cause Analysis, A Thing of the Past
In classical compute architectures, whether the delivery fabric consisted of a Mainframe or an Open Systems platform, disruptions of IT-underpinned Services were caused by single faults, “the Root-Cause.”
Virtualization is the solution to protect against Single Root-Causes. The principle behind virtualization is that if some component of the IT fabric fails, another virtual element can quickly take its place, minimizing the impact upon the availability of the Service.
Things still fail, but because virtualized functions can be provisioned through automation very quickly, Services can be recovered faster, causing less disruption.
The consequence, however, of virtualization upon application owners and businesses has been revolutionary and transformational.
Industrializing Agile Change
In the past, IT change was measured in Months and Years: Business and Application owners had to predict their IT capacity requirements far in advance so that Compute could be ordered and provisioned in time. Today, Virtualization has enabled change in near real-time.
Where historically the CIO or IT Director was the gatekeeper to the consumption of IT Services, today he or she has had to become a Service Provider that can react to business change requests, or else lose the contract to deliver IT Services to other parties.
The underlying transformation to Virtualization has been the key factor that has made the traditional approach to delivering Operations Services reactive and inefficient.
However, Virtualization itself is not the problem. There are three agents responsible for inhibiting proactive Operations Support, which can be attributed across the three pillars of IT: People, Process and Technology.
- Technology: the limited tolerance to technological change
- Process: a plethora of reactive tools, but no single pane offering situational awareness
- People: the challenge of situationally enabling technology and multi-organizational silos
The Technology Transition
IT infrastructures have changed dramatically over the last several decades. In response, new technologies for monitoring and supporting changing infrastructures were built – just not at the same speed.
In the 1980s, we were dealing with monolithic suites based on extremely static hardware. They were very well-defined, top-down, had tiny event rates, and saw very little change. There were only a few vendors, and they were all huge.
So the thing that really drove the first big generational change was the distributed computing wave. We started to see mass adoption of IP and UNIX, distributed systems and client servers – and as a result we started to see problems with service quality. There were lots of difficult problems buried in debugging client systems, event rates started to spike up, and we started to see lots of change.
Where we are now is the extreme evolution of that: the monitoring framework is completely patchwork and almost random, there are over 27,000 monitoring projects on GitHub, everything is open-source and departmental, event rates are through the roof, and configuration is chaotic, with changes occurring at sub-second cadences in today’s virtualized, containerized, software-defined-everything world.
And the biggest problem is – how’s your service quality?
The big thematic changes as you move from that ‘distributed’ to the ‘modern’ compute infrastructure can be shown in 3 pillars – CMDB, root cause, and the effect that it has on MTTR and MTTD.
CMDB has gone from accurate and static to at best 80% accurate – and even that is generous. Consider what that means for singular root cause: if your configuration data is inaccurate, then even when you have identified a root cause there is a significant probability that you cannot act on it.
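To make that claim concrete, here is a back-of-envelope sketch (an illustrative assumption, not a figure from the text): if each CMDB record is independently 80% accurate, the odds that every record along a dependency chain is correct collapse as the chain grows.

```python
# Back-of-envelope only: assumes each CMDB record is independently
# 80% accurate, which is a simplifying assumption for illustration.

def chain_accuracy(per_record: float, chain_length: int) -> float:
    """Probability that every record in a dependency chain is accurate."""
    return per_record ** chain_length

for k in (1, 3, 5, 10):
    # With 80% per-record accuracy, 5 CIs -> ~33%, 10 CIs -> ~11%.
    print(f"{k:2d} CIs in the chain -> "
          f"{chain_accuracy(0.8, k):.0%} chance the data is fully usable")
```

In other words, even a "mostly accurate" CMDB gives you worse-than-coin-flip odds of a trustworthy picture once a fault traverses a handful of configuration items.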
So in a dynamic and high scale environment, root-cause doesn’t converge. You cannot use a rules based system to even get to root cause. This is why you start to hear people talk about ‘probable root cause’ and ‘maybe root cause’.
So you’ve gone from a manageable number of workloads and tickets to an unbounded number of tickets, driven by a sea of red with no context.
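One way to restore context to that sea of red can be sketched as a toy example (the timestamps, hostnames, and 30-second window below are all invented for illustration, and real algorithmic correlation also weighs topology and textual similarity, not just time): group alerts that arrive close together into a single "situation".

```python
from dataclasses import dataclass

@dataclass
class Alert:
    timestamp: float   # seconds since some epoch
    source: str        # host or service that raised the alert
    message: str

def cluster_by_time(alerts, gap=30.0):
    """Group alerts separated by less than `gap` seconds into one 'situation'.

    Deliberately naive: a stand-in for algorithmic correlation, which
    would also consider topology and message similarity.
    """
    situations = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if situations and alert.timestamp - situations[-1][-1].timestamp < gap:
            situations[-1].append(alert)   # part of the ongoing situation
        else:
            situations.append([alert])     # start a new situation
    return situations

# Hypothetical alert stream: three related bursts plus one unrelated event.
alerts = [
    Alert(0.0,   "switch-7",  "link down"),
    Alert(4.0,   "db-02",     "replication lag"),
    Alert(9.0,   "app-front", "HTTP 503 spike"),
    Alert(600.0, "backup-01", "job overran window"),
]
for situation in cluster_by_time(alerts):
    print(len(situation), "alerts:", [a.source for a in situation])
```

Instead of four disconnected red rows, an operator sees one three-alert situation (a plausible fault cascade) and one standalone event, which is the kind of context a raw event list never provides.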
So let me leave you with these questions:
Think about your environment, your monitoring, your service management tool chain. How does it look at 10x scale? Can the systems cope? Are you ready for it?
Are you automating your root cause today? I bet you’re not. I bet you’re looking at a sea of red. I bet you’re drowning in alerts. I bet your event management system isn’t really being used.
Think about how long it takes for you to commission something new into monitoring. How does that look when you’re doing 1000 new apps a day? Because that’s the world of DevOps.
The Visibility Void
Let’s go back to the statistic we began with: 74% of IT issues are detected by the End Users before Operations. A major source of this issue is our technology, as we discussed previously. Our technologies depend on an accurate CMDB (which really doesn’t exist anymore), assume a single root cause (which also doesn’t exist), and are inherently reactive by nature.
In order to improve the accuracy and speed of detecting and resolving incidents, we have seen an explosion of APM, NPM, Log Management, and other types of monitoring tools, built for each domain within IT. In fact, according to a 2015 Monitoring Survey, over 50% of organizations have 10 or more monitoring tools at their disposal. While each of these tools adds unique value, maintaining service quality and availability is becoming even more challenging because of the ever-growing visibility void across these disparate toolsets.
In reality, just because an application is impacted does not mean the application is the underlying issue, and your APM tool may not have visibility into the root-cause. The same applies to the rest of your monitoring ecosystem: each tool offers a unique perspective on your IT environment. So, with all of these unique perspectives, which is the right perspective in any given situation? How are teams supposed to look across large sets of tools to properly analyze mission-critical incidents?
IT Operations Remains Situationally Unaware
Every organization uniquely instruments their IT infrastructure, yet the problems they face are similar (although more complex and severe the larger the infrastructure). Each layer of IT has deep-rooted interdependencies and when incidents occur, a storm of operational data is presented from a disparate collection of tools. Organizations have no effective way to gain a holistic view across these toolsets and separate the signal from the noise. These are the questions that IT leaders are asking today:
- Do we have too many tools?
In reality, probably not. Those tools are all there to perform a particular function within a specific silo. You need to choose the best-of-breed tools to get the best results. Here’s a great article to shed more light on the growing trend of Composable Monitoring.
- Do we have too much data?
The truth is, you probably do. People can only process so many incidents, and burnout is inevitable. At the operator level, just tens to hundreds of events per second can burn you out. Most people tackle this by prioritizing events, meaning you stare at high-priority events and ignore the rest. Even if the rest contain information relevant to what’s going on in your environment, there’s just not enough time in the day to get to them. What’s likely to happen is that your P1 problems escalate every single time, and no P1 issues get closed by Level 0 operators.
- What about filtering? Won’t that help?
No, filtering is an outdated technology. If you use filtering to get your data to a manageable volume, you will have to filter too aggressively and your visibility gap gets even wider.
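A tiny, hypothetical illustration of that widening gap (the events and sources below are invented for the example): a severity filter that keeps alert volume manageable also discards the low-priority event that actually explains the outage.

```python
# Illustrative event stream: the root trigger arrives as a low-severity event.
events = [
    {"severity": "info",     "source": "switch-7",  "msg": "config change applied"},
    {"severity": "warning",  "source": "switch-7",  "msg": "port flapping"},
    {"severity": "critical", "source": "app-front", "msg": "checkout API down"},
]

# A typical volume-reduction filter: keep only critical events.
filtered = [e for e in events if e["severity"] == "critical"]

print("operator sees:", [e["msg"] for e in filtered])
# The config change on switch-7 -- the likely trigger of the outage --
# never reaches the operator, so the visibility gap widens.
```

The filter achieved its volume goal (one event instead of three), but the evidence needed to diagnose the incident was thrown away before anyone saw it.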
- Will this problem get better over time?
Absolutely not. You will undoubtedly be rolling out new technology all the time, and that will only add to the problem. Systems are becoming more application-oriented, and those apps generate even more data in the form of logs, which are unstructured and chaotic. There are also fast-growing shifts, like DevOps and IoT, compounding the problem.
- What about old established techniques, ITIL books, knowledge articles, etc?
The truth is, no one has time for that. Things are changing so fast that every knowledge article will be out of date before you get a chance to use it.
While these are the right questions for IT leaders to be asking, it’s time to move forward and strategize how you are going to obtain a holistic view of your infrastructure and make sense of all of your operational data. Your service quality depends on it.
When you have a variety of specialized and disparate tools built for the different domains of your IT infrastructure, the tendency has been to create organizational silos of people to manage those individual tools. But why is this such a serious problem, and how did it come to be?
In the late 1980s and early 1990s, we went through a transition from mainframe to open systems. We went from one machine to potentially hundreds or thousands of machines, and a human being could no longer watch over each one. This problem led to the creation of tools like Netcool to consolidate information from all of these consoles.
When customers or end-users would call in for IT support, they needed to talk to experts. In those early days, all of those experts sat in one room: the application guy, the network guy, and the compute guy all sat and worked together. As IT infrastructures became even more complex, however, so did support.
In order to make the user experience more pleasant and efficient, support was divided into separate tiers. A service desk was put in the middle to distribute requests and complaints to the appropriate silo. Unfortunately, IT support is not the most pleasant job (from both a lifestyle and a compensation standpoint), so the industry saw a lot of churn. As a solution, organizations turned to outsourcing.
So we pushed our support away from our users and our infrastructure. We went from a world where we owned and operated our entire infrastructure to one where we owned and operated some of it and outsourced the rest. This introduced serious communication barriers between people. Not only was the complexity of our infrastructures increasing, but these new barriers were increasing the number of major incidents to deal with, the finger-pointing between vendors and support, and the time it took to get a problem solved.
In short, we no longer own and operate all of our IT. We operate in islands of vendors, domains, and expertise.
Doesn’t ITILv3 Fix This?
Today, organizations deal with these silos and the resulting communication barriers by following the formalized processes dictated by ITILv3.
Here’s what actually happens. Support sits in front of a computer looking at a sea of alerts, trying to figure out whether any of them reflect real incidents. If one appears to be real, they react by logging into the device, running the diagnostics, and looking at the log file. If they can’t resolve the issue, they escalate the ticket to the next tier. Keep in mind that the Tier 1 person typically doesn’t know who the Tier 2 person is. Tier 1 will stick the alert description into the trouble ticket, and Tier 2 will likely go through exactly the same process that Tier 1 went through, doubling the MTTR right off the bat.
And situations where a customer complains about an incident are no better. Let’s say I call support regarding a problem with my application. That gets passed to the application team. They might look at it and say, “that’s not us”. It goes back to service desk who says, “Maybe it’s compute”. Compute might say, “that’s not us”. This cycle typically repeats itself until the issue gets escalated and everyone drops what they’re doing to jump on a bridge call and get it resolved.
What Does This All Mean?
Enterprise IT needs a transformational change. The people, processes and tools being used today are fit for a legacy environment with a completely different set of requirements and complications.
IT organizations need to invest in new technologies that were built for our modern, virtualized world and are tolerant of the complications that come along with it.
IT organizations need to use their toolsets as a cohesive whole instead of delegating them to individual teams in separated locations. They need to leverage their monitoring toolsets to gain the most holistic perspective possible to enable operations to proactively address issues as they unfold, instead of reacting to customer complaints.
Lastly, IT organizations need to break down the communication barriers by providing the right people with the right information at the right time. All relevant people need to have a consistent view of the issues that are occurring and be enabled to communicate effectively. They need to eliminate duplication of effort and promote an entire culture change where abnormalities are addressed by cross-functional teams as soon as they occur.
About the author
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.