Nearly 30 years ago, the evolution from monolithic to modular compute required innovation to scale operations. Today's monolithic to modular application evolution is no different.
When we published the blog entitled “Netcool…NotCool” a few years ago, readers thought we were being critical of our old product. But that was far from the goal of the discussion.
Our point was simply to help people recognize that Netcool’s time was (and is) over. The world had changed. The subsequent Netcool owners at Micromuse, and then at IBM, failed to innovate. While all around them, Rome was burning.
This conflagration created the opportunity for us to create Moogsoft.
A small team of us (shout-outs to Phil Tee, Richard Whitehead, Fred Mutavdzic, Chris and Angela Dawes, Ken Barth, Adam Kerrison, Mark Peach, Colin Pittham, and DJ Walker-Morgan) set up the Netcool business on the back of Phil’s original idea and inventions. Back then, we were an innovation powerhouse that trampled all over convention. Our basic premise was that the existing Service Assurance products needed years and years and years of consulting before they delivered value, and that the IT infrastructure was changing too fast for those offerings to cope.
We burst the bubble on the leading IT event management solutions at the time. We exposed them as a consulting tax on operations that added inertia to business change.
These included OSI NetExpert (remember them?), MAXM Systems (remember them?), Boole and Babbage (ahem, now BMC Truesight), HP Operations Center (ahem, now Micro Focus), and more.
Netcool’s Legacy, or (unabbreviated) Netcool Is Legacy
To us, Netcool is incredibly cool simply because, although the code line is more than 27 years old, it still contains zeros and ones input by Phil Tee himself. Even more incredible, customers are still using it.
Pretty cool legacy, huh? But that’s exactly what it is… legacy software.
As anyone who has tried to deploy Netcool (…or BMC Truesight, …or anything from CA Technologies for that matter) in a modern hybrid cloud infrastructure knows, it just doesn’t work. It’s not that you can’t get it installed or processing events. It’s simply that with high frequency change and a significant proportion of modern telemetry (e.g. application logs) having no designated severity, administering Netcool to actually produce actionable information is unsustainable.
Actually, it’s impossible.
Don’t believe me? Just ask a Nordic Telecoms Managed Service Provider about their lack of realized value when starting from scratch with Netcool in a modern infrastructure!
When you factor in application software modularization and DevOps practices, Netcool is not in the vernacular.
Unfortunately for the customers of Netcool, the people we left behind at Micromuse (and subsequently IBM) didn’t grasp that there is a need for constant innovation. Instead they turned the company into a consulting business, with Netcool as their base toolkit.
The Elephant in the Room: No More Heroes Anymore
Well not enough of them, anyway.
Ironically, the problem that we solved with Netcool back in the early 1990s is the same problem that DevOps and DevSecOps face today. Namely, buying time: how to enable the few “Heroes” to cope with the explosion of workload while also reducing the business impact of incidents.
In the early days of Netcool, we were in the midst of the transition from Monolithic Compute to Modular Compute (aka Client/Server). There were not enough skilled IT operators, either in the resource pool or in the budget, to sustain one skilled operator per console. Netcool transformed the economics of IT Operations by enabling a single skilled IT Operations Hero to assure multiple consoles, reducing resource and business impact costs.
Funnily enough, this transition was so successful that an entire standard evolved to popularize it. It’s called ITIL.
Today we are in the midst of another transition. This is the shift from Monolithic Software to Modular Software (aka DevOps and DevSecOps). As was the case 27 years ago, there are not enough DevOps Heroes to assure each and every microservice. The sheer quantity of microservices and their frequency of innovation change constantly.
The Costs of Lost Time Affecting Resources & MTTR
Application and IT infrastructure complexity has transformed end users into the default incident detection system. By the time Operations & Support are aware of any issue, they are already outside of their service level agreements and the customer is impacted by the issue.
The end-user “ticket” is invariably directed to application support who, after spending time attempting to diagnose the issue, in >65% of cases transfer the ticket to infrastructure support. From here, depending upon the diagnosis complexity, >25% of cases escalate to an all-hands war room.
Time lost can be categorized along two dimensions: Business Impact time (measured as mean-time-to-resolution, or MTTR); and Operations Resource time (measured as total hours burned by full-time employees, or FTE stakeholders).
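To make the two dimensions concrete, here is a minimal Python sketch. The incident records, field names, and numbers are invented for illustration and do not come from any real deployment. It simply averages resolution time for MTTR and sums responder hours for FTE resource time, assuming every responder is engaged for the full duration of the incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    detected: datetime   # when the issue was first reported
    resolved: datetime   # when service was restored
    responders: int      # assumption: each person is engaged for the full duration


def mttr_hours(incidents):
    """Business impact: mean time to resolution, in hours."""
    total = sum((i.resolved - i.detected for i in incidents), timedelta())
    return total.total_seconds() / 3600 / len(incidents)


def fte_hours(incidents):
    """Operations resource time: total person-hours burned."""
    return sum(
        (i.resolved - i.detected).total_seconds() / 3600 * i.responders
        for i in incidents
    )


# Hypothetical data: a small incident and a ten-person war room.
start = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident(start, start + timedelta(hours=2), responders=3),
    Incident(start, start + timedelta(hours=4), responders=10),
]
print(mttr_hours(incidents))  # 3.0
print(fte_hours(incidents))   # 46.0
```

Note how the second incident, with ten responders, dominates the resource figure even though it only doubles the resolution time. The two dimensions can move independently.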
In the case of infrastructure assurance, there will be an existing workflow in place, most likely ITIL-derived, that will need to be seamlessly integrated in order to ensure adoption and maximize efficiencies.
Today the typical incident workflow that Operations & Support employs, called “best practice ITOM and ITSM tooling and process”, actually has a net negative impact on both MTTR and resource time. It has led to the customer and end user bearing the burden of incident detection.
This has in turn led to an increase in “ticket tennis”, where an incident ticket is passed between tiers of expertise in one operations silo, then across and between operations silos, and finally to an all-out war room of highly skilled IT Heroes assembled from across the company, batting the blame between them.
The result: both full-time Operations & Support FTE resource time, and end user / business impact time, increase uncontrollably.
When the stuff really hits the fan, an incident war room is a crowded place. Those invited include infrastructure administrators, software engineers, security engineers, site reliability engineers, first level responders, and architects. When it’s really serious, no one is surprised when IT management, line-of-business stakeholders, and executive management stop by for a less-than-friendly visit.
In an attempt to reduce MTTR, modern best practice IT Operations introduces “Command Center” workflow. Once it is determined that a designated critical application is impacted, the Command Center assembles all hands from across applications and infrastructure support into a war room. The war room attempts to diagnose causality and subsequent remediation actions more quickly. Although this does enable a reduction in MTTR when compared to traditional ITIL methodologies, the Command Center has the inadvertent effect of significantly increasing the FTE resource time, with more resources engaged for a longer period.
The “Ops” in DevOps Is Efficient, Right?
Well… If we look at a typical factory-scale adoption of DevOps / DevSecOps best practice and microservices, the insular nature of the operations best practice leads to a lack of situation awareness across interdependent teams.
Software engineers spend an increasingly frustrating amount of their time, often overnight, interrupted from Continuous Integration & Delivery (CI/CD) in order to waste time on operations. All this effort is expended only to work out, after much diagnostics time, that their respective microservice is not the cause of the problem. And again, all this often occurs only after the end user has detected the incident.
Apart from the costs we’ve discussed in relation to sunk operations and business impact, there is another issue that’s rarely mentioned: Churn.
In case you haven’t noticed, the job of Operations & Support has gotten harder and become less fun. The compute stack has increased in complexity from the bottom to the top. Typical L1 and L2 IT operator chairs have a very rapid churn. Industrial DevOps means software developers work in a shark tank, with Ops problems increasingly pulling them away from the coding that they love and were employed to perform.
AIOps to the Rescue? Perhaps…
Is solving the economic problems of operations resource time, business impact, and staff churn possible with AIOps technologies?
Yes, but only if you implement the right kind of AIOps. AIOps can help reduce your MTTR, the resources required to assure your services, and the employee churn degrading your resources.
The right kind of AIOps can also enable you to sustain a higher frequency CI/CD cadence, regardless of the increase in software modularity.
For AIOps to be effective, we must be cognizant of two target use cases:
- Application stack & hybrid cloud infrastructure assurance
- DevOps & CI/CD
The reality of event data streams in both use cases is that:
- Single faults rarely cause significant impact.
- The behavior that causes adverse application problems is constantly changing.
- The rate of behavior change aligns with the frequency of application and infrastructure change.
- Application and infrastructure telemetry messages (i.e. the events that can indicate incident causality and impact) are constantly changing.
- Application and infrastructure telemetry messages often also lack any kind of categorization of problem severity.
- The majority of the telemetry messages (in excess of 90%) have no value to Operations & Support.
- This results in a Big Data landfill store of alert information.
- The low quality of this data makes it difficult to gain enough insight to resolve problems quickly.
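As a rough illustration of why severity-free telemetry turns into a landfill, and of one generic way to start digging through it, here is a minimal Python sketch. The log messages are invented, and the heuristic (collapse numbers into a placeholder, then rank message templates by rarity) is a common textbook approach assumed for this example, not a description of any vendor’s algorithm.

```python
import re
from collections import Counter


def template(message):
    """Collapse variable parts (numbers) so repeats of one message group together."""
    return re.sub(r"\d+", "<n>", message)


def rarest_first(messages):
    """Return (template, count) pairs, least frequent first."""
    counts = Counter(template(m) for m in messages)
    return sorted(counts.items(), key=lambda kv: kv[1])


# Hypothetical stream: no message carries a severity field.
stream = [
    "heartbeat ok node 1",
    "heartbeat ok node 2",
    "heartbeat ok node 3",
    "disk latency 950 ms on node 2",
]
ranked = rarest_first(stream)
print(ranked[0])  # ('disk latency <n> ms on node <n>', 1)
```

Rarity is only a crude stand-in for importance, but it shows the shape of the problem: with no designated severity, the interesting message has to be found by its behavior rather than by a label.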
Moogsoft AIOps to the Rescue? Actually, Yes.
Here at Moogsoft our nearly 250-strong Herd has innovated a modular, patented AIOps platform to enable DevSecOps Heroes to assure all their microservices in the face of relentless CI/CD pressure. We now have 50+ AI patents… and counting. We continue to innovate every hour of every work day, with software engineers on three continents.
This is all well and good, but frankly we didn’t start Moogsoft to create intellectual property patents.
We founded Moogsoft to answer an economic problem that faces every IT Ops team and DevOps focused application owner, in every enterprise, worldwide. Namely, how to master the most elusive of variables impeding rapid resolution: time.
The patented, modular Moogsoft AIOps platform reduces sunk time along both dimensions: Operations and Support resource time; and MTTR–the best measure of business impact on users and customers.
Moogsoft AIOps performs the following actions:
- Surfaces the signal from any noise-polluted event telemetry data stream
- Detects critical incidents earlier
- As the incident is evolving, creates a narrative cluster of alerts by analyzing timeline, topology and semantic alert data
- Notifies appropriate incident stakeholders–the causal diagnosis team as well as all impacted parties
- Enables Operations & Support resources to act earlier
- Enables Operations & Support resources to take appropriate action–such as diagnostics, remediation, or business continuity management
- Captures behavior and knowledge with every incident resolution process using machine learning
- Recycles this analysis as Predictive Insights that improve the orchestration of human and AI automation resources
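To give a feel for what clustering alerts by timeline and topology can mean, here is a deliberately simple Python sketch. It is not Moogsoft’s patented method: the `Alert` fields, the `NEIGHBORS` map, the 60-second window, and the greedy grouping rule are all assumptions made for illustration, and semantic similarity of the alert text is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    ts: float    # seconds since an arbitrary start time
    host: str
    text: str


# Hypothetical topology: which hosts are directly connected.
NEIGHBORS = {
    "web1": {"db1"},
    "db1": {"web1"},
    "mail1": set(),
}


def related(a, b, window=60.0):
    """Two alerts are related if close in time and on the same or adjacent hosts."""
    close_in_time = abs(a.ts - b.ts) <= window
    close_in_topology = a.host == b.host or b.host in NEIGHBORS.get(a.host, set())
    return close_in_time and close_in_topology


def cluster(alerts):
    """Greedy single pass: join the first cluster that holds a related alert."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a.ts):
        for group in clusters:
            if any(related(alert, member) for member in group):
                group.append(alert)
                break
        else:
            # No existing cluster is related, so start a new one.
            clusters.append([alert])
    return clusters


alerts = [
    Alert(0.0, "db1", "slow queries"),
    Alert(20.0, "web1", "HTTP 500 rate rising"),
    Alert(30.0, "mail1", "queue length high"),
    Alert(500.0, "web1", "certificate expires soon"),
]
groups = cluster(alerts)
print([[a.text for a in g] for g in groups])
```

In this toy data the database and web alerts land in one cluster because they are 20 seconds apart on connected hosts, while the mail alert and the much later certificate alert each stand alone. A greedy pass like this never merges two clusters after the fact, which is one of many reasons a production system would need something more sophisticated.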
When compared to traditional ITIL incident management, Moogsoft AIOps drives situation-aware workflows. On average, these can cut resource costs by >80% and business impact costs by >60%.
When compared to ITIL Command Center incident management, Moogsoft AIOps can cut resource costs by >80% and business impact costs by >40%.
When Moogsoft AIOps is applied to a DevOps workflow, average software developer productivity increases by >25% while simultaneously speeding CI/CD frequency.
But that’s not all… in classic Steve Jobs fashion we’ve got “one more thing” to share with you.
In our quest to solve the IT Operations economics problem of digital transformation, we also set out to bring joy to operators by making their job more fun. The Moogsoft AIOps platform offers earlier warning, situation awareness, and collaborative workflow between teams. We’ve enabled Operations & Support to be aware of an issue before it becomes business impacting, diagnose the causality, and remediate more quickly–often before customers and end users are impacted.
Moogsoft AIOps is the only AIOps solution to bring economic value and joy to the world!
About the author
An expert in IT operational management and technology commercialization, Mike launched SunNet Manager in the UK for Sun Microsystems before founding an open systems service management business at Micromuse, where he brought several innovative service management tools into the European market (such as Remedy) and established key OEM relationships (Cisco, HP, Intel) that led to successful IPOs for both Micromuse and RiverSoft. Today, Mike is focused on scaling Moogsoft by overseeing strategic business relationships with key partners around the globe.