In this blog’s first part, we discussed how competitive pressure on businesses has forced application developers and infrastructure providers to become agile and adopt modern technologies and processes.
This has massively complicated the support of IT environments, creating a need for adopting AIOps to:
- Identify incidents earlier
- Make relevant stakeholders situation aware
- Pinpoint causality and resolution actions
Here in part 2 we focus on the need for AIOps to increase the productivity of DevOps software engineers and the frequency of the modular software development lifecycle.
No Need for AIOps with Agile, Right?
Urban mythology has it that our migration to DevOps is the panacea that underpins IT’s ability to sustain an agile business.
And yes, in a perfect world this could be true. But the reality does not deliver upon the myth.
Although modularization undoubtedly increases the rate at which new features are added to a given service, there is an elephant in the room: The more we modularize our software and increase the number of APIs / DevOps teams, the more our software developers’ productivity decreases.
It is not difficult to understand why. We already know that the end-users have become the incident detection system. This remains the case for modularized software applications.
But now we’ve multiplied the number of silos of operations — not at the infrastructure level, but above the infrastructure line where each DevOps team is a silo unto itself. By their nature, DevOps / modularized software teams do not need awareness of other teams. They just need awareness of what APIs are available to them to be utilized by their own (micro)service — or to provide to other services.
There are two problems with the “Ops” in DevOps: Excessive application-performance alerts and tickets raised by end users.
The problem for a DevOps team is deciding which metrics to assign an alert threshold to, and what threshold should trigger an alert. The decision comes down to experience and trial and error. When the trial-and-error wheel spins quickly, the result is too many alerts.
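To make the threshold problem concrete, here is a minimal sketch of how the choice of threshold changes the number of alerts raised from the same data. The latency samples and the candidate thresholds are invented for illustration; they are not measurements from any real service.

```python
# Hypothetical latency samples in milliseconds, one per minute.
latency_ms = [120, 135, 180, 210, 150, 320, 140, 260, 190, 450]


def count_alerts(samples: list[float], threshold: float) -> int:
    """Count samples that exceed the threshold, one alert per breach."""
    return sum(1 for value in samples if value > threshold)


# Try a few candidate thresholds against the same samples.
for threshold in (150, 200, 300):
    print(f"threshold {threshold} ms -> {count_alerts(latency_ms, threshold)} alerts")
```

A threshold set too low pages the team for normal variation; one set too high leaves the end users to raise the alarm first.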
We know that when the end-user raises the alarm to report an application-impacting incident, the ticket (or notification) is assigned to the application support team. In the case of DevOps teams, when the alarm is raised by the end users, all the DevOps teams are woken up.
DevOps teams can comprise four or more people. It’s important to look at how a DevOps team works when an alert page or Slack notification is received: the whole team jumps to act. Even if, more than 80% of the time, their application or service is not the cause of the issue, every triggered alert (or ticket) still pauses software development activity (or sleep!) for 10 to 15 minutes. If the team is paged 20 times per day, that is 20 pages multiplied by 10 minutes multiplied by 4 people: more than 13 hours of lost software development productivity per day, per DevOps team.
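The arithmetic can be checked with a short sketch. The function name and parameters are illustrative; the figures (20 pages per day, 10 to 15 minutes per interruption, a team of 4) are the ones used in the paragraph above.

```python
def lost_hours_per_day(pages_per_day: int, minutes_per_page: float, team_size: int) -> float:
    """Return developer hours lost per day when every page interrupts the whole team."""
    total_minutes = pages_per_day * minutes_per_page * team_size
    return total_minutes / 60


# Lower end of the range: 20 pages, 10 minutes each, 4 people.
low = lost_hours_per_day(20, 10, 4)
# Upper end of the range: 15 minutes per interruption.
high = lost_hours_per_day(20, 15, 4)
print(f"{low:.1f} to {high:.1f} hours lost per day")
```

At 10 minutes per interruption the loss is 800 person-minutes, a little over 13 hours; at 15 minutes it rises to 20 hours.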
Now, each DevOps team is an island unto itself, unaware of its peers’ activities and communicating with them only through a series of APIs. If one team deploys new code that introduces errors, the knock-on impact is not limited to the neighboring microservices that consume its APIs. Those microservices expose APIs of their own, so the errors in turn reach the microservices further downstream that consume them.
That means that one buggy ‘continuous deployment’ can impact a whole set of peer and downstream DevOps teams. They will all receive alerts, and they will all ‘wake up’ and attempt to diagnose their issue (once they are aware of it, mostly via a call from end users).
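The knock-on effect can be sketched as a walk over a graph of API consumers. The service names and the dependency map below are hypothetical, chosen only to show how one faulty deployment reaches teams several hops away.

```python
from collections import deque

# Hypothetical dependency map: each key is a microservice, each value lists
# the services that consume its API. All names are invented for illustration.
CONSUMERS = {
    "pricing": ["cart", "search"],
    "cart": ["checkout"],
    "search": [],
    "checkout": ["notifications"],
    "notifications": [],
    "profile": ["notifications"],
}


def impacted_services(faulty: str, consumers: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk from the faulty service to every downstream consumer."""
    seen = {faulty}
    queue = deque([faulty])
    order = []
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order


# A bad deployment to "pricing" reaches four other teams in this example.
print(impacted_services("pricing", CONSUMERS))
```

In this made-up topology, four teams that did nothing wrong are alerted, and each of them starts its own investigation.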
The result is illustrated in the graph below.
The first team to become aware of an application performance incident will begin to investigate, eventually working out that the cause does not lie with their microservice. As time goes on, other teams will become aware of the issue, either organically or via ‘word of mouth’, and each of those teams will carry out its own introspective diagnostics.
Ultimately the causal stakeholder team will become aware of the incident and then diagnose that they are the causal party. Hopefully within a reasonable timeframe they’ll either roll back to an earlier, known quality release or remediate the issue in their codeline.
Either way, a significant amount of time and software development productivity can be wasted in the diagnostics of causality across DevOps silos. As illustrated in the graph, the time to detect the incident begins ticking after end users have become impacted.
Moogsoft and AIOps to the rescue? Of course!
As discussed in part 1, Moogsoft came about to solve the two primary problems of:
(a) detecting incidents earlier, as they are evolving; and
(b) bringing situation awareness to the appropriate stakeholders for a given incident, enabling them to take the appropriate action.
With Moogsoft, causal parties can diagnose and remediate faster, and collateral parties can become aware more quickly and notify their consumers and end users that there is an incident. Those end users, now aware of the incident, can then enact their business continuity processes.
This is as true and consumable for DevOps practitioners as it is for IT Operations, IoT and security practitioners.
In the case of a DevOps team, Moogsoft enables each DevOps team to become aware of an issue relating to their microservice very quickly. However, instead of the whole team being ‘woken’ to act, one ‘on-call’ team member can be notified and, using Moogsoft’s contextual alert cluster, can quickly work out “am I the cause or is the cause external to me?”
If the team member works out that they are the causal party, wake the team.
If the cause is external, then trigger the business continuity plan but don’t wake the team.
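The on-call decision described above can be sketched as a small routing function. This is not Moogsoft’s API or algorithm: the `Alert` type, the `triage` function and the rule that the earliest alert in a cluster marks the likely cause are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str      # microservice that raised the alert
    timestamp: float  # seconds since the epoch


def triage(cluster: list[Alert], my_service: str) -> str:
    """Decide the on-call action from a cluster of related alerts.

    Assumption (not from the source): the earliest alert in the cluster
    points at the likely cause.
    """
    if not cluster:
        return "no_action"
    earliest = min(cluster, key=lambda alert: alert.timestamp)
    if earliest.service == my_service:
        return "wake_team"              # we are the causal party
    return "start_continuity_plan"      # cause is external; let the team sleep


# Hypothetical cluster: "pricing" failed first, "cart" and "checkout" followed.
cluster = [
    Alert("cart", 1010.0),
    Alert("pricing", 1000.0),
    Alert("checkout", 1025.0),
]
print(triage(cluster, "pricing"))
print(triage(cluster, "cart"))
```

Only the on-call member of the causal team escalates; the on-call members of the collateral teams trigger their continuity plans and leave their colleagues undisturbed.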
In this way, Moogsoft can increase software developer productivity by >25% and at the same time: reduce business impact, increase the quality of the customer experience, and increase the frequency of the CI/CD software development lifecycle.
About the author Mike Silvey
An expert in IT operational management and technology commercialization, Mike launched SunNet Manager in the UK for Sun Microsystems before founding an open systems service management business at Micromuse, where he brought several innovative service management tools into the European market (such as Remedy) and established key OEM relationships (Cisco, HP, Intel) that led to successful IPOs for both Micromuse and RiverSoft. Today, Mike is focused on scaling Moogsoft by overseeing strategic business relationships with key partners around the globe.