Without the right tools to properly address issues as they arise, DevOps teams are doomed to recurring incidents.
One recurring and very frustrating problem in my life is my inability to remember the passwords to my dozens of online accounts. I’m constantly locked out of services like my Chase Bank account or Amazon Prime and waste time going through the “Forgot My Password” process. The solution is simple, to say the least, but I somehow end up going through the same process over and over again – much like Bill Murray in Groundhog Day.
Enterprise IT is not so different: most enterprises today deal with an average of 10 to 20 P1 incidents each month.
What’s scary is that many of these are repeat occurrences. You might think that it would be easier to deal with and prevent incidents that have been seen before, but that’s not always the case. The reason is that, instead of taking the time to manually document the root cause(s) of resolved incidents and the remediation steps taken, Ops teams normally just restart servers and move straight on to the next incident. Furthermore, there’s no system in place capable of recognizing repeat incidents, which forces Ops teams to start from scratch each time.
One of our customers, prior to using Moogsoft, admitted that the majority of their incidents were resolved by restarting servers, which, as you know, is a temporary workaround until the exact same problem happens again. You can imagine how problematic this was. The challenge today isn’t simply a lack of knowledge or of a knowledge base. Many modern incident management tools now include some knowledge capture capabilities. The real challenge lies in finding and documenting the root cause(s) so that recurring incidents can be resolved in a timely fashion and, ideally, prevented altogether.
What if there were an easy, automated way for teams to document root cause(s) so that they are readily available later, when various support teams collaborate on new incidents? What if event managers (acting as the “monitor of monitors”) could also detect incidents and flag them as repeating?
How Moogsoft Applies Situation Scoring to Find Repeating Incidents
Moogsoft analyzes and correlates all events in real time from your production applications, infrastructure and monitoring tools to form Situations (aka incidents) when anomalies occur. These Situations are managed via a Situation Room (aka Virtual War Room) that allows Dev and Ops teams to discuss, troubleshoot, collaborate and share knowledge on how best to resolve a Situation. When a Situation occurs, one aspect of Moogsoft’s machine learning analyzes all related events and presents a list of past Situations with significant degrees of similarity.
By detecting Situations with a high probability of similarity, Moogsoft is able to recommend root causes from previous Situations via its Knowledge Base. It can also search external internet resources or local knowledge bases for related information to assist in the Situation Room.
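To make the idea of scoring similarity between Situations more concrete, here is a minimal sketch in Python. It is not Moogsoft’s algorithm, which this post doesn’t describe; it assumes each Situation can be reduced to a set of event signatures and uses plain Jaccard similarity as a stand-in. The `Situation` class, its fields, and the `rank_similar` helper are all hypothetical names invented for illustration, and the 0.8 default simply mirrors the 80% example used later in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Situation:
    """Hypothetical, simplified incident record (not Moogsoft's data model)."""
    situation_id: str
    # Assumption: each event is reduced to a (monitoring source, event type) pair.
    event_signatures: frozenset


def similarity(a: Situation, b: Situation) -> float:
    """Jaccard similarity: shared signatures divided by all distinct signatures."""
    union = a.event_signatures | b.event_signatures
    if not union:
        return 0.0  # two empty Situations have nothing in common to compare
    return len(a.event_signatures & b.event_signatures) / len(union)


def rank_similar(current, history, threshold=0.8):
    """Return (score, past_situation) pairs scoring above threshold, best first."""
    scored = [(similarity(current, past), past) for past in history]
    matches = [pair for pair in scored if pair[0] > threshold]
    return sorted(matches, key=lambda pair: pair[0], reverse=True)
```

Calling `rank_similar(new_situation, past_situations)` would return only the past Situations that overlap heavily with the new one, which is the kind of shortlist a team could then check against recorded root causes. A real system would likely weigh events by importance rather than treating every signature equally.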
Moogsoft Knowledge Base
Chances are that your organization already shares knowledge proactively, but it’s likely spread across a range of notes, documents and emails. One of Moogsoft’s core capabilities is the capture of knowledge for reuse. As previously mentioned, Moogsoft’s Situation Room serves as a single location for Dev and Ops teams to communicate and collaborate on remediation. By automatically notifying and gathering the appropriate domain experts for each incident that occurs, Moogsoft ensures that knowledge is shared and captured as soon as the resolution conversation begins. The entire conversation (as well as documents shared, resolution steps taken and third-party tool interactions executed through ChatOps functions or elsewhere) is archived within the Situation Room – available for future reference.
To further illustrate this, let’s imagine a Situation was created for an application outage that happened Monday morning at 9am. Within the Moogsoft Situation Room, a user can click on the ‘Knowledge’ tab for this Situation to see if any previous Situations had a high degree of event similarity (e.g. > 80%). By clicking on a similar Situation, the user can view a full narrative and related discussions to understand what happened, and the appropriate root cause/resolution steps that were taken, thus reducing the amount of time Dev & Ops spend investigating.
As it turns out, this Monday morning outage was a repeat Situation caused by a database batch job that overran the night before and into early business hours. Moogsoft correlated high database CPU events from Oracle OEM with application database connectivity events from Splunk and failed business transaction events from AppDynamics. Fortunately, a similar Situation had occurred three weeks previously, when the same database batch job overran and caused an outage on the same application. The resolution was simply to pause the batch job until midnight the next day. This is an example of how machine learning, social collaboration and a knowledge base can help reduce mean-time-to-detect (MTTD) and mean-time-to-resolution (MTTR).
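For readers who want a feel for what correlating events across tools might involve, here is a rough Python sketch. It is an illustration under stated assumptions, not Moogsoft’s correlation logic: it assumes every event carries a `service` field naming the affected application, and it groups events on the same service that arrive within a fixed time window of each other. The `Event` class, the `correlate` function and the 15-minute window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified monitoring event."""
    source: str          # monitoring tool that raised the event
    service: str         # assumption: the affected application or service
    description: str
    timestamp: datetime


def correlate(events, window=timedelta(minutes=15)):
    """Group events on the same service that arrive within `window` of the
    previous event in that service's latest group. Returns a list of groups."""
    groups = []
    latest_group = {}  # service name -> index of its most recent group
    for event in sorted(events, key=lambda e: e.timestamp):
        idx = latest_group.get(event.service)
        if idx is not None and event.timestamp - groups[idx][-1].timestamp <= window:
            groups[idx].append(event)
        else:
            # No recent activity on this service: start a new candidate group.
            groups.append([event])
            latest_group[event.service] = len(groups) - 1
    return groups
```

Fed the Oracle OEM, Splunk and AppDynamics events from the Monday morning example, all tagged with the same service and a few minutes apart, this rule would place them in a single group, while an unrelated event on another service would land in its own. A simple time window is a crude stand-in chosen for brevity; it will miss related events that are tagged with different services.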
The End of Recurring Incidents
Much like my inability to remember passwords, allowing recurring incidents to impact enterprise IT performance is nonsensical. Moogsoft’s centralized knowledge base, combined with intelligent Similarity Scoring capabilities, enables teams to quickly identify recurring incidents and immediately see the likely root cause(s) and impact. The result is that MTTR is slashed and recurring incidents are prevented.
With Moogsoft, every day doesn’t have to be Groundhog Day.
About the author
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.