In the world of IT infrastructure, no matter how many precautions we take, outages will happen. Not only can they cost the business dearly, but they can also be a blow to the pride of the IT professionals tasked with keeping networks, systems, and applications online.
While we cope with the reminder to stay humble, hindsight gives us a refreshing view of the steps we could have taken to avoid the situation. It also teaches us how to prevent, or at least minimize, the one thing that is as sure as death and taxes: the potential for another outage…
The best way to achieve success with outages is to avoid them in the first place.
Identify Mission-Critical IT Assets
As the ultimate purpose of IT is to support business functions and be a growth-enabler, it’s essential that IT departments not only work hand in hand with business functions, but also have a clear understanding of business priorities, critical initiatives, and supporting systems.
With the help of representatives from the business functions, start by determining which systems and/or applications are critical for your business. In the case of a widespread outage, you have limited resources, limited time, and the critical imperative to make sure that vital functions can be restarted as soon as possible. If your core business is manufacturing, you want to have your production lines restarted first; if it’s civil aviation, you want your reservation system to be recovered without delay, and so on.
These critical business functions, along with the underlying IT systems, processes, and people dependencies, should ideally be documented in a Business Continuity Plan (a playbook for how the business is impacted by an outage, and how to continue doing business through a crisis), which is backed by a reasonable Disaster Recovery Plan (the emergency steps to be taken immediately in order to recover from an outage). When I say “reasonable,” I mean that the activities taking place during the first phase (disaster recovery) follow good judgement: if a factory burns to the ground, attempting to restore manufacturing line control servers may be an act of good sportsmanship, but it’s unlikely to have much impact on the fact that production lines were destroyed.
Communication is Key
IT management and personnel across support teams should be aware of which systems are critical and should prioritize accordingly. Some companies use CMDB attributes to classify which applications are critical and, based on these criteria, monitoring systems can let IT employees know that those systems require higher-priority treatment.
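As a rough illustration of how a criticality attribute could drive alert handling, here is a minimal Python sketch. The CMDB extract, the criticality labels, the priority scale, and every name in it are hypothetical and chosen only for the example; a real CMDB or monitoring tool will expose this information differently.

```python
from dataclasses import dataclass

# Hypothetical CMDB extract: application name -> criticality attribute.
CMDB_CRITICALITY = {
    "reservation-system": "mission-critical",
    "intranet-wiki": "low",
}

# Hypothetical mapping from criticality to alert priority (1 = most urgent).
PRIORITY_BY_CRITICALITY = {
    "mission-critical": 1,
    "high": 2,
    "medium": 3,
    "low": 4,
}

# Assumption: applications missing from the CMDB get a middle priority.
DEFAULT_PRIORITY = 3


@dataclass
class Alert:
    application: str
    message: str


def alert_priority(alert: Alert) -> int:
    """Look up the application's criticality and translate it to a priority."""
    criticality = CMDB_CRITICALITY.get(alert.application)
    return PRIORITY_BY_CRITICALITY.get(criticality, DEFAULT_PRIORITY)


def triage(alerts):
    """Return alerts ordered so the most critical applications come first."""
    return sorted(alerts, key=alert_priority)
```

Calling `triage()` on a mixed batch of alerts puts anything raised by `reservation-system` ahead of alerts from less critical applications, which is the behavior the CMDB classification is meant to enable.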
It helps to look on the bright side: outages often have the virtue of exposing previously unforeseen gaps in processes and communication. While it’s reasonable to expect that, during an outage, all the involved stakeholders will show the proper sense of concern and urgency, it’s nevertheless worthwhile to ensure that emergency contacts and communication / escalation matrices are up to date and regularly reviewed.
Beyond these aspects, some companies have specific communication channels such as emergency meetings that all stakeholders (business and technical people) are invited to join in case of critical outages. Those often happen in parallel with technical calls where Subject Matter Experts work together to recover from the outage. IT Management should identify those SMEs (and, where possible, their deputies) and ensure they are easily reachable by other SMEs, especially for complex outages involving cross-silo collaboration.
Train Employees and Avoid Human Errors
IT personnel should be adequately trained not only to handle outages swiftly, but also to prevent them from ever happening. Many avoidable outages are caused by trivial human errors — especially during implementation of Change Requests or Work Orders.
Here are some ways to reduce the error rates:
- automate repeatable tasks
- implement four-eyes control (checker-doer) for complex tasks or one-off changes
Four-eyes control is one way to prevent things from going wrong. Each step of a change / maintenance plan / activity is handled by two engineers: one performs the actual implementation, while the other reviews the technical steps.
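Four-eyes control can also be enforced by tooling rather than by convention alone. The Python sketch below shows one possible gate, assuming a hypothetical `ChangeStep` record with a `doer` and an optional `checker`; it is an illustration, not a description of any particular change-management product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeStep:
    description: str
    doer: str                      # engineer who implements the step
    checker: Optional[str] = None  # engineer who reviewed it, if any


def ready_to_execute(step: ChangeStep) -> bool:
    """A step passes only if someone other than the doer has reviewed it."""
    return step.checker is not None and step.checker != step.doer


def blocked_steps(plan):
    """List the descriptions of steps that still lack an independent review."""
    return [step.description for step in plan if not ready_to_execute(step)]
```

A change plan would be cleared for implementation only when `blocked_steps()` returns an empty list, so a step that nobody reviewed, or that the doer reviewed alone, holds up the change.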
While these tips should help with human errors, IT personnel still must handle the bulk load of multiple alerts and warnings flowing from infrastructure components and monitoring systems to their mailboxes.
In a previous post (5 Tips for Modernizing Enterprise IT), we highlighted the prevalent problem of alert fatigue in large enterprise environments. The constant flow of alert emails, some of which represent only transient conditions, makes humans less and less receptive to them. Alerts land in specially created subfolders where, after a while, they start piling up by the hundreds, and are often left unattended.
Even the most seasoned IT personnel will eventually succumb to this burden unless a solution can be put in place to filter meaningful events out of the background noise of incessant alerts.
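To make the idea of filtering out transient conditions concrete, here is a deliberately simple Python sketch. It assumes a hypothetical stream of `(timestamp, key, state)` tuples in which `state` is either `"raise"` or `"clear"`, and it only surfaces alerts that are still open after a hold period. Real noise-reduction tools use far more sophisticated techniques; this only illustrates the principle.

```python
def persistent_alerts(events, now, hold_seconds=300):
    """Return keys of alerts still open at `now` for at least `hold_seconds`.

    `events` is an iterable of (timestamp_seconds, key, state) tuples,
    where state is "raise" or "clear". The tuple layout is an assumption
    made for this example.
    """
    open_since = {}
    for timestamp, key, state in sorted(events):
        if state == "raise":
            # Keep the earliest raise time if the alert is re-sent.
            open_since.setdefault(key, timestamp)
        elif state == "clear":
            # The condition went away, so forget the alert.
            open_since.pop(key, None)
    return sorted(
        key for key, since in open_since.items()
        if now - since >= hold_seconds
    )
```

With a five-minute hold, a CPU spike that clears after thirty seconds never reaches an engineer's mailbox, while a disk alert that has been open for several minutes does.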
Unleash the Power of AIOps Technology
Consider implementing an algorithm-based solution, such as Moogsoft AIOps. AIOps is an emerging class of intelligent, AI-based IT monitoring systems with deep-learning mechanisms that integrate tightly with existing monitoring solutions and ITSM systems. They use complex algorithms to eliminate noise and false positives while performing real-time monitoring across all enterprise IT systems.
Businesses that implement AIOps benefit from the following technological advances:
- Timely detection and correlation are critical in preventing outages. Moogsoft AIOps’ unique ability to spot seemingly unrelated issues, correlate them, and present them through a single interface to operators is crucial, especially in complex interdependent enterprise IT systems where an apparently innocuous issue may have cascading effects on other systems with potentially catastrophic consequences.
- When correlated issues are found, Moogsoft AIOps creates a new “Situation.” Situations are an aggregated view of multiple issues / events which have a common root cause. IT personnel can access all the relevant information (monitoring events, ITSM incident tickets, etc.) in a consolidated view in the Moogsoft Situation panel.
- Beyond communication and escalation matrices, Moogsoft AIOps knows which SMEs have previously worked on similar issues and brings them together to assist in the resolution of a given Situation. All involved parties can work together and chat directly from the Situation panel within the Moogsoft application.
- As Moogsoft AIOps handles more Situations over time, its learning engine improves detection rates and correlation between issues, smooths the process of inviting SMEs, and suggests appropriate resolution steps to the resolving teams based on experience from similar past Situations.
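To give a feel for what correlating events means in practice, here is a toy Python sketch that groups events affecting the same business service when they arrive close together in time. This is not Moogsoft's algorithm, which the article does not describe; the `Event` fields, the `service` attribute, and the two-minute window are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: int  # seconds since some reference point
    host: str
    service: str    # hypothetical tag linking a host to a business service
    message: str


def group_events(events, window_seconds=120):
    """Group events that share a service and occur within the time window.

    Each returned group is a list of events, loosely comparable to an
    aggregated view of related issues.
    """
    groups = []
    open_group = {}  # service -> index of its most recent group
    for event in sorted(events, key=lambda e: e.timestamp):
        index = open_group.get(event.service)
        if (index is not None
                and event.timestamp - groups[index][-1].timestamp <= window_seconds):
            groups[index].append(event)
        else:
            groups.append([event])
            open_group[event.service] = len(groups) - 1
    return groups
```

Even this naive grouping turns a web server alert and a database alert for the same service into a single item for the operator to look at, instead of two apparently unrelated notifications.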
While people and communication / processes should save the day, the reality of operating large IT environments is that there is just too much going on for humans to cope with. Seasoned IT personnel are sometimes able to see correlation points between two apparently unrelated issues, but this is limited to their awareness of a given context and their previous experience with similar issues.
In larger environments, it gets tricky for humans to maintain full contextual awareness across all systems. One of the most important things to do before your next outage is therefore to consider implementing a solution driven by algorithmic intelligence.
This technology is disrupting and redefining the enterprise IT monitoring market, while allowing customers to gain a clear advantage over their competitors by:
- vastly reducing avoidable outages thanks to proactive monitoring and timely correlation of events
- avoiding loss of income and productivity related to outages
- increasing uptime and systems stability
- increasing IT personnel productivity and shortening reaction times
- eliminating alert fatigue and tedious manual tasks
If you grew up in the 1980s, you may have been an avid fan of the movie WarGames. If you don’t know it, here’s the spoiler: After narrowly avoiding a full-scale nuclear war between the USA and the Soviet Union, one of the protagonists of the movie, the mainframe computer, concludes: “A strange game. The only winning move is not to play.”
If we were to draw an analogy, we could say that “the best way to achieve success with outages is to avoid them in the first place.”
About the author Max Mortillaro