There are 4 critical functions to consider when running war rooms effectively.
As the digital transformation of companies gathers pace, and as more services are underpinned by complex and ever-changing IT solutions, the need to react quickly to any service interruption has never been more critical. In today’s IT environment, the war room is alive and well (by “well” I mean it exists as it did decades ago), though it runs the gamut from well-run collaboration environments to contentious meetings full of finger-pointing.
When faced with an incident of significant impact, the war room provides a space where all the people needed to analyze and fix an issue can collaborate quickly to restore services. There can be different approaches to how this is done. Large organizations may have a dedicated team of incident managers, while others may rely on the same teams whose services are impacted.
There can be physical war rooms as well as virtual war rooms or a hybrid of the two. Many teams may be required from different areas of an organization, or it may be a handful of individuals who can run an efficient war room.
The stakes are high. Today’s consumers have a choice in the services they use, and with customer interactions increasingly moving towards digital offerings, the human face of a company becomes less and less important. What matters now is uptime and functionality. If you want to check your bank balance you expect to be able to do this at any time, using any device, from anywhere. Booking a flight or a hotel now takes minutes. Ordering food delivery (or almost anything) can also be done at the click of a button. Soon drones may drop off your purchases at the front door. As a result, IT Operations teams need to ensure they are keeping services online without interruption.
Fortunately, automation is helping IT Ops teams restore services faster than ever before by identifying the probable root cause of an outage and correlating related events into a single situation. With or without automation, war rooms need structure, organization, and rules of engagement. This helps all participants to remain objective and make the right decisions while under pressure.
Allow me to share some best practices from my experience of running and participating in hundreds of war rooms.
There are four critical functions that must be considered to run war rooms effectively.
The first step, communication, is the most important: alert both internal and external stakeholders that there is an IT issue. Once the war room leader or leaders have been decided, information should be gathered, including the context of the IT issue, an initial overview with impacts, any workarounds, the ETA for a fix, and the timeline for upcoming communications. Internal communication should focus on alerting other IT teams, with management and business users providing information. External communication should be directed to the impacted users and customers. There may be dedicated teams who deal with this, and specific channels may be used for distributing clear and detailed messages.
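To make the contents of an initial communication concrete, here is a minimal sketch of a status update record that carries the items listed above: context, impact, workarounds, ETA for a fix, and the time of the next update. The class name, field names, and message layout are hypothetical, and leaving the technical context out of external messages is an assumption made for illustration, not a rule from any particular tool or process.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class StatusUpdate:
    """One war room communication. All field names are illustrative."""
    context: str                       # what is known about the IT issue
    impact: str                        # who and what is affected
    workarounds: List[str] = field(default_factory=list)
    fix_eta: Optional[datetime] = None      # None means no ETA yet
    next_update: Optional[datetime] = None  # when the next message is due

    def render(self, audience: str = "internal") -> str:
        """Build a plain-text message for internal or external readers."""
        if audience not in ("internal", "external"):
            raise ValueError("audience must be 'internal' or 'external'")
        lines = [f"[{audience.upper()}] Incident update"]
        # Assumption: technical context is shared with internal teams only.
        if audience == "internal":
            lines.append(f"Context: {self.context}")
        lines.append(f"Impact: {self.impact}")
        lines.append("Workarounds: " + ("; ".join(self.workarounds) or "none yet"))
        lines.append("ETA for fix: " + (self.fix_eta.strftime("%H:%M") if self.fix_eta else "unknown"))
        lines.append("Next update: " + (self.next_update.strftime("%H:%M") if self.next_update else "to be announced"))
        return "\n".join(lines)
```

Calling `render("internal")` and `render("external")` on the same record keeps both audiences aligned on impact, workarounds, and timing, while the message for customers stays free of internal detail.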
Once you have started a war room and the initial communications have been sent, the next step is forming the response team. The best practices here are to avoid including everyone and to be ready to dismiss people who aren’t needed to resolve the issue. Instead, take a “lean team” approach, using the errors and symptoms to guide your selection of team members. When new team members join the war room, be ready to bring them up to speed operationally with the communications summary, so they can quickly become active contributors.
Now it’s time to investigate the root cause or provide a workaround. It’s good practice at this stage to also look at any changes that have recently gone live and could have contributed to the issue. Assigning tasks and agreeing on check-back points is a good way to allow specialists to focus, rather than asking them to juggle constant feedback while taking on technical analysis and solution proposals. Keep notes of all activities to inform the post-incident debriefing process.
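The check for recently deployed changes can be sketched as a simple time-window filter. This is an illustration only: the `Change` record, its fields, and the 24-hour default lookback are assumptions, and in practice the data would come from whatever change management system the organization uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Change:
    """A deployed change. Field names are hypothetical."""
    change_id: str
    service: str
    deployed_at: datetime


def recent_changes(
    changes: List[Change],
    incident_start: datetime,
    lookback: timedelta = timedelta(hours=24),  # assumed default window
) -> List[Change]:
    """Return changes that went live in the window before the incident,
    most recent first, as candidates for closer inspection."""
    window_start = incident_start - lookback
    candidates = [
        c for c in changes
        if window_start <= c.deployed_at <= incident_start
    ]
    return sorted(candidates, key=lambda c: c.deployed_at, reverse=True)
```

Sorting newest first reflects a common working assumption that the change closest to the start of the incident deserves the first look. It is a starting point for the investigation, not proof of cause.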
If there is little progress, or if the impact on users is increasing beyond what is considered acceptable, it’s time to alert executive management. Use the notes and communications to help the management team assess the issues and decide on a course of action.
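The two escalation triggers described here, stalled progress and unacceptable user impact, can be written down as an explicit check so the decision does not depend on mood in the room. The function name, parameters, and thresholds below are hypothetical; each organization would set its own limits.

```python
from datetime import datetime, timedelta


def should_escalate(
    now: datetime,
    last_progress_at: datetime,
    impacted_users: int,
    max_stall: timedelta = timedelta(minutes=30),  # assumed threshold
    max_impacted_users: int = 1000,                # assumed threshold
) -> bool:
    """Return True when executive management should be alerted."""
    # Trigger 1: no meaningful progress for at least max_stall.
    stalled = (now - last_progress_at) >= max_stall
    # Trigger 2: user impact has grown beyond the acceptable limit.
    impact_too_high = impacted_users > max_impacted_users
    return stalled or impact_too_high
```

Running a check like this at each agreed check-back point gives the war room leader a consistent basis for deciding when to escalate.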
These four steps can be repeated until service is restored. The main goals for running a war room are to remain cool, be objective, and facilitate a speedy recovery from an IT issue. Using a framework and having a process in place will help ensure that all members of an IT organization know how to handle a major incident and contribute to better recovery times.
About the author
Adam Frank is a product and technology leader with more than 15 years of AI and IT Operations experience. His imagination and passion for creating AIOps solutions are helping DevOps and SREs around the world. As Moogsoft’s VP of Product & Design, he's focused on delivering products and strategies that help businesses to digitally transform, carry out organizational change, and attain continuous service assurance.