Sysadmins don’t carry pagers anymore, so why do incident response processes still assume dumb display-only devices? Go mobile.
The idea of IT that many people carry around in their minds is distorted in many ways, but one aspect is particularly interesting: Whenever anything especially technical is happening, it is probably happening in a text-only command line interface. This is true even in a world of improbable computer interfaces as envisioned by Hollywood.
There are all sorts of interesting reflections on the relationship between the command-line interface (CLI) and its better-known descendant, the Graphical User Interface (GUI). Neal Stephenson literally wrote the book on this topic, with In the Beginning… Was the Command Line.
Meanwhile, in Enterprise IT…
However, this is a blog about IT Operations in particular, and what interests us is what happens in the real world. At Moogsoft, one of our central assumptions is that IT infrastructure growth has long since sailed past the point where artisanal administration of individual devices over an SSH connection was feasible or even desirable. So while CLIs may come into play, they are no longer the main mechanism for managing IT. Instead, system, network, and application administrators use a variety of specialised tools that present information that is useful for a particular task. With the cultural shift to DevOps bringing increased participation in IT Operations from roles that might formerly have been exclusively focused on engineering, even more tools and perspectives get added to the mix.
This is all fine and good, as long as there is sufficient time to reconcile what might be very different views of the state of the environment. Problems begin to appear as the rate of change begins to rise, moving asymptotically towards the continuous-change model of CI/CD. Understanding potential impacts and diagnosing problems becomes harder and harder as the environment continues to change around the Ops team, including through automated processes — and of course each change may itself cause an outage.
The consequences can include people getting burned out from the cognitive load of maintaining a coherent picture of the state of business-critical IT systems, as well as slowdowns or even failures of support processes, leading to incidents and even outages. It’s all well and good to say “developers should carry pagers,” but if they’re getting paged every night, they’re not going to be doing their best work the next day.
How to Avoid Notification Storms and Operator Burn-Out?
The answer that we came up with combines three things: a rich GUI that shows people only the information they need right away, intelligent AIOps software that involves the right people at the right time, and now a powerful mobile user interface, so that specialists can easily diagnose and even resolve incidents in their areas of responsibility, right from the screen of their smartphone.
The biggest contributing factor to incident duration is wait times: waiting for a human to notice an alert and react to it, waiting while that human locates and communicates with colleagues in other teams, waiting for an escalation to be picked up. There are also many shorter wait times dotted throughout the lifecycle of an incident. Nobody carries actual pagers anymore, but if your notification process involves sending an SMS to someone, who then needs to find a “real” computer, log in, and navigate to whatever incident it was that triggered the notification, there is a lot of time being wasted there.
It’s also true that most “real” computers still don’t come with cellular connectivity, so getting online from one of those may take even more time. All the while, the pocket supercomputer that we all carry around with us is right there, vibrating itself off the table with incoming alerts and queries from users. (Seriously: an iPhone X can apparently beat a 13-inch MacBook Pro in single-core performance.)
So why not just use that pocket computer, and take a bunch of unnecessary friction out of the process?
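To make the wait-time argument above concrete, here is a minimal Python sketch that adds up how much of an incident is spent waiting rather than working. The `Stage` class, the `waiting_share` function, and every duration in the timeline are hypothetical and chosen purely for illustration; they are not measurements and not part of any Moogsoft API.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One step in an incident timeline (hypothetical model)."""
    name: str
    minutes: float
    waiting: bool  # True when nobody is actively working the incident


def waiting_share(stages):
    """Return the fraction of total incident time spent in waiting stages."""
    total = sum(s.minutes for s in stages)
    waiting = sum(s.minutes for s in stages if s.waiting)
    return waiting / total if total else 0.0


# Illustrative numbers only, not real data.
timeline = [
    Stage("alert sits unnoticed", 10, True),
    Stage("find a 'real' computer and log in", 8, True),
    Stage("locate colleagues in other teams", 12, True),
    Stage("diagnose", 15, False),
    Stage("wait for escalation to be picked up", 5, True),
    Stage("fix and verify", 10, False),
]

total_minutes = sum(s.minutes for s in timeline)
print(f"{waiting_share(timeline):.0%} of {total_minutes:.0f} minutes spent waiting")
```

Removing or shortening a waiting stage in a model like this, for example the "find a 'real' computer" step, shows how much of the overall duration is friction rather than actual diagnosis or repair.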
Fluid, Mobile-Enabled Incident Response
With the general availability of Moogsoft AIOps’ mobile interface, we enabled people to diagnose and resolve incidents directly on their smartphones. Now specialists can respond to a notification on the same device where they received it, jump straight into a Situation Room to collaborate with their colleagues from other disciplines, review the alerts that triggered the incident in the first place, and even use ChatOps features to diagnose and resolve the issue — all from their smartphone, perhaps without even having to get out of bed.
That last point is critical: People need to be able to take action, too, not just chat and look at reports. While SSH apps for smartphones exist, and can even be useful in an emergency, they are hardly a pleasant experience. If, on the other hand, you need to go to your laptop to actually solve the problem, you’ve just put all that friction right back into the process. Instead, with Moogsoft AIOps, you can solve the problem from your phone, put it back on its wireless charging pad, turn over, and go back to sleep.
If that scenario sounds too futuristic, you may need to revisit your evaluation criteria for enterprise IT Operations software. It’s 2018, and anything that forces operators to log in to a desktop-only interface (let alone one that requires Flex, Java, or *shudder* a fat client) is not just behind the times, but an annoyance to both operators and users.
Accelerate response times, make sysadmins happier, and solve problems before users are even aware of them — now that sounds like the sort of future we should be working towards.
About the author
Dominic Wellington is the Director of Strategic Architecture at Moogsoft. He has been involved in IT operations for a number of years, working in fields as diverse as SecOps, cloud computing, and data center automation.