I recently watched a great documentary on PBS about NASA’s moon landing missions. I then started to think about IT incident management (funny how my mind works). You might be wondering, “How does managing IT incidents have anything to do with an astronaut trying to safely land a lunar capsule on the moon?” Well, both situations involve quickly reducing data to see what’s most important.
For instance, operators in both situations look into a sea of data. Flooded with hundreds of events and alerts per second, IT command center operators often lose situational awareness and struggle to manage the IT environment. Astronauts face a similar challenge. Barraged by a dizzying array of instrument panels, they must quickly find the signal amid all the noise to land the spacecraft safely.
There is a great lesson here from the courageous and brilliant fighter pilot Mary “Missy” Cummings. Ms. Cummings was one of the Navy’s first female fighter pilots and later became a professor at MIT, researching human interaction with complex and autonomous systems. She designed a stunningly simple instrumentation concept, shown in the picture above: a “waving arm” instantly tells the astronauts whether the landing will be safe or unsafe, with the most relevant and accurate readings of altitude and speed as a background reference.
The result? A simple, reduced display of the most important information leading to a real-time decision process for astronauts to land safely, every time.
The lesson learned? Deciding “what NOT to show” is essential to achieving real-time situation awareness and making the correct assessment fast.
How This Relates to IT Operational Teams
This lesson is particularly relevant for IT operational teams. To keep up with the sheer volume, velocity, and variety of events emitted across the entire IT environment, they need tools that automate event data processing and give early warning of anomalies as they unfold amid a sea of noise. When teams are trying to find and resolve incidents before they become major, knowing where to land in real time is critical. Given all the domains affected by an unfolding incident, it’s crucial to know which alarms are causal and which are collateral. The application, the database cluster, the hypervisor, and the storage cluster are all alarming, yet which should you drill into first?
Following Ms. Cummings’s logic, if you start with a clean sheet of paper and write down your ideal criteria for an automated incident management tool, the list starts to look something like this:
- It scales across millions of events in real time.
- It ingests event feeds across the entire IT stack – from the application to the physical infrastructure, from each domain-specific monitoring tool, and across the hybrid of public and private cloud.
- It has no dependence on static rules or models – landing situations are always changing and agility is essential.
- It assumes no “single root cause”. Like Ms. Cummings’s brilliant landing display, it uses algorithms to automatically correlate events and alarms, from application and middleware to compute, network, storage, and databases, and immediately synthesizes them into “safe” or “unsafe” feedback. How would you like a 100x higher signal-to-noise ratio?
- It notifies the appropriate experts to collaborate in a virtual “situation room”, giving them all the relevant sets of readings to share and comment upon.
- It integrates tightly with major IT Service Management platforms for problem remediation and ticket coordination.
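To make the idea of deciding “what NOT to show” a little more concrete, here is a minimal Python sketch of the reduction step: many raw events go in, and a handful of situations with a “safe” or “unsafe” label come out. Everything in it is an assumption made for illustration, including the `Event` and `Situation` names, the 0 to 5 severity scale, the severity threshold, and the fixed 30-second gap used to group events. It is not how any particular product correlates events, and a tool that meets the criteria above would not lean on a static rule like a fixed time gap.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    timestamp: float  # seconds since an arbitrary reference point
    source: str       # e.g. "application", "database", "storage"
    severity: int     # hypothetical scale: 0 = informational ... 5 = critical


@dataclass
class Situation:
    events: list = field(default_factory=list)

    @property
    def sources(self):
        # Distinct domains involved in this situation, in sorted order.
        return sorted({e.source for e in self.events})

    @property
    def status(self):
        # Hypothetical threshold: any event at severity 4 or above is "unsafe".
        worst = max(e.severity for e in self.events)
        return "unsafe" if worst >= 4 else "safe"


def group_into_situations(events, max_gap=30.0):
    """Group events that arrive within max_gap seconds of the previous event."""
    situations = []
    for event in sorted(events, key=lambda e: e.timestamp):
        last = situations[-1] if situations else None
        if last and event.timestamp - last.events[-1].timestamp <= max_gap:
            last.events.append(event)
        else:
            situations.append(Situation(events=[event]))
    return situations


if __name__ == "__main__":
    sample = [
        Event(0.0, "application", 3),
        Event(5.0, "database", 5),
        Event(12.0, "storage", 4),
        Event(300.0, "hypervisor", 1),
    ]
    for situation in group_into_situations(sample):
        print(situation.status, situation.sources, len(situation.events))
```

With the sample data, the four raw events collapse into two situations: one “unsafe” situation spanning the application, database, and storage domains, and one “safe” situation for the lone hypervisor event. The operator sees two lines instead of four alarms, which is the same trade the landing display makes.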
Brilliant minds think alike. Phil Tee and the founding Moogsoft team invented and commercialized Netcool (now part of IBM Tivoli), along with the whole Manager of Managers (MoM) concept. Twenty years later, Phil and the crew have influenced the IT industry yet again with Incident.MOOG, the first ever next-generation MoM solution. You don’t need to be a rocket scientist to retain situational awareness in your IT environment: now there’s Moogsoft.
Want to learn more about Moogsoft’s products? Contact us at firstname.lastname@example.org for more information, and be sure to connect with us on LinkedIn, Twitter, Facebook, and Instagram to stay up-to-date with Moogsoft. You can also sign up for our monthly newsletter, Across Silos.