Just Maintaining Availability? Try Building Stability
Richard Whitehead | November 29, 2022

Today’s customers see availability as a given. What do they really want? Bigger, better technology with new features and faster platforms.

But, according to our recently released Moogsoft State of Availability Report, teams burn their time, money and energy on incident management. In fact, engineers overwhelmingly report that incident management takes up most of their time.

[Chart: teams' time spent on daily responsibilities]

With so much investment in simply keeping their systems alive, teams lack the time to proactively optimize their infrastructures. And this becomes a vicious cycle — fragile systems generate more incidents and more incidents take up more time. As a result, engineers cannot prioritize increasing the infrastructure resilience that will free time for more innovation and value creation.

How to build tech stability

  1. Take stock of your IT ecosystem’s current state.
    1. The first step in building tech stability is truly understanding your IT stack. This foundational work will set you up for the five remaining steps.
    2. Understand your organization’s business goals as they relate to availability.
    3. Determine which apps, services and infrastructure are essential to your organization.
    4. Analyze your availability targets like KPIs and service level agreements (SLAs).
    5. Review your monitoring tool stack, including each solution’s usage, maintenance requirement and licensing fee.
  2. Reevaluate your KPIs.
    1. The truth will set you free, and it will help you create transparency and efficiencies. But if you’re like most teams, your managers are in the dark about your team members’ everyday activities. And your team does not measure meaningful data like mean time to detect (MTTD) and mean time to recovery (MTTR), meaning you do not know where you are losing time. (Spoiler alert: MTTD and MTTR account for a significant 90 minutes of the average incident lifecycle.)
    2. Create transparency around your work distribution (especially between managers and team members) by tagging tickets in your ticketing tools with categories like unplanned work, platform improvements, new features and tech debt.
    3. Track MTTD and MTTR and prioritize reducing these phases of the incident lifecycle.
    4. Measure the number of times customers flag an issue in addition to your other customer sentiment KPIs.*
      *Limit those KPIs — our research shows that fewer KPIs lead to higher performance and higher levels of availability.
  3. Shrink your tool stack.
    1. Teams run an average of 16 monitoring tools (and up to 40!), so you likely have a lot of point solutions. Disparate monitoring tools are not only expensive in licensing fees and in maintenance and management time, but they also slow MTTD and MTTR by siloing information.
    2. Rank your monitoring tools by value.
    3. Get rid of less valuable tools and invest in the ones that help you meet availability goals.
    4. Reduce the time you spend managing and maintaining tools while saving money on licensing fees and decreasing noise and alert fatigue.
  4. Prioritize noise reduction.
    1. If you’re stuck in unfulfilling, time-consuming monitoring cycles, try artificial intelligence for IT Operations (AIOps). An AIOps solution converges all data from across your point solutions to detect incidents sooner, reduce noise, correlate alerts and facilitate collaboration across the incident workflow.
    2. Implement an AIOps solution that connects your monitoring tools and reduces alert noise.
    3. Align leadership and teams with an AIOps platform’s single view of monitoring data and insights.
    4. Use the AIOps technology to track data on unplanned work.
  5. Pay down technical debt.
    1. As you start building system stability, you can dedicate time to further improving the IT ecosystem. Start with pre-production environments before moving on to the production environments.
    2. Use chaos engineering experiments to test the resilience of your digital apps and services.
    3. Leverage AIOps insights to determine where tech debt most affects your organization.
    4. Automate toil to release more time for your engineering team.
  6. Invest in the future.
    1. Your now forward-looking organization must relentlessly innovate the customer experience. And it must continue investing in DevOps capabilities and your system’s capacity to withstand turbulent conditions.
    2. Compare how frequently your teams and tools catch incidents versus how frequently your customers flag these issues — and report on your improvement.
    3. Push a DevOps culture and adopt DevOps capabilities.
    4. Keep your focus on the customer experience.
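
The tracking called for in steps 2 and 6 (MTTD, MTTR, and how often customers flag an issue before the team does) comes down to simple arithmetic over incident timestamps. The sketch below is one hedged way to compute it, not a Moogsoft API. The `Incident` record, its field names and the sample data are all hypothetical, and it assumes MTTD is measured from fault start to detection and MTTR from detection to recovery; your organization may define those phases differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime           # when the fault began
    detected: datetime          # when the issue was first noticed
    resolved: datetime          # when service was restored
    flagged_by_customer: bool   # True if a customer reported it first


def mttd_minutes(incidents):
    """Mean time to detect: average of (detected - started), in minutes."""
    return mean((i.detected - i.started).total_seconds() / 60 for i in incidents)


def mttr_minutes(incidents):
    """Mean time to recovery: average of (resolved - detected), in minutes.

    Assumption: recovery time is counted from detection, not from fault start.
    """
    return mean((i.resolved - i.detected).total_seconds() / 60 for i in incidents)


def customer_flag_rate(incidents):
    """Share of incidents that customers flagged before the team caught them."""
    return sum(i.flagged_by_customer for i in incidents) / len(incidents)


# Made-up sample data, for illustration only.
t0 = datetime(2022, 11, 1, 9, 0)
sample_incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=40), False),
    Incident(t0, t0 + timedelta(minutes=30), t0 + timedelta(minutes=90), True),
]
```

With this sample data, `mttd_minutes(sample_incidents)` is 20 minutes, `mttr_minutes(sample_incidents)` is 45 minutes and `customer_flag_rate(sample_incidents)` is 0.5. Reporting the same three numbers each quarter is one way to show improvement over time.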

If teams want to move past just “keeping the lights on” to push higher organizational performance, they must reduce the time spent on monitoring and incident management. And the answer could lie in domain-agnostic AIOps. By connecting point solutions, AIOps gives teams insights from across the entire IT stack. And the technology’s informative and actionable data enables them to streamline incident response and to detect and remediate issues faster. All of this frees precious time, time that could be spent paying down technical debt, automating toil and further improving availability.
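
To make "reducing noise" and "correlating alerts" concrete, here is a deliberately simple sketch of the underlying idea: alerts from different monitoring tools that concern the same service and arrive close together are folded into one group, so responders see one item instead of many. This is a toy time-window rule and not how Moogsoft or any particular AIOps product works. The `Alert` fields, the `group_alerts` helper and the five-minute window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    """Hypothetical alert record; field names are illustrative only."""
    source: str     # monitoring tool that raised the alert
    service: str    # affected service
    message: str
    at: datetime    # time the alert was raised


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Group alerts per service when each arrives within `window` of the
    previous alert in that service's most recent group."""
    groups = []
    latest = {}  # service name -> index of that service's most recent group
    for alert in sorted(alerts, key=lambda a: a.at):
        idx = latest.get(alert.service)
        if idx is not None and alert.at - groups[idx][-1].at <= window:
            groups[idx].append(alert)    # same burst: fold into existing group
        else:
            groups.append([alert])       # new burst: start a new group
            latest[alert.service] = len(groups) - 1
    return groups


# Made-up sample data: two tools report the same checkout problem.
t0 = datetime(2022, 11, 1, 9, 0)
sample_alerts = [
    Alert("tool-a", "checkout", "latency high", t0),
    Alert("tool-b", "search", "error rate up", t0 + timedelta(minutes=1)),
    Alert("tool-b", "checkout", "5xx spike", t0 + timedelta(minutes=2)),
    Alert("tool-a", "checkout", "latency high", t0 + timedelta(minutes=30)),
]
sample_groups = group_alerts(sample_alerts)
```

Here the four sample alerts collapse into three groups: the two checkout alerts raised within minutes of each other are merged, while the search alert and the later checkout alert stand alone. Real correlation typically draws on far richer signals than service name and timing.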

Interested in building your tech stability? Take Moogsoft’s AIOps solution for a spin.



About the author


Richard Whitehead

As Moogsoft's Chief Evangelist, Richard brings a keen sense of what is required to build transformational solutions. A former CTO and Technology VP, Richard brought new technologies to market, and was responsible for strategy, partnerships and product research. Richard served on Splunk’s Technology Advisory Board through their Series A, providing product and market guidance. He served on the Advisory Boards of RedSeal and Meriton Networks, was a charter member of the TMF NGOSS architecture committee, chaired a DMTF Working Group, and recently co-chaired the ONUG Monitoring & Observability Working Group. Richard holds three patents, and is considered dangerous with JavaScript.
