ITOps teams that have invested a lot of time and money in monitoring tools can be reluctant to let them go. But what if they could make those tools better?
Having good monitoring tools is incredibly important in IT operations. Whether you work in a small mom-and-pop shop, or in the largest enterprise in the world, flying blind is never a good idea. Typically, each group within an IT organization will have their own tool or set of tools, with very little collaboration across groups and toolsets.
It’s as if each group within IT is in a swim race, swimming in their own lane, unaware of what the others are doing. The “winner” of this race is the one that is able to claim “incident innocence.” It’s like that unofficial metric you may have heard of, MTTI — Mean Time To Innocence, or how long it takes to rule out your own area of responsibility as a potential root cause of an incident under analysis. Anything that can shorten MTTI helps overworked ITOps teams zoom in on the actual root cause, instead of wasting time on symptoms.
Modernizing IT Operations
I grew up in IT at a large services and software company, where my division built and hosted property and casualty insurance systems for midrange and large insurance companies. We managed about 1,200 virtual and physical servers for about 75 customers. I spent almost 16 years, from 1998 to 2014, as a developer and systems administrator in the infrastructure and operations group. In the early days, we had no tools and a highly manual workflow for managing incidents. About 80% of our issues were customer-reported. As time went on, we added more and more monitoring tools. Back in 2002, I built our ticketing system and the system that automated and tracked all our code promotions, which is still in use today.
Back then it was not uncommon for the Ops team to be on call for up to 36 hours, working around the clock to resolve outages and fix system problems. We were highly reactive, which put a heavy burden on support staff. In 2012 my boss asked me to modernize our operations, with a focus on two things: monitoring and automation. I then led the technical evaluation of various products, including an APM solution. We ended up purchasing the APM products that we determined best fit our needs: Dynatrace AppMon (container-based agent monitoring) and Dynatrace DC RUM (a passive network device that decodes all network traffic).
Dynatrace was a game changer for us. Our existing monitoring was fairly basic, and the products we used were based on remote metric collection. Adding more advanced APM from Dynatrace changed the way we built and supported our software, very much for the better. We ended up with a tool that allowed us to see 100% of our transactions, from browser click to code execution. We could see each and every SQL statement, and had visibility into browser and code errors. It was a level of visibility we had never had before: when users called our help desk, we could go back through the history and see exactly what the user clicked on, along with the associated code execution on the back end.
However, there was still something missing. As with any powerful piece of software, there was a bit of a learning curve to using Dynatrace. As events started to fire and management was alerted, users would often come directly to me for answers, instead of figuring out how to use the product. So I embarked on a series of training sessions. I probably ran around 50 training sessions over two years, trying to help users get the most out of Dynatrace.
Even with my help, though, using Dynatrace properly was still too difficult for many. Users were getting spammed by alerts and didn’t know what to do. The data was great, but there was so much noise that it was tough to wade through it all. Additionally, the disconnected IT silos were still there, so collaboration between teams was still a problem. We would spend countless hours on bridge calls at all hours of the day and night, which transitioned into war rooms where our teams would spend days or even weeks trying to fix an outage. If you have ever been in this position, you know that it is extremely frustrating, not to mention that the war rooms started to get fairly rank, fairly quickly.
A New Hope: How To Take Full Advantage Of Dynatrace
I left that organization in 2014 to work for Dynatrace as a Sales Engineer, and then left Dynatrace last year to join Moogsoft. Had I known about Moogsoft while at that large services and software company, it could have completely changed our incident management process and, as a result, reduced our MTTR.
Moogsoft uses AI to reduce operational noise and cluster related alerts into Situations. A Situation is an actionable item that is worked in a fully collaborative manner, breaking down the IT silos. All monitoring data feeds into Moogsoft, and actionable Situations are intelligently assigned to the groups of individuals responsible for the services a Situation impacts. In addition to the virtual war room, the platform also allows users to introduce automation. If we had had Moogsoft, I could have integrated it with our automation platform to handle repetitive tasks, instead of users having to reinvent the wheel every time a repeat outage occurred.
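To make the idea of clustering alerts concrete, here is a minimal Python sketch. It is not Moogsoft's algorithm, which this article does not describe. It is a deliberately simple stand-in that groups alerts for the same service when they arrive within a short time window, and drops exact repeats. The names Alert, Situation and cluster_alerts, the alert fields, and the five-minute window are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    """One event from a monitoring tool (hypothetical fields)."""
    source: str          # which tool raised the alert
    service: str         # which service it affects
    message: str
    timestamp: datetime


@dataclass
class Situation:
    """A group of related alerts that one team can work as a single item."""
    service: str
    last_seen: datetime
    alerts: list = field(default_factory=list)


def cluster_alerts(alerts, window=timedelta(minutes=5)):
    """Group alerts per service using a sliding time window.

    An alert joins the open situation for its service if it arrives
    within `window` of the previous alert for that service; otherwise
    it starts a new situation. Exact repeats (same source and message)
    keep the situation open but are not stored again.
    """
    situations = []
    open_by_service = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_by_service.get(alert.service)
        if current is None or alert.timestamp - current.last_seen > window:
            current = Situation(service=alert.service, last_seen=alert.timestamp)
            situations.append(current)
            open_by_service[alert.service] = current
        current.last_seen = alert.timestamp
        is_repeat = any(
            a.source == alert.source and a.message == alert.message
            for a in current.alerts
        )
        if not is_repeat:
            current.alerts.append(alert)
    return situations
```

Even a crude heuristic like this shows the payoff: a burst of repeated alerts from several tools collapses into one item per affected service. A real AIOps platform would go well beyond a fixed time window.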
I still keep in touch with my colleagues at the large services and software company, and they still suffer from Dynatrace alert fatigue. It is still painful to hear about, and as much as I wanted to believe that we were unique, I encounter the same issues almost every day when talking to Moogsoft prospects. Having great monitoring tools like Dynatrace is critical, but they can only take you so far. Having a single view across all of your monitoring, and using AI to reduce noise and cluster related alerts, is no longer something that is nice to have — it is essential.
In 2017 Gartner said that AIOps market penetration is about 10%. Within four years that number will grow to over 50%. You owe it to yourself and your organization to learn about what an AIOps solution can do for your bottom line. I encourage you to compare Moogsoft with any other vendor so you can see for yourself why we are the clear leaders in the space.
About the author
Brien Lay is a Strategic Architect at Moogsoft. Prior to working at Moogsoft, Brien was a Sales Engineer at Dynatrace, and prior to that he was a developer and systems administrator at CSC. Brien is also a former Army officer, and is a graduate of The Citadel.