I started my career working L1 Helpdesk for Opodo.com back in 2001, when we launched the site in Germany. Looking back, I had it easy. Our application environment was simple: it was basically a monolithic Broadvision app connected to a bunch of booking APIs and a backend Oracle database. When the app failed (and it did fail) I had about five phone numbers to call: L2 application support, network, the hosting provider, the booking system and the ticketing system. Essentially, my job was to call everyone up, ask them if everything was okay, and then chase them until I got an answer. That approach generally worked, although it p*ssed off everyone involved. 😉
Now fast forward the clock to today. I generally feel enormous sympathy for people who run L1 Helpdesk now. Working Helpdesk is like being in the trenches; you’re constantly being shot at, and asked to run in different directions until you find the people responsible for all the carnage and chaos. Today, that same L1 Helpdesk I ran probably has hundreds of incidents, with hundreds of people involved, all triaging the components and dependencies of a modern application architecture and IT environment.
The Software-Defined Business
This got me thinking: every business wants to become a “Software-Defined Business” and is bankrolling initiatives like Agile, Cloud, Mobile, Big Data and microservices to compete. All of this sounds good for the people developing these modern apps (lots of cool technology…), but what about the poor people who have to support it all? I decided to summarize my thinking in two visualizations: the table and the illustration below.
The table below highlights the complexity of being a Software-Defined Business. The cost of being agile and competitive in the marketplace can have a significant impact on the amount of scale and change that IT support teams need to manage. The answer to this problem can’t always be “hire more people”.
The illustration below visualizes what enterprise IT support teams now need to manage on a daily basis as a result of the business moving faster. At Moogsoft, this is what we hear nearly every single day from customers trying to tame this complexity.
Can IT Support Today Really Cope?
I mean really, can they? Forrester claims that 74% of end-user problems are not detected by IT. That’s a scary number. Having worked Helpdesk in the past, and having spent many years building web applications, I have no doubt this is true. Look at the complexity, change and scale above that IT support teams now have to deal with. And the bad news? It’s getting worse every single day. Even with the best application monitoring tools in the world, IT support teams still need to acknowledge alerts and events and spend time analyzing the billions of metrics being collected every minute. Herein lies the problem and the bottleneck: the human.
Rise of the Machines
There is only so much information we as humans can observe, digest and interpret at any one time. I’ve noticed a lot of the new Application Performance Monitoring (APM) startups believe that real-time dashboards are the future, where operators manually pick from thousands of different metrics and overlay them on time-series charts, hoping to spot anomalies. This all sounds good in theory, except it’s completely unmanageable in reality when your applications have thousands of components, billions of metrics and hundreds of changes every day. Humans can no longer cope with this complexity; they now need machines to do the legwork, process this Big Data and provide operational insight into what the f**k is going on.
Will We Trust the Machines?
DevOps is all about automation and being agile so that the business benefits. As humans we like to think we have all the answers, or at least that we are capable of finding those answers from other humans. I spent the best part of ten years monitoring, troubleshooting and optimizing applications. I created more custom dashboards than Batman and spent more time looking at charts than a Wall Street trader. I can tell you, it’s bloody hard work, and there are so many pitfalls it can be overwhelming at times. Does all this manual work make sense in today’s automated world of Agile, Cloud and DevOps? Nope. We’ve spent the past decade automating server provisioning, release management and the configuration of monitoring tools, so why can’t we now automate the troubleshooting of applications using machine learning and analytics? Do we humans now have a choice? Can we really keep up with the complexity, scale and change of the applications we now develop and support?
At Moogsoft, we’re trying to help DevOps teams do this, by applying machine learning to the way they detect, troubleshoot and manage applications and infrastructure. It’s going to be a tough journey to convince humans to embrace machine insights, but hey, if it were easy, everyone would be doing it!