This is the eighth in a series of blog posts exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this chapter, our IT Operations lead, James, implements a new service desk and relies on Moogsoft integrations to supercharge incident management.
It’s been a month since Dinesh and I humbly high-fived on our way out of the meeting where Charlie and Lucia gave us the green light to roll Moogsoft out across the whole of C&Js, and I’m feeling a little weary. Change is hard. I’ve also made it harder on myself by persuading Charlie that we should migrate our service desk solution too.
We’ve been using our existing one for as long as I can remember, and some of the problems we have with it aren’t really the technology itself, more our specific implementation and our past adoption of heavyweight IT service management processes. Charlie justified this major change on the basis that he’s sick to the back teeth of hearing people complain about it, and it’s a hefty lump out of his annual IT budget. He keeps calling it technical debt, which I’m not sure is quite right, but I’ll go along with it as a justification. My motivation is more that I’ve seen how Animapanions are using a solution that ties right into their backlog, so the line between a new requirement and a support ticket becomes completely blurred. As Sarah explained to me, it’s a cycle. Problems so often need to become backlog items and be included in the sprint cycles that it makes sense to have them tightly, almost invisibly, integrated. And we get full traceability through the value stream, so we can measure cycle time and other flow metrics.
Sarah also showed me their integration with Moogsoft, and how she was now creating and updating service desk issues straight from an open Moogsoft situation, plus a very cool auto-assign feature. We’ve found it hard to set up workflows and integrations with our own solution, but with this one, it’s pretty much just a webhook.
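To make that idea concrete, here’s a minimal sketch of what the receiving end of such a webhook might do: map an incoming alert payload to a service desk ticket and auto-assign it. The field names (`situation_id`, `severity`, `service`) and the assignment rule are my own illustrative assumptions, not the actual Moogsoft payload schema or any particular service desk API.

```python
import json

def situation_to_ticket(payload: dict) -> dict:
    """Map a hypothetical alerting-tool webhook payload to a ticket dict."""
    sev_map = {1: "critical", 2: "major", 3: "minor"}
    return {
        "title": payload.get("description", "Untitled situation"),
        "priority": sev_map.get(payload.get("severity"), "minor"),
        "source": "moogsoft",
        # carry the alerting tool's ID so the two systems stay linked
        "external_id": payload.get("situation_id"),
        # illustrative auto-assign rule: route by affected service
        "assignee": {"finance": "jeff"}.get(payload.get("service"), "triage"),
    }

example = {
    "situation_id": "SIT-1042",
    "severity": 1,
    "service": "finance",
    "description": "BACS connectivity lost in Austin DC",
}
print(json.dumps(situation_to_ticket(example), indent=2))
```

In a real deployment this mapping would sit behind an HTTP endpoint registered as the webhook target; the point is that the translation layer is a few lines of code rather than a heavyweight integration project.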
So it all made a lot of sense until I had to get other people involved. I am now, officially, a herder of humans. I cannot tell you how many times I have had the same conversation in the past two weeks, or how often I’ve heard the same objections from people who, almost in the previous breath, said they would rather die than keep working with the existing system. Take my counterpart supporting the finance system, for example. Jeff has been with C&Js nearly as long as I have and has some serious ITSM knowledge and skills, but, well, let’s just say we haven’t always seen eye-to-eye. And I was having almost the same conversation in nearly every department.
“But we’ve always done it this way,” he said. “If it ain’t broke, why fix it?”
I took a deep breath, trying not to sound rehearsed even though I’d said the same thing to umpteen people that week.
“Well, what if there’s a better way? What if we could spot a failing system so fast that we could fix it before it even became an incident?”
“Ha, that’s crazy talk. How can we know something’s broken before it’s broken?”
“Because we can see when it’s starting to break.”
“But if we know it’s going to break, why haven’t we already fixed it?”
“Because it’s the first time it’s happened. But the system’s clever enough to tell us that it’s seeing something unusual that we should probably look into. For those things that have broken before, if we can’t fix the underlying issue, we can use automation to trigger remediation. But we want to be careful with that as it’s just like a sticking plaster really.”
He agreed to spend some time with me looking at a demo, so I started setting up a pilot for him and his team; I could show him how to migrate at the same time. I was just doing some final checks when an alert popped up in Moogsoft, and I saw Jeff frantically waving to a couple of his engineers, shouting across the office to where they were making coffee that there was a severity one incident.
I picked up my laptop and wandered over to his desk.
“May I assist?” I asked, as he frantically clicked on dashboards on his multiple screens.
“Sure,” he said, not even looking up. “Just about to create a ticket.”
“I’ve already done that,” I said. Now he looked.
“It’s just one click from the alert in Moogsoft,” I said, and showed him. He opened his mouth to ask me something but I cut him off. “Let’s not waste any more time on that now though, if it’s a severity one, which it looks like it is.” I nodded at the swathe of red across his monitoring dashboard. “We should focus on solving it.” I pointed at the alert on my screen.
He was distracted though, by one of his support engineers shouting from his desk:
“The APM’s saying it’s the application. It’s lost connectivity to the BACS service running in the Austin data center.”
Then his network engineer chimed in.
“We’ve lost one of the firewalls in the Austin data center.”
“Well, which is it?” asked Jeff as his mobile started ringing. I could see from the caller display that it was our CIO, Charlie. He took a breath and answered it. Charlie, amiable as he is, is not at his most fun to deal with during a severity one incident. He becomes almost robotic and demands fixes that match the SLA regardless of what’s going on behind the scenes.
“Charlie,” said Jeff. “Yes, we are on it. Not yet, no. We can’t be sure of that. Well, we’d be hopeful of that but since we don’t know what it is yet…” I could hear a raised voice and was certain I caught the name of our CEO. There was a reason Charlie couldn’t take incidents lying down. He had his own boss and his pay was tied to system performance. “Um… Actually, he’s already here,” Jeff swung in his seat to look at me. “Ok. We’ll call you back.”
“That was Charlie. He said to ask you to help. And also we have to fix it in the next 22 minutes.”
“Take a look at this,” I said, showing Jeff the Moogsoft alert and beckoning the engineers over. “Those alerts you’re seeing are correlated. I’m certain they’re part of the same incident. Can we take a look at the firewall?”
“Sure,” the network engineer said, as Jeff brought up a new screen. “I just manually restarted it.”
As it came back online, we saw the alerts fade on the screen. “I’m guessing the firewall being down blocked application server connectivity in the data center. Any ideas why?”
“I’ll close the ticket,” said Jeff.
“Well, let’s wait a moment until we have an action from this; call it our retrospective,” I suggested.
“Ok, but we only have 17 minutes left on the SLA,” said Jeff.
“Well, it’s a little behind on its patches,” said the engineer. “I’ve had an open ticket in the backlog for weeks now to update it.”
“Prioritize that and do it next,” said Jeff. “Wow. We really have to concentrate on getting that technical debt paid down. We should really get some patch management sorted too. There’s just so much going on, all the time.”
“I’ll update the ticket,” I said, showing Jeff how to do that, and how closing it in Moogsoft closed it in the service desk too. “It’s bidirectional, so no manual duplication for you. That should save you a bit of time.”
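For anyone curious what that bidirectional behavior means in practice, here’s a tiny illustrative sketch: a close on either side propagates to the other, so nobody updates two systems by hand. The class and method names are assumptions for illustration, not the real Moogsoft or service desk APIs.

```python
class TicketSync:
    """Toy model of two-way status sync between an alerting tool and a service desk."""

    def __init__(self):
        self.alert_status = {}   # situation_id -> status in the alerting tool
        self.ticket_status = {}  # situation_id -> status in the service desk

    def open_incident(self, sid: str) -> None:
        # A new situation creates a linked record on both sides.
        self.alert_status[sid] = "open"
        self.ticket_status[sid] = "open"

    def close_in_alerting_tool(self, sid: str) -> None:
        # Closing on one side propagates to the other.
        self.alert_status[sid] = "closed"
        self.ticket_status[sid] = "closed"

    def close_in_service_desk(self, sid: str) -> None:
        self.ticket_status[sid] = "closed"
        self.alert_status[sid] = "closed"

sync = TicketSync()
sync.open_incident("SIT-1042")
sync.close_in_alerting_tool("SIT-1042")
print(sync.ticket_status["SIT-1042"])  # closed on both sides
```

The manual-duplication saving is exactly this: one state change, two systems updated.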
“It will. And I like how it helped us see why we needed to address that particular issue. It’s by no means the only upgrade we’re lagging on, or the only type of technical debt we’re suffering from. Justifying spending time on it is always really hard, though.”
“I bet!” I said. “You can quite easily look for trends now in how often it’s causing you a problem. And the time you save by finding the problem faster, you can use to pay down more technical debt too. Eventually, you’ll find yourself in a place where you have enough time to focus on improvements as well as unplanned work, BAU, and paying down technical debt.”
“That’s the dream,” said Jeff. We completed the rollout of the new service desk with Moogsoft integration in the finance department later that same month.
Want to learn more?
Register and attend the live webinar Intelligent Observability: What The Analysts Say this Thursday, July 8th at 9am PT | noon ET | 5pm BST.
About the author
Helen Beal is a DevOps and Ways of Working coach, Chief Ambassador at DevOps Institute and an Ambassador for the Continuous Delivery Foundation. She provides strategic advisory services to DevOps industry leaders and is an analyst at Accelerated Strategies Group. She hosts the Day-to-Day DevOps webinar series for BrightTalk, speaks regularly on DevOps topics, is a DevOps editor for InfoQ and also writes for a number of other online platforms. Outside of DevOps she is an ecologist and novelist.