This is the fifth in a series of blog posts exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this post, James learns how to connect all his disparate monitoring and alerting tools to streamline his incident management process.
“Morning, mate,” I greeted Dinesh as he walked into the office. “Nice get-up for the big day!” He was wearing a pressed shirt rather than his usual hoodie.
“Thought I’d make an effort, you know,” he grinned.
We’d been planning intensely for this moment for the last week or so – our meeting with Charlie, the CIO, to present the results of our Moogsoft experiments and ask for permission to extend the rollout across the enterprise. We had slides and everything.
Charlie was already in the boardroom when we arrived. It was rare for me to set foot on such hallowed ground; I could probably count the times on one hand despite the 20 years I’d been here. Charlie had company – Lucia, our CFO. The nerves kicked in and I glanced at Dinesh, his forehead unusually shiny. I started praying our numbers added up – not my strongest suit.
We took our seats, distanced carefully around the vast, glossy table, and I handled the chit-chat while Dinesh wrestled with connecting his laptop to the screen. Once he looked settled, I started.
“We are gathered here today,” I began, and was pleased to see smiles wash across Lucia’s and Charlie’s faces – maybe this wasn’t going to be an uphill struggle. “To consider how AIOps can improve our organizational performance.”
I’d been in these situations before, many times, justifying investment in time, energy and money to senior business people who, quite rightly, didn’t necessarily understand the ins and outs of the technology I was proposing, and I’d learned to start at a high level. Some of my bets over the years had been good ones; not all, but enough to keep me here this long. I think it would be fair to say I was a safe pair of hands.
The pace of change and the available tech had exploded in recent years, and it was pretty much impossible to keep up and do ‘the day job’. Having Dinesh here was calculated to counteract any concerns people might have about me being too institutionalized, or not up with the latest and greatest: he was relatively new to the organization, and I had heard Charlie describe him as “a breath of fresh air”. Dinesh had been hired on the strength of his leading-edge, up-to-the-minute skills and, although he sometimes expressed frustration at the pace of change and the high levels of legacy systems, ways of working, and technical debt, he was hired as pivotal to our digital transformation. As such, my assessment that we would be considered trusted advisors on this should be well-founded.
“The key problem we are dealing with here is unplanned work. Our plans to accelerate the flow of value to our customers are stymied when our people are under so much pressure to keep the lights on. We believe we know how to reduce the time it takes to deal with these events and incidents, to allow the teams more time and insights into how to tackle the causes of the unplanned work, pay down the technical debt and start building systems that take care of themselves, that will be reliable, robust and antifragile.”
Lucia raised an eyebrow at the last word I used. I made a mental note to rein the jargon in.
“Dinesh has prepared some data for you to review, based on the experiments we have been undertaking with the cloud-native product, Animapanions, the cloud migration team and some of our legacy systems.” I smiled at him and he sat up just a little straighter as he took the floor and flicked to his first slide on the screen.
“Sarah, the DevOps engineer at Animapanions, has been using Moogsoft for around a month. During that time, they have experienced around twenty incidents -”
“Oof,” said Charlie. “Why so many?”
“They push new code to live on demand, typically around five times a day. They have automated almost all of their unit and integration testing and have made massive inroads into their user acceptance and security testing too. They have two main problems, though, that are giving them around a 20% change fail rate. One is that they depend on a number of third-party systems that they can’t control: some, like payment gateways, are external, and they now include our own, since we moved their warehousing and delivery onto ours.”
“Fair enough,” said Charlie, steepling his fingers. “The other?”
“Despite being a relatively young company, born on the web if you like, and using most of the latest and greatest technology, they do still suffer from technical debt.”
“Ha!” said Charlie. “I doubt they know the meaning of the words! Surely that pales in comparison to what we’ve managed to build up over the decades? And surely they should have known better?” He looked at Dinesh.
“It’s pretty typical actually,” he responded. “Market disruptors like Animapanions usually put themselves under a lot of pressure to get those differentiating features out, and architectural and code shortcuts are often taken. And once they’re done and the increment is released to live, it’s often on to the next new thing that’s going to win them more customers, and those improvements or little fixes often go unattended for quite a while. That’s partly why the SRE role, and my job, exists. You may also have noticed that I didn’t mention performance testing in the context of what they’ve automated in their CI/CD pipeline.”
Lucia was starting to glaze over just a little bit. We were slipping down the jargon slope again.
“So tell Charlie how you helped Sarah fix it,” I nudged Dinesh.
“Right,” he said. “On this slide, what you can see is Animapanions’ MTTR.”
“That’s Mean Time to Recover, Restore – there are a few Rs,” I said for Lucia’s benefit, hoping she wouldn’t find this patronizing. “Basically, how long it takes us to get back up and running when something breaks.”
“Thank you,” she smiled warmly.
“You can see, this data shows it’s dropped to nearly zero.” He flipped to the next slide. “Here’s the internal wiki page where she wrote up the first issue Moogsoft helped her fix. Essentially, a new mobile browser version caused a problem with our API to PayPal, which meant that some users weren’t able to check out. Here’s what Sarah said about this: ‘We would have been there for hours without the Moogsoft AI. It took the data from all of our monitoring systems and made sense of it. It spotlit the problem, took all the usual guesswork and frustration away and guided us almost instantly towards the right fix path.’ She also wrote a public-facing account of her experiences over the last couple of weeks on our engineering blog.”
Lucia looked perturbed. “We do that?” she asked. “Isn’t it a bit dangerous to wash our dirty linen for everyone to see?”
“I asked the engineering teams to do that,” responded Charlie. “There are lots of reasons why: motivating them to share success stories, showing our customers transparency, gaining a reputation for technological advancement to inspire our own and attract new talent…”
“It’s had over fifteen thousand views, and nearly two thousand shares on social media. There are thirty-two comments, all positive,” Dinesh added. “The reduction in MTTR is time straight back for the team. In Sarah’s case, those 20 incidents used to cost the team on average 8 hours’ work each, and the outages themselves would last on average around 20 minutes. So that’s 160 hours saved per month if the effort dropped to zero. Let’s say it’s actually reduced by 7 hours per incident: that’s 140 hours back. Peak trading is worth just shy of $1,000 per minute, so a 20-minute outage costs around $20k per incident, and that’s potentially $400K per month regained through not having the system down.”
“$4.8 million a year?” Lucia leaned in.
“Yes,” I said, “and that’s not even considering the reputational damage and how that impacts customer loyalty, our net promoter score, reviews and referrals.”
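For readers who want to check Dinesh’s sums, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function and variable names are made up for this example, and the inputs are the rounded figures quoted in the meeting (including $1,000 per minute, which rounds up “just shy of”). It also assumes, as Dinesh does, that every incident is a 20-minute outage at peak trading.

```python
def estimate_savings(incidents_per_month, hours_saved_per_incident,
                     outage_minutes, cost_per_minute):
    """Return the monthly and yearly figures from Dinesh's slide.

    All names here are hypothetical; the model is deliberately simple.
    """
    hours_back = incidents_per_month * hours_saved_per_incident
    cost_per_incident = outage_minutes * cost_per_minute
    monthly_regained = incidents_per_month * cost_per_incident
    return {
        "hours_back_per_month": hours_back,
        "cost_per_incident": cost_per_incident,
        "regained_per_month": monthly_regained,
        "regained_per_year": monthly_regained * 12,
    }


# Figures quoted in the meeting: 20 incidents a month, 7 of the 8 hours
# per incident saved, 20-minute outages at roughly $1,000 per minute.
result = estimate_savings(
    incidents_per_month=20,
    hours_saved_per_incident=7,
    outage_minutes=20,
    cost_per_minute=1_000,
)

for name, value in result.items():
    print(f"{name}: {value:,}")
```

Changing `hours_saved_per_incident` to 8 gives the 160-hour best case Dinesh mentions first.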
“So, what happens if we extrapolate that number across the whole of C&J’s?” asked Lucia. I could almost see her frontal lobe glowing as she started to make the calculations. Fortunately, we’d already done them for her.
“We have made some assumptions. We’ve applied the same sort of proportional reductions across the business, although teams’ capabilities do vary. For example, some have a lower change fail rate, but they also have a much lower deployment frequency – something they’ll be trying to address as part of the digital transformation in order to continue to compete. But here are the headline numbers as we see them.”
“500 teams and ten thousand people, but you’ve said half the teams are platform teams and haven’t associated a transaction regain number with them. What does that mean?” asked Lucia.
“As part of the digital transformation and DevOps journey,” Charlie replied, “we are identifying which teams are value streams, so revenue-generating, and which are supporting them – they’re the platform teams.”
“Oh lord!” said Lucia. “Not another reorganization?” CFOs, in my observations over the years, aren’t big fans of big-bang transformations and the new organizational designs that typically come with them. The negative impact on productivity can hit the bottom line pretty hard.
“We are aiming for a much more evolutionary and incremental approach. But as you can see from Dinesh and James’ numbers, we’re thinking this improvement could still represent up to $1.2 billion not lost in transactions and it looks like a further 840 not wasted looking for a needle in a haystack.”
“Those are some big numbers,” Lucia acknowledged.
“And we haven’t even talked about what the teams can do with that time when they’ve reduced their MTTR. Those are just the hard savings. We’ll have happier people and delighted customers, so more of them buying more. Our teams will find themselves in a virtuous circle where they can spend the time they’ve saved making more system improvements for customer experience, and the tech can help them find those improvements.”
“Where do I sign?” laughed Lucia. Charlie winked at me.
About the author
Helen Beal is a DevOps and Ways of Working coach, Chief Ambassador at DevOps Institute and an Ambassador for the Continuous Delivery Foundation. She provides strategic advisory services to DevOps industry leaders and is an analyst at Accelerated Strategies Group. She hosts the Day-to-Day DevOps webinar series for BrightTalk, speaks regularly on DevOps topics, is a DevOps editor for InfoQ and also writes for a number of other online platforms. Outside of DevOps she is an ecologist and novelist.