A Day in the Life: Sarah the DevOps Engineer and the Beauty of AIOps
Helen Beal | April 8, 2021

This is the fourth in a series of blog posts exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this post, Sarah and company discover how AIOps gives them "the time to save time!"

Back-to-back Zoom calls as usual today, but this next one I’m pretty excited about. It’s three weeks now since that incident, the API change that broke the PayPal connection in Android, and I haven’t seen a lot of our SRE, Dinesh. I have to share him with a few other teams and one of them, Aparna’s cloud team, has been keeping him really busy with a legacy migration into AWS. Next up, though, is an AI masterclass I asked him for. While he’s been largely absent, my team hasn’t actually needed him all that much, but I want an opportunity to pick his brains.

Since he and James put the Moogsoft pilot in, our MTTR has been close to zero. Consequently, we’ve been able to spend all the time we’re no longer losing to unplanned work on paying down technical debt and reducing toil ourselves, like Dinesh suggested. Our product owner is over the moon as we’ve managed to inject more capacity for innovation – we’re taking on more story points per sprint now than ever before. And I’ve had time to tinker about with integrating Moogsoft into the DevOps toolchain, which is partly what this next session is about. I’m also curious to get under the covers of this AI and understand how it’s doing what it’s doing.

“Hey Sarah,” Dinesh greets me, his happy, smiling face on screen and ready to go as soon as I log in. We chat for a few minutes while we wait for Aparna and the rest of her crew to finish whatever call they just had and dribble into the Zoom. I know some people are complaining about Zoom fatigue, but really I’m just super grateful that I get to work from home and not commute. It’s such a gift to be able to get a bunch of amazing people together in an instant, no matter where they are in the world.

“Me and James have a meeting coming up with Charlie next week. We’re pitching him for an enterprise-wide rollout of Moogsoft so this is a bit of a dry run for me today – thanks for asking me to set it up.” Dinesh looks pretty excited about his plan to chat with our CIO.

“Wow,” I respond. “That’s a pretty big deal!”

“I know, right,” he says with his lopsided smile. “It’s fifteen countries, over five hundred teams and close to ten thousand people. It’s not just the software licence fees, but all the time and effort. Do you think you can hit me up with a success story?”

“Of course! Let me post some links in the chat. Here’s the internal wiki page where I wrote up that first incident you and James sorted with us. And here’s an article I posted on our public-facing engineering blog last week. Also, here’s a link to my experiment documentation for integration into the DevOps toolchain.”

“That is so awesome,” he says, beaming. “We’ll be making your local discoveries global improvements in no time! After this, let’s set up some time for just the two of us as I want to know more about the DevOps toolchain plan. Looks like everyone’s here now, so let’s get going.”

Dinesh starts by doing some intros and I learn some new things about him straight off, like he’s studying for a data science PhD in his own time, to add to his existing computer science degree and neuroscience master’s.

“So, enough about me,” he says, “welcome to the first of what I hope will be many of my AI masterclasses! I’m really grateful for you wanting to do this with me, thank you, as I love showing off.” We all laugh with him. “Seriously though, this is a really fun way for me to self-assess how well I’m learning and I’m super geeky about this so I’m hoping I’ll get you as excited about AI as I am and then I’ll have more people to talk with about it.”

His enthusiasm is infectious and I’m even wondering if I could carve some time out to do a PhD too. Data scientists are highly sought after and highly remunerated, after all.

“The great thing about Moogsoft is you totally don’t have to be a data scientist to benefit from all the work the data scientists have already done for you. But it’s fun to know what’s going on under the bonnet. Today, I thought we’d focus on noise reduction. As I’m sure you’ve all experienced, sometimes having so many alerting systems is a real problem – there can be so many false positives and false negatives it can be really hard to get to the crux of what’s going on. Moogsoft reduces this noise by automatically applying statistical calculations and noise-reduction algorithms to that bountiful alert data. And there’s our first AI buzzword: algorithm.”

“Sounds complicated,” says Aparna.

“It does, right?” agrees Dinesh. “But it’s not – it’s just a set of instructions. The algorithms that are reducing noise do quite a few different things. First off, there’s deduplication. That’s nothing new, it’s been around for years, but it’s simple and effective so we start there. Essentially, every time a repeat event is encountered, a counter is incremented on the parent alert, and the repeated event is discarded. So, hundreds of ping fail events collapse to a single alert. With me so far?”

Nods and thumbs up all around. Looks like everyone’s hanging on his every word! I know I am.
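
As an aside, the deduplication Dinesh describes can be sketched in a few lines of Python. This is only an illustration of the general idea, not Moogsoft’s implementation: the `Alert` class, the `deduplicate` helper and the choice of `(source, check)` as the key that decides what counts as a repeat event are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical parent alert that absorbs repeat events."""
    source: str
    check: str
    count: int = 1


def deduplicate(events):
    """Collapse repeated (source, check) events into one alert each."""
    alerts = {}
    for source, check in events:
        key = (source, check)
        if key in alerts:
            # Repeat event: bump the counter on the parent alert
            # and discard the event itself.
            alerts[key].count += 1
        else:
            # First time this event is seen: it becomes the parent alert.
            alerts[key] = Alert(source, check)
    return list(alerts.values())


# Made-up sample data: 300 identical ping failures plus one other event.
events = [("web-01", "ping fail")] * 300 + [("db-01", "disk full")]
alerts = deduplicate(events)
for alert in alerts:
    print(alert.source, alert.check, alert.count)
```

With this made-up sample, three hundred ping fail events end up as a single alert carrying a count of 300, alongside one alert for the unrelated event.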

“What’s next?” I ask.

“Next up is alert correlation, which is all about pattern discovery across technology stacks.”

“So network, server and application?” I ask.

“Yes,” confirms Dinesh. “And in your newer environments and Aparna’s cloud-native pieces, microservices and containers too. And it’s not bamboozled by all these cross-enterprise dependencies we have either.”

“It can help us manage the dependencies while we learn to break them,” I murmur.

“The correlation algorithms weigh multiple factors. The Cookbook algorithm, for example, creates clusters of alerts based on how similar or dissimilar they are. It uses factors like time, class or type, geographic location, topology proximity, and server priority.”
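
As an aside, the idea of clustering alerts by similarity can be illustrated with a toy example. This is not the Cookbook algorithm, whose internals are not described here. It is a simple greedy grouping under stated assumptions: the `Alert` fields, the `similarity` and `correlate` helpers, the weights, the 300-second window and the 0.6 threshold are all made up for this sketch, and it only looks at three of the factors Dinesh lists (time, type and location).

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert record for this sketch."""
    name: str
    timestamp: float  # seconds since some reference point
    kind: str         # class or type, e.g. "network"
    location: str     # geographic location


def similarity(a, b, window=300.0):
    """Score two alerts between 0 and 1 using assumed weights."""
    # Alerts closer together in time score higher; beyond the window, zero.
    time_score = max(0.0, 1.0 - abs(a.timestamp - b.timestamp) / window)
    kind_score = 1.0 if a.kind == b.kind else 0.0
    location_score = 1.0 if a.location == b.location else 0.0
    return 0.4 * time_score + 0.3 * kind_score + 0.3 * location_score


def correlate(alerts, threshold=0.6):
    """Greedily add each alert to the first cluster whose seed is similar enough."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for cluster in clusters:
            if similarity(cluster[0], alert) >= threshold:
                cluster.append(alert)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([alert])
    return clusters


# Made-up sample alerts.
alerts = [
    Alert("switch down", 0, "network", "london"),
    Alert("ping fail", 30, "network", "london"),
    Alert("api timeout", 60, "application", "london"),
    Alert("disk full", 4000, "server", "tokyo"),
]
clusters = correlate(alerts)
for cluster in clusters:
    print([alert.name for alert in cluster])
```

In this sample the three London alerts that fire within a minute of each other land in one cluster, even though they span network and application layers, while the much later Tokyo alert stays on its own. Raising the threshold or changing the weights changes how aggressively alerts are grouped.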

“So it zips up alerts into batches?” Aparna asks.

“Yeah, kind of. But as well as vastly reducing the number of tickets you’ll get, the correlations also teach the system to identify where the root cause, or causes, are likely to be and make recommendations on how to fix them.”

“So we can automate those fixes!” I’m getting excited. This is sounding like the closest I’ve been able to get to self-healing systems in my career.

“Totally!” says Dinesh, nodding frantically. “And this is also helpful when we get to higher levels of capability and we’re not just trying to fix immediate, urgent issues, but also looking for other improvement opportunities.”

“Imagine that!” says Aparna. “Having time to save time!”

“I know, right,” smiles Dinesh. “But that’s what happens when your MTTR is close to nil, and you’re barely having any change failures or outages anyway. It’s a self-fulfilling prophecy.”

“What else does it do?” I ask. I’m getting seriously hooked on this stuff.

“Check out this visualization,” Dinesh says, sharing his screen. Of course, I’ve seen this before but it’s interesting to see the ‘wow-moment’ on everyone else’s faces that must have been on mine just a couple of weeks back. “And you can fine-tune this too.”

“Oh,” I say, “that’s really useful. When I’m thinking about how to integrate this into the DevOps toolchain, and how we want to shift all testing left and do it as early as possible, I think what you’re saying is that the system will know where it is in the route to live and how important that stage or gate is to us. Can we fine-tune it if we really want to amp up building that quality in early?”

“For sure,” says Dinesh. “Let’s look at that before my meeting with Charlie. See you all at the next masterclass?”

“For sure!” the team echoes.

