This is the second in a series of blog posts exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this post, meet our clever ITOps engineer, James, as he reduces noise and distraction using intelligent observability.
Another day, another Sev 1. Something feels different today, though… It’s not just that it’s my twenty-year anniversary at C&J’s, although that in itself is a bit of an achievement, if I do say so myself. Some of my mates laugh at me for still working at the same company I joined fresh out of college with a mechanical engineering degree, but, as I tell them, it’s never been boring. C&J’s is one of those vast global conglomerates nobody really knows about, but half the stuff in your kitchen and bathroom cupboards is made by us. A couple of brothers, Connor and John, started a grocery cart in Belgium in the mid-1800s and things kind of snowballed from there. When I joined, as a noob on the support desk, the internet revolution was just starting and people were still giving Amazon a hard time for not turning a profit.
Talking of Amazon, I’ve actually come into the office to see for myself the impact of our cloud migration. There’s hardly anyone here, of course: social distancing, masks on, hand sanitizer everywhere and all that. Our data center is not yet visibly shrinking, but I’m here to decommission the first machine.
I’d made it all the way to lunchtime without any unplanned work – a miracle in itself. In the old days I’d’ve hoped to be taken out for a Texan barbecue or something fun in Austin, where our HQ is. I still miss home, England, and our world-beating pubs and warm beer, but they’re not open anyway. So my celebration lunch is a Sev 1 and a pizza I called in.
So, the Sev 1. It’s been a roller-coaster ride during my two decades – most recently I’ve been working with the cloud infrastructure team we formed about six months ago. Aparna leads it – she’s been here nearly as long as me – she joined me on the support desk back in the day. It was all AS/400s then. We still have a few of them but they’re all virtualized. She’s a bit of a force, Aparna. Pushy, some say. I admire her for her vision and tenacity. The respect is mutual – we still always make sure our teams are connected and we get to work together. She got an AWS architect certification in her own time, then petitioned the CTO to set up a team and then hired in four new cloud engineers. She wants to cross-train the department but they can never find the time to learn – never the time to save time. They’re too busy sprinting to stay still; trying to change the wheels of the car while we’re driving is tough.
BUT! She has just completed the migration of her first application into AWS. The machine I’m here to decommission is the one it used to run on. It’s an inventory management application that one of our smaller subsidiaries uses. She’s set up a Zoom call so I join.
On the screen is a sea of alerts. As usual I have a ton of emails too from all our different tools. My own screens are going crazy but I can’t see what’s going on. It’s like looking for a needle in a haystack. Alert fatigue is a real thing here. I can feel my eyes glazing over and my brain going numb.
“Hey, James, thanks for joining us,” Aparna greets me as my video pops up.
“No problem,” I say, giving her a wave. “Awesome, you’re here too, Dinesh,” I greet our SRE happily – this is a bit of a dream team moment we’ve got going on here. Aparna’s whole cloud crew is on the call too. “What’s going on?” I ask.
“It looks like transactions are failing from the inventory app to the on-prem database,” says Aparna.
“Pushed any changes through, lately?” I ask, tongue-in-cheek. It’s a bit like asking a user to turn their machine off and on again. The tech equivalent of “take an aspirin”. Always start with the basics.
“Nope,” says Aparna. “We’re giving it some time to bed in. Maybe there’s a problem with the database… I’ve got so many alerts from the monitoring systems – there’s so much noise I really can’t see the wood for the trees. I can’t believe we’ve broken it already. This does not bode well for our next experiment.”
“Dinesh, did you…?”
“I did, James. I did.” Dinesh is grinning and has answered my question before I’ve had a chance to finish asking it. Last week, we had an incident with one of our more recent acquisitions, Animapanions, and got to road-test our new toy – our AIOps and observability tool, Moogsoft. We resolved the incident in record time and got back to something more interesting instead. After that, Dinesh had said he was going to set it up for this app for the next experiment. So glad he’d managed to get to it.
“Can I take over the screen?” Dinesh asked Aparna and then threw up the Moogsoft view of the incident. It correlated alerts from the database, the CI/CD toolchain, the cloud app and infrastructure. And the network.
“Interesting,” I said. I could see the problem.
“What is this? What am I seeing here?” asked Aparna.
Dinesh explained: “This is some AI looking for patterns in the alert streams we’re getting.” Then he started singing, “I can see clearly now…”
“It’s not us, is it? It’s the network. I can see it too – that spike on the CPU and that Kubernetes container and all the network resets. Wow – that’s amazing. Talk about noise reduction – it’s cut right to the chase. Let’s get the NOC on the call.”
In less than five minutes, they were there too. Ten minutes after that, the network problem was resolved. Two minutes after that, we had another problem – another massive CPU spike as we hit the ‘thundering herd’ condition when all the failed transactions retried and succeeded. Aparna managed that via AWS and our database server took it – then the alert noise faded, and the system purred again. Our work here was done. I saluted the team and left the Zoom, hearing the familiar clack-clack of Slack as the screen closed.
Dinesh Soni 12:34 PM
That was pretty awesome fun, mate! You got a minute to look at something else with me? There’s something up with MQ…
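A quick aside on that ‘thundering herd’: the story doesn’t say how the failed transactions were retried, but a common way to soften the spike is to have each client wait a random, growing interval before trying again (exponential backoff with “full jitter”). The sketch below is illustrative only and is not how C&J’s or Moogsoft does it; the function names `backoff_ceiling` and `retry_with_jitter`, and the `base` and `cap` values, are assumptions made up for this example.

```python
import random
import time


def backoff_ceiling(attempt, base=0.5, cap=30.0):
    """Upper bound in seconds for the wait before retry `attempt` (0-based).

    The bound doubles with every attempt and is clamped at `cap`.
    Both defaults are illustrative assumptions, not recommended values.
    """
    return min(cap, base * (2 ** attempt))


def retry_with_jitter(operation, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random):
    """Call `operation` until it succeeds or `max_attempts` is used up.

    After each failure, wait a random time drawn uniformly from
    [0, backoff_ceiling(attempt)] so that clients which failed at the
    same moment spread out instead of all retrying in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(rng.uniform(0.0, backoff_ceiling(attempt, base, cap)))
```

A caller would wrap the failing transaction, for example `retry_with_jitter(lambda: submit_order(order))`, where `submit_order` stands in for whatever call talks to the database. The `sleep` and `rng` parameters exist only so the waits can be swapped out in tests.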
Find out what happens next with Dinesh in the next episode, ‘A Day in the Life: Intelligent Observability at Work with a Super SRE.’
Want to Learn More?
We’ll unpack James’ challenges in a live webinar with Helen Beal and Moogsoft CTO/CISO, Dave Casper, on March 9. Learn how ITOps heroes like him use intelligent observability to cut through the noise of complex systems so they can develop more and operate less.
About the author
Helen Beal is a DevOps and Ways of Working coach, Chief Ambassador at DevOps Institute and an Ambassador for the Continuous Delivery Foundation. She provides strategic advisory services to DevOps industry leaders and is an analyst at Accelerated Strategies Group. She hosts the Day-to-Day DevOps webinar series for BrightTalk, speaks regularly on DevOps topics, is a DevOps editor for InfoQ and also writes for a number of other online platforms. Outside of DevOps she is an ecologist and novelist.