This is the first in a series of blog posts exploring the role that intelligent observability plays in the day-to-day life of smart teams. In our first post, meet a stellar DevOps engineer named Sarah as she discovers how to tackle anomaly detection using intelligent observability.
It’s Thursday morning. I’ve done some yoga and a ten-minute meditation, and by 8:30 AM I’m at my desk in my hastily thrown-up garden office with a mug of green tea. I’m really not missing the commute to our old HQ in the heart of Seattle (now permanently closed, thanks to the pandemic), and I’m enjoying an extra few minutes in bed and getting mindful before logging in.
I start by checking Slack. The company I work for, Animapanions, is an online pet store. We were acquired by a global conglomerate headquartered in Texas a year back. We continue to run most of our own systems, but our warehousing and delivery are now controlled by our owning company. We process several thousand orders a day, shipping a few million dollars’ worth of product each month.
When I logged off last night, all was looking good. Since the takeover, though, we’ve started shipping internationally and we run 24/7, so anything can happen while I’m catching my zzz’s. While I’m recharging, our teams in Manila (where the owning company’s central NOC is) take over, so that we can follow the sun.
I see an invite to a new Slack channel and catch my breath, as this often means we’ve got a problem. Everything’s been strangely quiet since a new AI-driven observability platform was put in place a couple of weeks ago. It’s been great to get to some of the rapidly aging improvement backlog. I’d started to wonder whether we used to spend an unnecessary amount of time chasing our tails, but I’d also been worrying that this was some kind of calm before a storm…
I sip my tea and dive in. The new Slack channel is called ‘Moogsoft Incidents.’ There’s a message from the IT Ops guy aligned to our product team, based out of the Texas HQ:
James Parker 08:02 AM
An anomaly has been detected in the Animapanions customer journey – checkout. @Sarah Edwards, please can you take a look and let me know what you think? The app’s still up atm…
I’m the Sarah that James is tagging, and there’s a link to an incident which has created a Jira ticket. First, I want to see who else is in this channel. I’m a DevOps Engineer and I’ve been on the team since we created the original app. I was involved in the shopping basket and checkout dev, so it makes sense that James is asking me to look into this. The problem is that we’ve always operated with a “we build it, we run it” mentality, so it puts me slightly on edge to have someone I barely know effectively looking over my shoulder. I grudgingly thank him in my mind, though, for alerting me to what might be a problem we need to deal with.
Also in the channel is a support engineer from the Manila NOC and support center, Lamar Ramos. But nobody else from my team. I bring Dinesh, our SRE, into the Slack channel too and make sure he’s got access to the Jira ticket. He’s new to the team and doesn’t know the code like I do, but he’s a skilled programmer and he’ll want to know about this as it could be impacting our SLOs. Plus, it was he and James who implemented the AIOps tool.
I take a look at the incident, digging out from the depths of my memory the short demo of this tool James and Dinesh did for the team a month or so ago. What I see in there makes me sit up straight and my brain starts to whirr.
Sarah Edwards 08:23 AM
Thanks @James Parker – I’ve taken a look and @Dinesh Soni is also here. What I think I can see is that Moogsoft’s found an anomaly on the new Samsung Galaxy S21 11.0 Android browser that was released on Monday…
James Parker 08:25 AM
Yep – it looks like a few customers haven’t been able to complete checkout – any clue what’s going on?
Sarah Edwards 08:27 AM
I’ll spin up a test in Applitools. @Dinesh Soni – you want to jump on a screenshare?
What I’d seen, when I’d followed the link to the incident, was a number of alerts coming in from New Relic, some Splunk log files and some data from Prometheus. We undoubtedly had a problem, but where? Dinesh and I did some triage with Applitools but couldn’t find anything in the UI, so we went back to the Moogsoft incident. Moogsoft showed us that the transactions were breaking at the handoff to PayPal, while the handoff to Apple Pay looked OK. Was PayPal down? It couldn’t be, because the transactions were working just fine in the other browsers…
Back in the Slack channel, I queried our CI server to see what changes we’d pushed live yesterday. There were a couple of small features from two of our US-based developers, Priti and Sean. I invited them into the channel too.
It turned out that Sean’s change had included an API update that broke the connection to PayPal in the new version of the Android browser used by that particular device. He made a quick code change and pushed it through the CI/CD pipeline while we all watched its progress through the Slack-based ChatOps platform. I brewed another cup of tea while we waited for the tests to pass, and then he manually pushed it live. The Moogsoft alert switched off and we saw the transactions going back to normal.
Sarah Edwards 09:38 AM
And we’re back! @Sean Davis you’re a superstar. Thanks so much for fixing that so fast. Quick retro everyone? Then we’ll attach this conversation record to the Jira ticket Moogsoft created
James Parker 09:39 AM
Amaze. I’m pretty happy with how that went.
Sean Davis 09:41 AM
Me too. Sorry for the issue, guys – not sure I could have anticipated it, but I’ll work with the testers, probably Ling, to see if we can add some automated integration tests to the CI/CD pipeline to pick these up. I have to say, I feel like we would have been thrashing about in the dark trying to figure out what was going on without that intel from Moogsoft. Thanks for setting that up, guys.
Dinesh Soni 09:45 AM
Yep. I’ll second that. Looks like the Moogsoft experiment’s working in line with our hypothesis. Time to take it to the next level, @James Parker?
Find out what happens next with James and Dinesh in the next episode, ‘A Day in the Life: Intelligent Observability at Work with an ITOps Hero’.
Want to Learn More?
We’ll unpack Sarah’s challenges in a live webinar with Helen Beal and Moogsoft Chief Evangelist Richard Whitehead on February 25. Learn how DevOps pros, like Sarah, use intelligent observability to overcome the noise of complex systems, so they can develop more and operate less.
About the author
Helen Beal is a DevOps and Ways of Working coach, Chief Ambassador at DevOps Institute and an Ambassador for the Continuous Delivery Foundation. She provides strategic advisory services to DevOps industry leaders and is an analyst at Accelerated Strategies Group. She hosts the Day-to-Day DevOps webinar series for BrightTalk, speaks regularly on DevOps topics, is a DevOps editor for InfoQ and also writes for a number of other online platforms. Outside of DevOps she is an ecologist and novelist.