A Day in the Life: Intelligent Observability at Work with a Super SRE
Helen Beal | March 23, 2021

This is the third in a series of blog posts exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this post, meet our super SRE, Dinesh, as he seeks to eliminate toil using intelligent observability.

After we’d fixed Aparna’s network issue, James came to see me at my desk. Masks on, socially distanced and all that, but it was nice to have some face-to-face time. James is cool – that dry British humor and not your classic IT Ops dude. He’s been here forever and mentored me when the CIO, Charlie, hired me as the first SRE here a year or so ago. I lucked out really. It was good working at my previous company, and great that I was able to upskill from sysadmin to SRE, but, frankly, they were a bit of a mess. I wasn’t really able to do my job properly as nobody ever made the time to save time. There was too much pressure from the business to get new features out and never space to work on antifragility. They were old school and I was worried that C&J were going to be the same – they have been around for over a hundred years after all. But Charlie seemed to have the right vision for their digital transformation and the package was good so…

“What’s up with the old MQ?” James asked as he sat down.

“These legacy systems…”

“Cherished,” corrected James with a wry smile.

“Take a look at this,” I said, pointing at the screen with Moogsoft displayed on it. When I said that James isn’t your typical IT Ops guy, this is an example of what I meant. The IT Ops team I worked with before was so busy with unplanned work and hair-on-fire moments that we were just lurching from one disaster to another, almost exclusively caused by the huge number of changes development was pushing through to us. Not only was the quality of the code not great, the technical debt was enormous and growing every time a new change came through. To be fair to the guys, I think they were just burned out. The staff turnover was massive and new people would arrive bright-eyed and bushy-tailed but, give it a couple of months, and they’d look exhausted and frustrated too.

It’s different here. There’s still massive amounts of technical debt, don’t get me wrong, but these guys can see it. And they’ve got control over the work coming through – the development teams are working with the IT Ops teams to visualise the flow. And James doesn’t get in a flap when things go wrong – he’s really calm. And he fights to invest in things that’ll make things happen faster. That’s why he sponsored me to the CIO to pilot Moogsoft.

 

While we were on that call with Aparna, I saw another alert relating to the message queuing – an ancient but, to be fair, pretty bulletproof tech that connects our inventory and ecommerce systems. Actually it connects most of our systems. The monoliths anyway. I was chatting with Jorge last week – he’s the chief architect here. He was telling me about his master plan to break down the monoliths using the strangler pattern. He thinks he can break all these dependencies. Goodbye CABs and hello team autonomy. Hallelujah! In the meantime though, we still need to manage this.

So Moogsoft has this thing where it uses time series components, things like periodicity models and data seasonality. You know, like Friday nights are a spike for Netflix. And Black Friday’s really busy for retailers. But there’s quite a lot of subtlety to it and this is why we often get so many alerts that don’t mean anything – there are natural spikes and waves in performance but most monitoring systems aren’t clever enough to get it. Moogsoft compares a sequence of events that’s occurring to a sequence that happened in the past that was bad. This pattern analysis gives us an early warning, pre-empting failure, if you like. It’s not prediction – that’s impossible – and if we were able to predict, that would show us we weren’t doing a very good job since, if we could predict it, we should have already fixed it.
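
To make the seasonality idea concrete, here is a minimal sketch of a dynamic threshold. It is an illustration only, not Moogsoft’s actual algorithm: the function names, the hour-of-week bucketing and the `k` multiplier are all assumptions made for this example. It learns a separate "normal" for each hour of the week, so a Friday-night spike is judged against other Friday nights rather than against a quiet Tuesday morning.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_seasonal_baseline(history):
    """Build a per-hour-of-week baseline from past samples.

    history: iterable of (hour_of_week, value) pairs, hour_of_week in 0..167.
    Returns {hour_of_week: (mean, standard deviation)}.
    Hypothetical helper for illustration only.
    """
    buckets = defaultdict(list)
    for hour_of_week, value in history:
        buckets[hour_of_week].append(value)
    # stdev needs at least two samples, so skip sparse buckets
    return {h: (mean(v), stdev(v)) for h, v in buckets.items() if len(v) >= 2}


def breaches_dynamic_threshold(baseline, hour_of_week, value, k=3.0):
    """True if value is more than k standard deviations from that hour's norm."""
    if hour_of_week not in baseline:
        return False  # no history for this hour, so stay quiet rather than guess
    mu, sigma = baseline[hour_of_week]
    return abs(value - mu) > k * sigma


# Friday evening (hour 116) is normally busy; Tuesday morning (hour 33) is not.
history = [(116, 900), (116, 1000), (116, 1100), (33, 90), (33, 100), (33, 110)]
baseline = build_seasonal_baseline(history)
friday_alert = breaches_dynamic_threshold(baseline, 116, 1000)
tuesday_alert = breaches_dynamic_threshold(baseline, 33, 1000)
```

With this toy data, a queue depth of 1000 is ordinary on Friday evening but far outside the norm on Tuesday morning, which is the kind of distinction a fixed threshold cannot make.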

 

“The dynamic threshold’s being breached,” said James.

“Yeah,” I said. “I think we’re about to have a wobble.”

I’d barely finished my sentence when my other screen, which displays a dashboard I’d created to bring together the multitude of alerting tools that have been collected over the years, went bonkers.

“I’ll just do a restart.” I hit the keyboard and got on with it. One of the great things about MQ is that it has assured delivery. You can do things like this and not worry that you’ve lost any transactions. It remembers where it got to and will start where it left off. The restart worked.
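
The "remembers where it got to" behaviour can be pictured with a toy model. This is a simplified sketch, not how any real MQ product is implemented: `ToyQueue` and the `disk` dictionary standing in for persistent storage are hypothetical names for this example. The point is that both the messages and the acknowledgement position live in storage that survives a restart, so a fresh process resumes at the first unacknowledged message.

```python
class ToyQueue:
    """Toy queue whose state lives entirely in a 'disk' dict (hypothetical)."""

    def __init__(self, disk):
        self._disk = disk  # stands in for storage that survives a restart
        self._disk.setdefault("log", [])    # every message ever accepted
        self._disk.setdefault("acked", 0)   # index of next unacknowledged message

    def put(self, message):
        self._disk["log"].append(message)

    def peek(self):
        """Return the next unacknowledged message, or None if caught up."""
        i = self._disk["acked"]
        log = self._disk["log"]
        return log[i] if i < len(log) else None

    def ack(self):
        """Mark the current message as processed."""
        if self._disk["acked"] < len(self._disk["log"]):
            self._disk["acked"] += 1


disk = {}
queue = ToyQueue(disk)
for order in ["order-1", "order-2", "order-3"]:
    queue.put(order)
queue.ack()               # order-1 has been processed

queue = ToyQueue(disk)    # simulate a restart: new process, same storage
resumed_at = queue.peek()
```

After the simulated restart the queue picks up at the second message, with nothing lost and nothing replayed.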

“Ok, that’s great. It’s getting a little flaky in its old age,” said James.

“Yeah. Jorge’s thinking of using a new cloud or open source MQ platform in his new microservices empire,” I said.

“Good call.” And this is another example of how awesome James is. In my old place, the IT Ops guy would have walked away at this point. Fire out, job done. But James isn’t done yet. “You had to do that a lot lately?”

“A couple of times last week.”

“And we don’t know what’s causing it?”

“I have a couple of theories but it’s going to take me a bit longer to figure it out. I was getting close to the end of my experiment just then and then it fell over so I’ll need to start the test again. In the meantime, I’m thinking we could automate a fix. Restarting is boring toil.”

“That is the rule of three, after all. Done it three times, time to automate. What are you thinking of doing? Restart or redeploy or raise the threshold?”

“Automated restart. Sounds a little scary but we’re getting this early warning so I should know when it’s about to happen. It only takes the system down, and it’s going down anyway, for a few minutes and it’ll catch up where it left off. The impact on the users should be minimal.”
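
An automated restart like this could be sketched as a small guard that reacts to the early warning. This is an assumption-laden illustration rather than the automation actually built at C&J: `RestartGuard`, the `restart` callable and the 15-minute cooldown are hypothetical. The cooldown is there so that a burst of warnings triggers one restart instead of a restart loop.

```python
import time


class RestartGuard:
    """Run a restart action on early warning, at most once per cooldown window."""

    def __init__(self, restart, cooldown_seconds=900, clock=time.monotonic):
        self._restart = restart            # callable that performs the restart
        self._cooldown = cooldown_seconds  # minimum gap between restarts
        self._clock = clock                # injectable so it can be tested
        self._last_restart = None

    def on_early_warning(self):
        """Return True if a restart was triggered, False if suppressed."""
        now = self._clock()
        recently = (
            self._last_restart is not None
            and now - self._last_restart < self._cooldown
        )
        if recently:
            return False  # already restarted recently; leave it for a human
        self._restart()
        self._last_restart = now
        return True


# Usage with a fake clock and a stand-in restart action.
fake_now = [0.0]
restarts = []
guard = RestartGuard(lambda: restarts.append(fake_now[0]),
                     cooldown_seconds=900, clock=lambda: fake_now[0])

first = guard.on_early_warning()    # restart happens
fake_now[0] = 60.0
second = guard.on_early_warning()   # suppressed: still inside the cooldown
fake_now[0] = 1000.0
third = guard.on_early_warning()    # cooldown has passed, restart again
```

Repeated restarts within a short period are themselves a signal, so in practice a suppressed warning is a good point to page someone rather than stay silent.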

“Good call,” James nodded. And then took things another step further. “What do you think about making MQ a candidate for cloud migration with Aparna? Maybe we could get Jorge involved too and see if we can try switching some out for a newer alternative. Another little experiment.”

“I like it,” I said, grinning under my mask. I could have fixed this on my own, but it turns out sharing and collaborating has all sorts of benefits.

Find out what happens next with our team at C&J in the next episode, ‘A Day in the Life: Intelligent Observability and Cloud Migration’.


Want to learn more?

We’ll unpack Dinesh’s challenges in a live webinar with Helen Beal and Moogsoft Director of SRE Thom Duran on March 25. Learn how SREs like him use intelligent observability to overcome the noise of complex systems, develop more and operate less.

Moogsoft is the AI-driven observability leader that provides intelligent monitoring solutions for smart DevOps. Moogsoft delivers the most advanced cloud-native, self-service platform for software engineers, developers and operators to instantly see everything, know what’s wrong and fix things faster.

About the author

Helen Beal

Helen Beal is a DevOps and Ways of Working coach, Chief Ambassador at DevOps Institute and an Ambassador for the Continuous Delivery Foundation. She provides strategic advisory services to DevOps industry leaders and is an analyst at Accelerated Strategies Group. She hosts the Day-to-Day DevOps webinar series for BrightTalk, speaks regularly on DevOps topics, is a DevOps editor for InfoQ and also writes for a number of other online platforms. Outside of DevOps she is an ecologist and novelist.
