This is the sixth in a series of blog posts exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this chapter, our SRE, Dinesh, speaks with Datadog at AICon about how to supercharge triage to reduce MTTR and increase system stability. Read all Observability Odyssey Chapters by Helen Beal.
When I asked Charlie for permission to attend this year’s AICon (virtual, natch) I thought it would be a shoo-in; learning’s part of my OKRs after all. But he never makes things easy and his ‘yes’ came with a caveat that’s typical when dealing with him. This time, he claimed he didn’t have the budget for the ticket (a likely story!) and I’d have to find another way to get one.
I wondered if any of our vendors that were sponsoring the event had extra tickets, so I reached out to my pal at Datadog, Morgan, to ask if they had one going spare. They said no, but they did have a speaker slot and if I joined them as a speaker, I’d get a speaker pass.
Now, I don’t do a lot of public speaking and, like most people, the thought of it makes my knees knock a little bit. But I really, really wanted to be at this conference, what with my data science Ph.D. on the side. And the keynote was one of my AI heroes, Captain Michael Kanaan - only the AI and ML lead at the U.S. Air Force. In another life, I’d have been a fighter pilot. Top Gun’s my favorite film. Anyway.
Those pesky OKRs also had a bit about publicly sharing our experiences. I’d asked Charlie about that when he asked us all to put that in there. I’d also noted that giving us OKRs is hardly in the spirit of autonomy and empowerment - but he said it’s a collaborative process and this had come from the very top - the CEO no less! I’m all about sharing, but in my experience, this has mainly been with our colleagues - not washing our dirty laundry in public. And while we talk about sharing success stories, what makes a good story is peril, suspense, and redemption. The kind of ‘all fine here’ image we generally want to show as a listed company does not make a good story.
Charlie said that the CEO had heard that transparency is all the rage. And that being open was good for our image (as long as we resolved any issues in the right way and it wasn’t just moaning!) and the HR, sorry, people team were pushing him as it’s seen as a great way of attracting new talent to our business. The engineering blog’s also been stood up for this purpose.
So I’ve been working on my presentation skills. I’m pretty nifty with slides as it happens so I focused a bit more on my speaking skills. And the whole being on stage thing. But it’s not really a stage, it’s Zoom and I’m on that all the time so how hard can this be? I’ve been watching a lot of Toastmasters on YouTube and Morgan and I built a slide deck together. We’re on in one minute.
And we’re live! Morgan does the intros - they’re a product manager at Datadog and we’ve titled our talk “SuperCanineCowMan: Integrating Datadog and Moogsoft for SRE Heroes”. And I’m up. I definitely got an adrenalin buzz but I’m on familiar ground so I feel confident as I set the scene.
“I’ve been using Datadog for around five years now, in three different roles, so I was really happy to find it already in place when I arrived at C&Js. I’ve been loving the marketplace since they launched it last summer. It’s been a great way to find out about amazing tools that integrate with Datadog, and it’s been exciting watching new additions pop up in there, seemingly every day.”
I handed over to Morgan here and they gave a potted history of the marketplace and all that. Then: “Today we’re going to focus on the Moogsoft integration though and Dinesh wants to share some stories with you about what’s been going on at C&Js.”
“I sure do!” I said as I took back control of the slides. “Observability is a hot topic for me as an SRE, and Datadog’s insights have got me out of heaps of trouble in the past. But we all want to be faster. It’s uncomfortable triaging incidents when the clock’s ticking and we all hate letting our customers down. I’m interested in anything that can get me closer to guaranteeing a sublime and delightful experience.” I saw myself smirking a bit in the video at that. Possibly the downside to not having to actually stand on an actual stage - having to watch your own face. I do want our customers to be having a lovely time for sure, it’s just sometimes the marketing speak gets a bit much for me.
“Also, I’m really interested in AI. Right, you all get that.” My attempt at a joke. We’re at AICon! We’re all in this together. Morgan chuckles so that helps a bit with the lack of real-time audience feedback.
“We’ve got a ton of monitoring tools at C&Js, not just my favorite, Datadog,” I say with a wink and Morgan chuckles again, encouragingly. “I counted them all last week and we currently have 24 different monitoring tools that I could find - monitoring all sorts of different things of course, but that’s a hella lot of data. Of course I want to aim for zero downtime and zero incidents, I’m an SRE after all, but I also totally support my compadres who want to pump enhancements through the system to get those value outcomes realized by our amazing customer base. And the reality of that is that things do go wrong.
“Our DevOps experts are doing all they can to catch and fix problems early, and I know they never let a known defect downstream, but this is about unknown unknowns. We also practice limited blast radius techniques like canary release but bad things do happen, even to good people.
“Combining Datadog and Moogsoft creates SuperCanineCowMan and now I can be an SRE hero. Thanks Morgan! Using Moogsoft’s algorithms in conjunction with Datadog’s insights automates the identification of critical, actionable data. We also ingest data from some of those other monitoring systems and enrich that too. Then meaningful alerts are correlated into context-rich incidents. I can then quickly resolve them from within the Datadog Incident Dashboard. Let me show you something that happened recently with our cloud migration team.”
I did a live demo and it worked just fine! Morgan had warned me that live demos are notorious for going wrong, and I’d seen enough to know they weren’t just being super nervous. We had a ton of questions from the audience on the event chat channel straight after and made some new friends. I got answers to some questions I’d been struggling with from one of the other speakers, and I got to see the keynote at the end of the conference with Captain Kanaan and won a copy of his book, ‘T-Minus AI’. A bunch of people connected with me on LinkedIn - we’re interviewing one next week for Aparna’s team! All in all, a good day. And I got to check off an OKR. Charlie was pleased.
Want to learn more?
Register and attend the live webinar Intelligent Observability: Making Toil a Thing of the Past this Thursday, May 20 at 9am PT | 12pm ET | 5pm BST.
About the author
Helen Beal is a DevOps and Ways of Working coach, Chief Ambassador at DevOps Institute and an Ambassador for the Continuous Delivery Foundation. She provides strategic advisory services to DevOps industry leaders and is an analyst at Accelerated Strategies Group. She hosts the Day-to-Day DevOps webinar series for BrightTalk, speaks regularly on DevOps topics, is a DevOps editor for InfoQ and also writes for a number of other online platforms. Outside of DevOps she is an ecologist and novelist.