Coffee Break Webinar Series: Intelligent Observability for SRE
David Conner | March 30, 2021

A selection of live questions and answers from the audience of our recent webinar on how site reliability engineers can best leverage intelligent observability to monitor SLIs and SLOs, prioritize reliability over functionality, and more.

Toil and trouble go together like peanut butter and jelly but are not nearly as sweet. The number of tools available to fight them can be overwhelming, so SREs need to pick their fire-fighting equipment carefully. A recent webinar featuring DevOps Institute Chief Ambassador Helen Beal and Moogsoft Director of SRE Thom Duran explored why intelligent monitoring is one of these critical tools.

The two examined the role of intelligent observability in monitoring SLIs and SLOs, automation and the Rule of Three, and more in an hour-long session focused on helping SREs improve the reliability of their apps and their ability to build customer value, all while maintaining healthy ways of working.

The full webinar discussion is available on demand. Below is a selection of live Q&A from the episode.


Q: How do you suggest approaching manual work that consistently repeats but requires a different approach each time to resolve it? An example would be migrating code from legacy environments.

Duran: If you find that there’s manual work, then keep in mind that the target for an SRE is a 50/50 split between toil and engineering work. I don’t think you’re ever really going to reduce toil to zero, especially when you’re working in environments that are unique or have changes from time to time, such as migrating legacy code. If I look at some of the cloud infrastructure we’ve had in the past with our enterprise product, it could vary dynamically depending on the customer. This meant that when we managed it for them, every build was effectively a snowflake.

So what we did was find the core of what every build has, and we automated that core. Once that core was automated, we started figuring out how we could take the next 20% of customers, or whatever the biggest slice of the pie is for that manual effort, and automate that as well. While it wasn’t 100% automated, I could at least get an environment out with 80% automation. The remaining 20% was a one-off unique to that environment, which had to be done manually or with bash scripts tailored to that customer.

While I think reducing toil by 100% is a great target to go after, I don’t think you’re ever going to get there. We also had to seriously consider why what we were doing was different every time, whether there was a way to find the commonalities and in essence make them the same, and, if there were variables, whether we could codify them.

In summary, look for the core: if you’ve done 10 things 10 different ways but 20% of the work has been the same every time, automate that 20%. If the next slice of the pie is 10% across the board, automate that. You may not get to 100% automation on that task, but any automation is an improvement.
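The approach of finding the shared core first and then the next-biggest slice can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the build records and step names below are hypothetical, and a real inventory of manual steps would come from runbooks or ticket history rather than a hard-coded list.

```python
from collections import Counter


def rank_steps_by_coverage(builds):
    """Return (step, share_of_builds) pairs, most widely shared first."""
    # Count each step once per build, even if a build lists it twice.
    counts = Counter(step for steps in builds for step in set(steps))
    total = len(builds)
    return sorted(
        ((step, n / total) for step, n in counts.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Hypothetical records: the manual steps each customer build required.
builds = [
    ["provision_vm", "install_base", "configure_dns", "custom_firewall"],
    ["provision_vm", "install_base", "configure_dns"],
    ["provision_vm", "install_base", "legacy_db_import"],
    ["provision_vm", "install_base", "configure_dns", "legacy_db_import"],
]

ranked = rank_steps_by_coverage(builds)

# Steps present in every build form the core to automate first.
core = [step for step, share in ranked if share == 1.0]

for step, share in ranked:
    print(f"{step}: needed in {share:.0%} of builds")
```

Steps at the top of the ranking are the candidates for the first round of automation; the long tail at the bottom is where one-off scripts or manual work may remain acceptable.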

Q: SRE teams are generally ignored unless a fire is going on. So how do you show the value of what an SRE is doing when they aren’t being praised for quickly resolving an outage?

Duran: I think you have to show how you prevented outages or how you improved reliability, speed of load times, etc. A lot of people think that reliability means “keeping something up,” but what they often don’t think about is that if a page consistently takes more than a second to load after a user clicks, you will slowly start to see an upward trend in people just leaving your page. So how do you get that below one second? How do you tighten response times? This is a way that you can provide value back to the business and show, “Hey, we cut our load times in half, and look at the Google Analytics trend, which shows that it actually kept people on our site, or in our app, longer.”
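One way to put a number on that kind of improvement is a latency SLI: the share of page loads that finish under a threshold. The sketch below assumes the one-second threshold mentioned above; the function name and the sample load times are hypothetical and stand in for whatever your monitoring or analytics tooling actually records.

```python
def latency_sli(load_times_s, threshold_s=1.0):
    """Fraction of page loads that finished in under threshold_s seconds."""
    if not load_times_s:
        raise ValueError("need at least one load-time sample")
    fast = sum(1 for t in load_times_s if t < threshold_s)
    return fast / len(load_times_s)


# Hypothetical samples (seconds) before and after a tuning effort.
before = [0.8, 1.4, 2.1, 0.9, 1.7, 1.2, 0.6, 1.9]
after = [0.4, 0.7, 1.1, 0.5, 0.9, 0.6, 0.3, 0.9]

print(f"under 1s before: {latency_sli(before):.1%}")
print(f"under 1s after:  {latency_sli(after):.1%}")
```

Tracking this ratio over time gives the business a single trend line to look at alongside engagement data.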

We’re also responsible for CI/CD infrastructure and pipelines, so we make sure that everything is buttoned-up and tuned. You really need to highlight those things. What I’ve found is that if it’s working, nobody is coming around to look under the covers of how it works. But if it's on fire, suddenly everybody is there looking under the covers and pointing fingers. So, show people how things operate normally and how you’ve continued to improve it over time, and really highlight those improvements.

This is especially true around automation. When we first started on the Moogsoft Observability Cloud (MOC), we had a bottleneck where CI/CD was one person making changes to the pipeline as developers needed them. We came in and flipped that on its head and said, “Here’s the standard code and pipelines and libraries and here’s how they interact.” Now developers are in essence owning their own pipelines with our code while we manage the underlying infrastructure. What we’ve effectively done there is remove the blocker that could have delayed a feature by hours or days.

By doing this, you show the business that you’ve enabled the developers to own their destiny, which is very DevOps, and that they can start pushing through features faster. What I showed was a trendline of how much faster we’re pushing out features and changes through dev, through staging and into production, which proved that we can make this better and showed how.

Q: What about SRE for on-prem or legacy systems: Is it doable?

Beal: Of course it is. Yesterday I did a tech talk with Standard Chartered Bank and it was organized by their head of SRE transformation. You can imagine that any global bank like Standard Chartered is going to have loads of legacy, and lots of it is going to be on premises. In fact, I think there’s an argument for even more SRE in those environments because there’s more technical debt and more fragility to be dealt with.

Duran: I came from GoDaddy and we were literally all on-prem, and right about the time I left was when they started looking at AWS. My role there was really focused on being an SRE of monitoring. What we did was we had SRE teams that had focuses, and we also had SREs embedded on each product team. I was responsible for introducing monitoring and tools, and maintaining said tools for all teams.

Because these were all on-prem products, what we did from there was leverage a monitoring tool that would require code to monitor. We created Jenkins pipelines that would wrap that code into an RPM that teams could then blast across all of their systems. We did things like take a bare metal box and carve it up, because we didn’t need all that hardware, and made our own virtual systems using open-source components, then used SaltStack to reach into those systems.

I think what you need to look at when you’re on prem is what utilities are available to you. We’ve been on prem with legacy systems much longer than we’ve been in the cloud. So you need to steal from the historical sysadmins, blow the dust off that book and look at what they were using: SaltStack, Ansible, Puppet. Ansible is something I love because you can get right in if you have SSH access to the system. Puppet is also fantastic because it will run continuously and keep everything in sync.

These are the ways that you can start to introduce the “modern GitOps model” on historical systems by meeting those systems where they are. Make your changes in Git, get it packaged in a way that it can be deployed to those systems and then use your deployment tools, whatever they happen to be, to get those packages out there. 

To be fair, it can be a massive lift for you when it used to be sysadmins doing very manual tasks. It can take some time to introduce this model, but you can certainly do it, and there are definitely ways to automate your way out of toil on prem.

Q: What have you done to assist your team with pager fatigue?

Duran: There are a couple of things that we always try to do. We always have a primary and secondary. With SRE being a mile wide as far as tech goes, I like to make sure that the primary and secondary have different focuses of expertise so that you always have somebody to lean on. Also, if you’re on call as primary, then that is it for you. You are not responsible for sprint work, or anything other than being a goalie in Slack for what comes up from the business, as well as for fighting fires. While that doesn’t fix the anxiety of wondering whether to go out to dinner with or without a laptop, it does help when you’re not trying to balance delivery and incident work.

On top of that, I like to have a culture where people are trading off on-call duties. I think overrides are very important so that one person can say, “I’m going to go out for a couple of hours, can somebody cover me?” This trading off, where people owe each other favors, helps people speak up when they need a break and also builds camaraderie around the team.

Another avenue for reducing pager fatigue is to reduce the number of pages being received. I recall a conversation with a network engineer during an outage who effectively said, “I could fix this a lot faster if I didn’t continue to get paged.” By using MOC, we can leverage correlation to notify the end user once. While the issue is being triaged, additional alerts can be added to the existing incident. This generates only one call-out, even when an incident has fallout beyond the initial page.

By doing this we are reducing the need to constantly be on guard to acknowledge a page for fear of it rolling when you really should be fighting the existing fire. Likewise, you can be assured that the page you are receiving is not related to the page you received 15 or 20 minutes ago.
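The page-once idea can be illustrated with a deliberately naive correlator. This is not how MOC correlates alerts, which the discussion does not describe; it is a minimal sketch that assumes alerts are related when they share a service name and arrive within a fixed window of the incident opening. The class names, the 30-minute window and the sample alerts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    service: str
    opened_at: float
    alerts: list = field(default_factory=list)


class Correlator:
    """Naive stand-in for alert correlation: same service + time window."""

    def __init__(self, window_s=1800):
        self.window_s = window_s
        self.open_incidents = {}  # service name -> latest Incident
        self.pages_sent = []      # one entry per call-out

    def ingest(self, service, message, ts):
        """Record an alert; return True only if it triggers a new page."""
        incident = self.open_incidents.get(service)
        if incident and ts - incident.opened_at <= self.window_s:
            # Related alert: attach it to the open incident, no new page.
            incident.alerts.append(message)
            return False
        # Unrelated or first alert: open an incident and page once.
        self.open_incidents[service] = Incident(service, ts, [message])
        self.pages_sent.append((service, message))
        return True


correlator = Correlator(window_s=1800)
results = [
    correlator.ingest("checkout", "5xx spike", ts=0),
    correlator.ingest("checkout", "latency high", ts=600),
    correlator.ingest("search", "disk full", ts=700),
]
print(results, len(correlator.pages_sent))
```

In this sketch the second checkout alert is folded into the existing incident, so the responder is interrupted once per incident rather than once per alert.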

Q: I am a UX designer and my ears picked up when you said that SRE is an advocate for the customer, for having a good experience. I’m wondering if you could speak to your role as an advocate for customers and some success stories there?

Duran: One thing that we do here at Moogsoft is hold weekly design sessions with our UX team specifically. We drive changes to the look and feel of the app based on our experience. As we will be the ones using it day in and day out, we want to feel at home and to call out anything that doesn’t immediately make sense, or that may give a user who is fresh to the product pause.

The primary way we advocate for customers is by being one ourselves. We are constantly pushing back on the product teams on how things function. For example, if you look at our Create Your Own API, there have been major revisions to simplify that workflow since we started onboarding our own tools with it. These include the ability to iterate through arrays embedded in a JSON body, and conditional mapping that allows you to map fields from a pool rather than one to one. These changes were expressly driven by our internal needs, as well as a need to simplify the process for external customers.
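To make those two capabilities concrete, here is a generic Python sketch of iterating over an array embedded in a JSON body and filling each target field from the first available source field in a pool. It does not use or represent the Create Your Own API; the function names, field names and payload are hypothetical and chosen only to show the idea.

```python
import json


def first_present(record, pool):
    """Return the value of the first field in pool that record carries."""
    for name in pool:
        if record.get(name) not in (None, ""):
            return record[name]
    return None


def map_events(body, array_key, field_pools):
    """Build one mapped event per item of the array found at array_key."""
    payload = json.loads(body)
    return [
        {target: first_present(item, pool) for target, pool in field_pools.items()}
        for item in payload.get(array_key, [])
    ]


# Hypothetical webhook body: two tools naming the same data differently.
body = json.dumps({
    "alerts": [
        {"host": "web-1", "summary": "disk full"},
        {"hostname": "db-2", "msg": "replication lag"},
    ]
})

# Each target field is filled from a pool of candidate source fields.
pools = {"source": ["host", "hostname"], "description": ["summary", "msg"]}

events = map_events(body, "alerts", pools)
print(events)
```

Mapping from a pool means one integration definition can accept payloads from tools that disagree on field names, instead of requiring a separate one-to-one mapping for each.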

Coffee Break with Helen Beal will continue to explore the day-to-day lives of DevOps, ITOps and SRE pros with new sessions bi-weekly through June. The next session (April 8) will focus on more ways intelligent observability impacts the lives of DevOps teams with a look under the covers of AIOps. RSVP now to save your seat.

Moogsoft is the AI-driven observability leader that provides intelligent monitoring solutions for smart DevOps. Moogsoft delivers the most advanced cloud-native, self-service platform for software engineers, developers and operators to instantly see everything, know what’s wrong and fix things faster.

About the author


David Conner

David is Moogsoft's Director, PR and Corporate Communications. He's been helping technology companies tell their stories for 15 years. A former journalist with the Sacramento Bee, David began his career helping the Bee's technology desk understand the rising tide of dot-com PR pitches clouding journalists' view of how the Internet was to transform business. An enterprise technology PR practitioner since his first day in the business, David started his media relations career introducing Oracle's early application servers and developer network to the enterprise market. His experience includes client work with PayPal, Taleo, Nokia, Juniper Networks, Brocade, Trend Micro and VA Linux/OSDN.
