The Business Case for Observability and Site Reliability Engineering
Charles Araujo | January 20, 2022

Unlike traditional IT Ops, the role of the SRE isn’t simply focused on finding and solving technical problems. The big win for today’s SREs is supporting the organization’s strategic innovation initiatives. With the appropriate observability capabilities, it’s possible to quantify the value that software infrastructure contributes to this innovation effort.

Throughout this series, we’ve been exploring the interplay between the discipline of Site Reliability Engineering (SRE), the role of the Site Reliability Engineer (also SRE), and observability. We examined the meaning of adopting an SRE discipline, how observability differs from monitoring, and the role of automation in adopting observability.

But all of that is really a preamble to the heart of the matter: why should SREs adopt observability?

The answer is found in the long, tumultuous history of the IT function. As much as it’s tempting to see cloud-native, DevOps, and observability as wholly new endeavors, the reality is that they are the latest chapter in a story that seems to repeat itself incessantly.

Every so often, IT goes through a period of rapid innovation. And almost as fast, the need for business-critical systems to be reliable, available, and performant begins to tamp down on that innovation — until no innovation is happening at all. And then the cycle starts again.

So, as impressive as the innovations engendered by approaches like cloud-native and DevOps may be, they will be short-lived. The need to maintain reliability, availability, and performance will eventually crush the culture of innovation that organizations seek to sustain.

Unless, that is, you do something to break this cycle.

That something is the reason observability, together with the role of the SRE, represents such a dramatic shift, and such an opportunity, for organizations that seize on it.

Why the Role of the SRE is Different

As we've covered extensively throughout this series, one of the most striking differences between the role of the SRE and the traditional role of IT Ops is the focus on the totality of the service experience, rather than on the mere maintenance of an operational state.

The SRE discipline and role grew out of a recognition that reliability, availability, and performance needed to be managed from a service perspective rather than a systems perspective.

Moreover, the role was made a first-class citizen in the end-to-end continuous integration and deployment process. As a result, SREs are much more comfortable playing an active role throughout the entire application development and deployment lifecycle — and this fact is critical to the essential role that they and observability play in sustaining an organization's innovation culture.

Traditional IT Ops viewed new applications, or changes to existing ones, as an operational burden. They introduced change, and change inevitably undermined IT Ops' ability to maintain a ready state.

While the role of the SRE appears the same on the surface, its strategic and integrated posture fundamentally shifts the focus.

The ability to positively influence reliability, availability, and performance throughout the development process transforms the cultural posture of the entire organization. Moreover, the focus on the total end-user experience, rather than a systems-centric view of the operational state, gives the SRE a different, more strategically aligned viewpoint.

Breaking the Protect-at-all-Costs Paradigm with Observability

This cultural positioning of the SRE offers the possibility of breaking the innovation-crushing cycle of the past.

Whereas traditional IT Ops saw nothing but downside in any change to the technology stack and therefore worked, directly or indirectly, to slow the rate of change (and with it, innovation), the SRE's cultural positioning and mandate shift the incentives.

SREs are invested in protecting and enhancing the end-user experience of an application. As such, they share the development team's desire to make changes that will deliver on the promise of a better experience — thus breaking the cycle that has historically stifled innovation.

The challenge, of course, is that the SRE’s mandate is to balance this tension between the need for change and the need for reliability, availability, and performance. And this, as we’ve covered, is where observability plays a vital role.

Observability offers visibility, line of sight into the unknown unknowns, and the ability to use operational data to pinpoint the causes of degraded user experiences. These capabilities are the essential enablers that allow the SRE to strike this balance continually.

Moreover, the ability to leverage this vast amount of operational data, contextualize and enrich it, and then build automation around it suddenly makes the job of the SRE tenable.

Without these powerful, data-enabled capabilities, the SRE would be stuck between the same rock and hard place as IT Ops, unable to focus on anything other than maintaining the operational state. The combination of the SRE's cultural positioning and the data-first posture that observability provides enables the SRE to finally break the protect-at-all-costs paradigm that has been IT Ops' mandate and that has repeatedly limited innovation.

The Intellyx Take: A Clear Business Case

When most enterprises think of a business case, they’re looking at one thing: return on investment (ROI).

Measured the traditional way, adopting observability can certainly deliver an ROI. Merely shortening the duration of outages can reduce costs enough to provide a return that exceeds the observability investment.
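The outage-cost argument is simple arithmetic, and a short sketch makes its moving parts explicit. This is a minimal Python illustration under stated assumptions: savings come only from shorter outages, and ROI uses the classic (gain - cost) / cost formula. The `OutageAssumptions` class, its field names, and every input value are hypothetical placeholders, not figures from this article or from any vendor.

```python
from dataclasses import dataclass


@dataclass
class OutageAssumptions:
    """Hypothetical inputs; every value is an assumption the reader supplies."""
    incidents_per_year: int
    minutes_saved_per_incident: float  # assumed reduction in outage duration per incident
    cost_per_outage_minute: float      # assumed business cost of one minute of downtime
    annual_observability_cost: float   # assumed total yearly cost of the observability investment


def annual_savings(a: OutageAssumptions) -> float:
    """Cost avoided per year by resolving each outage faster."""
    return a.incidents_per_year * a.minutes_saved_per_incident * a.cost_per_outage_minute


def simple_roi(a: OutageAssumptions) -> float:
    """Classic ROI: (gain - cost) / cost. Positive means savings exceed the investment."""
    return (annual_savings(a) - a.annual_observability_cost) / a.annual_observability_cost


if __name__ == "__main__":
    # Purely illustrative numbers, not benchmarks.
    example = OutageAssumptions(
        incidents_per_year=24,
        minutes_saved_per_incident=30,
        cost_per_outage_minute=5_000,
        annual_observability_cost=500_000,
    )
    print(f"Annual savings: {annual_savings(example):,.0f}")  # Annual savings: 3,600,000
    print(f"ROI: {simple_roi(example):.1f}")                  # ROI: 6.2
```

With these made-up inputs, the savings are 24 × 30 × 5,000 = 3,600,000 against a 500,000 cost, for an ROI of 6.2. Any ROI above zero corresponds to the case the paragraph describes, in which faster recovery alone pays for the investment. Note that this sketch captures only the traditional ROI, not the "return on innovation" discussed next.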

But the real business case for observability (in combination with SRE discipline) is a different ROI: return on innovation.

The powerful intersection of an SRE's cultural positioning with the data-centricity of observability offers enterprises the opportunity to sustain the cultures of innovation that they have worked so hard to develop over the last several years. Doing so presents an opportunity to reap a significant return on investments already made.

The instantiation costs of adopting disciplines, approaches, and architectures such as cloud-native and DevOps have been high as organizations have had to overcome learning curves, acquire new technologies, and migrate projects. These efforts are now finally bearing fruit as organizations begin to scale them.

But as they scale, these efforts are becoming core, business-critical functions, increasing the risk that the need to protect their operational state will outweigh the need to sustain innovation.

Therefore, the real return on an investment in observability, within the context of SRE, is that it counters that risk, enabling the SRE to maintain balance and protect the culture of innovation you've worked so hard to build.

Copyright © Intellyx LLC. Moogsoft is an Intellyx client. None of the other companies mentioned in this article are Intellyx clients. Intellyx retains full editorial control over the content of this paper.

Moogsoft is the AI-driven observability leader that provides intelligent monitoring solutions for smart DevOps. Moogsoft delivers the most advanced cloud-native, self-service platform for software engineers, developers and operators to instantly see everything, know what’s wrong and fix things faster.

About the author


Charles Araujo

Charles Araujo is an industry analyst, internationally recognized authority on the Digital Enterprise and author of The Quantum Age of IT: Why Everything You Know About IT is About to Change. As Principal Analyst with Intellyx, he writes, speaks and advises organizations on how to navigate through this time of disruption. He is also the founder of The Institute for Digital Transformation and a sought-after keynote speaker. He is a regular contributor to and has been quoted or published in Time, InformationWeek, CIO Insight, NetworkWorld, Computerworld, USA Today, and Forbes.
