Stability: The Role of Catastrophic Failure in Software Design

In this episode of Mooving to… Stability: The Role of Catastrophic Failure in Software Design, we had the opportunity to chat with Jeff Atwood. Yes, that Jeff Atwood, of Coding Horror, Stack Overflow, and Discourse (Chief Happiness Officer). Jeff started out writing 911 software for a small company in Boulder, Colorado, which was a crash course in writing software that has real consequences. Drawing on that unique and deep perspective, Jeff joins B.J. Maldonado and Sean Molloy from Moogsoft to discuss everything from the Chernobyl series to how communication is the best programming language possible.

Watch the full episode to learn Jeff’s three recommendations for preparing for disasters and more.

Chernobyl as a lesson in programming

Jeff’s first piece of advice is to forget about his experiences and go watch the HBO Chernobyl series. The key elements are all there; really think about what it means to fail and how to handle failure.

You learn about the systems that give you safety, and then politics gets involved, and as soon as politics gets involved everything becomes less safe. The series speaks to engineering on every level: responsibility, what we’re doing, and how it matters to the rest of the world.

Three levels of preparing for disasters

Jeff outlines three tried-and-true ways to prevent failures as much as humanly possible.

First, understand common failure patterns in software. You’re going to make mistakes, and learning how to adapt to that is the number one job of programming. That starts with understanding the failure patterns common to most software projects.

A great place to start learning common failure points is game development, because those projects are the most challenging.

Second, how do you fail? Understand the last five times you failed as an organization: why and how. You need your own playbook for how you deal with failure. Do not make the same old mistakes; make new ones.

Third, what can we measure? What can we run tests on to predict where there may be problems in the future? You want to anticipate future failures as much as possible. The question is not only how to fix this, but how to prevent it from ever happening again. In some cases it’s simply a setting, and sometimes we just pull the setting.

There are a lot of ways to predict future failures, whether you’re looking at graphs, smoke tests, unit tests, or responsiveness tests. Is the UI actually there?
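As a rough illustration of the smoke and responsiveness tests mentioned above, here is a minimal sketch in Python. It is not something described in the episode: the names `CheckResult`, `run_smoke_checks`, and `slow_after_seconds` are hypothetical, and each check is assumed to be a plain function that returns True when the thing it probes (a page, an endpoint, a UI element) is actually there.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str       # label of the check
    passed: bool    # False if the check returned a falsy value or raised
    seconds: float  # how long the check took to run
    slow: bool      # True if the check took longer than the threshold


def run_smoke_checks(
    checks: Dict[str, Callable[[], bool]],
    slow_after_seconds: float = 1.0,
) -> List[CheckResult]:
    """Run every check, timing each one and treating exceptions as failures."""
    results = []
    for name, check in checks.items():
        start = time.perf_counter()
        try:
            passed = bool(check())
        except Exception:
            # A crashing check is a failed check, not a crashed test run.
            passed = False
        elapsed = time.perf_counter() - start
        results.append(
            CheckResult(name, passed, elapsed, elapsed > slow_after_seconds)
        )
    return results


if __name__ == "__main__":
    # Stand-in checks; real ones would hit your own pages or services.
    sample_page = "<html><body><div id='app'>ok</div></body></html>"
    report = run_smoke_checks({
        "page_is_html": lambda: sample_page.startswith("<html"),
        "ui_root_present": lambda: "id='app'" in sample_page,
    })
    for result in report:
        status = "PASS" if result.passed else "FAIL"
        print(f"{status} {result.name} ({result.seconds:.4f}s, slow={result.slow})")
```

Recording the timing alongside pass/fail lets a check that still passes but keeps getting slower show up as a warning before it becomes an outage.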

Regardless, whenever something happens, you always want to have a postmortem to understand what happened, how it happened, and how quickly to resolve it. If you’re not doing that, you’re really not doing good engineering.

From single point of failure to single platform of failure

It used to be a single point of failure; now we’re looking at a single platform of failure. How can you prevent the single platform of failure, and when is it time to move to a different solution?

Don’t put everything in one region. Don’t overbuild – you don’t need 10 layers of redundancy. Particularly early on, don’t overbuild.

If you’re building a new product, maybe a little risk is OK.

Jeff also recommends the rule of three: unless a problem has happened three times, it’s not really a recurring problem. This applies to feedback and to errors. If you have had to go in and fix an area of code three times, that area of code is bad.

Gather data about one failure, have a postmortem, but don’t say you’re going to change everything because of this one failure. You don’t want to cause a fire drill where you’re building in incredible layers of complexity and redundancy. This gets expensive quickly. Make sure you really have a pain point before you start making big changes.
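The rule of three can be expressed as a simple count over your incident history. The sketch below is only an illustration, assuming each postmortem is tagged with the area of code or system it touched; the function name `recurring_problems` and the sample area names are made up for the example.

```python
from collections import Counter
from typing import Iterable, List


def recurring_problems(incident_areas: Iterable[str], threshold: int = 3) -> List[str]:
    """Return the areas that have failed at least `threshold` times.

    Areas below the threshold are still worth a postmortem, but by the
    rule of three they do not yet justify a big redesign.
    """
    counts = Counter(incident_areas)
    return sorted(area for area, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Hypothetical incident log: one entry per postmortem.
    history = ["billing", "auth", "billing", "search", "billing", "auth"]
    print(recurring_problems(history))  # ['billing']
```

Here only the area with three incidents is flagged; the areas with one or two incidents stay on the watch list rather than triggering a fire drill.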

Build a culture of respect to weather failures with grace

The purpose of a meeting is to have something happen.

For a postmortem, that means you’ve worked through the 5 whys, you address the root cause, and some positive change comes about because of the event, and people see it. A lot of people think that if you just throw more people at a problem, the problem will go away. It’s actually the opposite: adding people to a slow project makes things slower.

Success starts by building a culture of understanding and respect. That means listening to each other’s stories and learning from each other.

Start with an understanding that we’re going to push, we’re going to move forward, and we’re going to make mistakes. The key is, how quickly can we course correct, and what can we learn in the process so that we don’t continue to make the same mistakes?

This process of analyzing the failure, making a change, and trying again allows you to become more competent, more resilient, and more robust. Over time you get a deeper, more nuanced understanding, and that’s the ultimate goal.

Authentic communication

Human communication is the best programming language ever. It’s not about individual talent; it’s about having someone who can coordinate and get all these high-power talents to work together effectively. Coaching matters more than individual talent. You could be a genius programmer and still have no effect on the world whatsoever.

If you’re an effective communicator, you can do anything.

Episode 3: Mooving to… Stability: The Role of Catastrophic Failure in Software Design
