Chapter Nine: In Which Dinesh Experiments with Chaos Engineering
Helen Beal | July 30, 2021

This is the ninth chapter in The Observability Odyssey, a book exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this chapter, our SRE, Dinesh, plans and executes his first experiment in chaos engineering and shares his learnings.

Another day, another drama! This one, though, is very much of my own making. I have been wanting to try my hand at a bit of chaos engineering for some time now, but C&Js just hasn’t been ready. Sarah’s been up for it too, though, at Animapanions. And now that our CIO, Charlie, has seen MTTR drop across every single technology team, thanks to the rollout of Moogsoft and the new incident management system (kudos to James), it’s pilot day.

Talking of pilots, my old man is a retired civil aviation captain, and I’ve found it really useful to talk about his work when trying to convince people that chaos engineering’s a valuable practice. I was chatting with James’ nemesis, Jeff, in the finance department the other day. The conversation went a bit like this:

“Um. So I really don’t understand why I would break the thing I spend my entire life trying to keep running, on purpose.” Jeff did look genuinely baffled, to be fair. It was clear that some serious neural pathway reengineering was on the cards.

“There are a lot of similarities between aviation and information technology. They are both highly complex environments, and when something inevitably goes wrong, it can be catastrophic and rarely has a single cause.” I knew that chatting with Jeff about airplanes was going to catch his interest. He’s into aerobatics in a big way. But, let’s face it, everyone wants to know how to be safe when we fly.

“That’s why we work so hard to maintain our aircraft. And our systems,” Jeff said, wisely, I will acknowledge.

“True, that. And I’m not diminishing the importance of the constant monitoring and updates we do,” I said, not mentioning the recent outage James had told me about where a severity one was caused by an unpatched firewall. “But this is next level. Belt AND braces if you like. When I was growing up I remember asking my dad why he spent so much time in the simulator. Sometimes it seemed like he was in there more than he was in the air.”

“I do like Blue Angels,” said Jeff, a dreamy look in his eye as he, no doubt, recalled his latest gaming bout. We were both big gamers.

“I pre-ordered the new Flight Simulator last week,” I said, getting slightly off track.

“No way! Me too. London Heathrow here we come!” At least we seemed to be bonding.

“So, my father said to me that every time he was in the simulator, he practiced what to do when something went wrong, so if it went wrong in the air, he’d know how to fix it.”

“Oh,” said Jeff, fiddling about on his phone. I guessed he was googling. “Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production,” he quoted. “I see what you’re getting at.” Say what you like about Jeff, he’s quick. “When did you last try this?”

To be fair, I can understand why he didn’t want to be my guinea pig. He does have one of the most notoriously unstable systems in the company. Sarah, on the other hand, has her MTTR at its lowest ever, and the lowest in the company, thanks to Moogsoft and her DevOps toolchain. She’s also practicing a new swarming technique and looking for ways to tackle the technical debt in her own system, as she still gets a surprising number of incidents.

Chaos engineering isn’t just a fire drill - it’s also a way of finding underlying problems and technical debt and fixing them before they cause an incident.

Sarah had her own, pertinent, analogy for chaos engineering that I enjoyed too. She said she thinks of it like a vaccine, where you inject yourself with a small amount of a potentially harmful foreign body to build resistance and prevent illness. She said it’s a tool we use to build immunity in our technical systems by injecting harm (like latency, CPU failure, or network black holes) to find and mitigate potential weaknesses.
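(An aside for the technically curious: the simplest way I can show the “inject harm” idea is a toy latency fault in Python. This is a made-up sketch, not a real chaos tool; the fault rate, the delay, and fetch_pet_profile are all invented for illustration, and real chaos tooling injects faults outside the application code.)

```python
import random
import time

# Toy illustration of injecting a latency fault into a service call.
# FAULT_RATE, MAX_DELAY_SECONDS and fetch_pet_profile are hypothetical.
FAULT_RATE = 0.1          # slow down roughly 10% of calls
MAX_DELAY_SECONDS = 2.0   # worst-case artificial delay

def with_latency_fault(call):
    """Wrap a callable so a small fraction of calls are artificially slowed."""
    def wrapped(*args, **kwargs):
        if random.random() < FAULT_RATE:
            time.sleep(random.uniform(0, MAX_DELAY_SECONDS))
        return call(*args, **kwargs)
    return wrapped

@with_latency_fault
def fetch_pet_profile(pet_id):
    # Stand-in for a downstream call you want to build immunity against.
    return {"pet_id": pet_id, "status": "ok"}
```

A small dose of harm, observed carefully, tells you whether the system copes before a real incident does.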

She arrived at her desk, surprised, but I think pleased, to find me waiting in her chair.

“Excited, much?” she asked, handing me the tea she’d picked up for me. I like Sarah.

“Absolutely! This is going to be great - a game-changer.”

“I know! The swarming team is assembling. The chaos tool is running. Moogsoft is ready and tuned. We have our hypothesis and experiment defined. The ‘what could go wrong?’ exercise is complete.” Sarah waved at the whiteboard we’d all been working on yesterday. “I think we’re ready to go.”

“And we’re experimenting with a ‘known known’ today?”

“We’ll start there, yes. We need to start with something we are aware of and understand. Then we’ll move onto the things we know about but don’t really understand and work our way up to things we don’t know about and don’t understand.”

“That sounds sensible.” We grinned at each other.

“Ok,” Sarah said, calling her team together in the manner of a daily stand-up. “Rachel, please can you remind us of the experiment we decided to run today?”

Rachel is Sarah’s lead support engineer and used to work in the network team in C&Js. You can see how much she’s blossomed at Animapanions. She’s been given space to grow and is rarely constrained by the bureaucracy that used to get her down.

“Today, we’re going to test the hypothesis that the network is not reliable. We know this to be true, not just because we’ve observed it many times ourselves,” she paused as her teammates chuckled, “but because one of the fallacies of distributed computing is that the network is reliable.”

“Thanks, Rachel,” said Sarah. “This is an example of a dependency that’s out of our control, and sadly always will be. We can’t follow our DevOps mantra of ‘don’t manage dependencies, break them’ here, so we must understand how the system reacts when the dependency is unavailable and do our best to shield our customers.”

“Totally,” said Rachel. “So we’re going to do a network black hole chaos experiment that will make the designated addresses unreachable from Animapanions. Once we've applied the black hole, we will check if we can start up normally and serve customer traffic without the dependency. We’ll also be keeping a close eye on Moogsoft to see what else it can tell us about what’s happening.”

“Exciting stuff!” said Sarah. “What results are we expecting?”

“We expect that traffic to the dependency drops to zero (or slows right down), startup completes without errors, application-level metrics in steady state are unaffected, traffic to the fallback systems shows up and succeeds, and the dependency alerts and pages may fire. We’re scoping this to a single instance.”

That’s pretty much what happened. It looked like a switch was gone and the SDN rerouted successfully. We could see it all in Moogsoft.
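(For anyone wondering what “applying a black hole” actually looks like, here’s a rough sketch. It’s hypothetical, not Animapanions’ real tooling: 192.0.2.10 is a reserved documentation address standing in for the dependency, and HEALTH_URL is a made-up steady-state check. It needs root on a Linux host.)

```python
import subprocess

import requests  # third-party HTTP client: pip install requests

# Hypothetical black-hole experiment. DEPENDENCY_ADDR and HEALTH_URL are
# invented for illustration; run on a single instance only, to keep the
# blast radius small.
DEPENDENCY_ADDR = "192.0.2.10"
HEALTH_URL = "http://localhost:8080/health"
RULE = ["OUTPUT", "-d", DEPENDENCY_ADDR, "-j", "DROP"]

def apply_black_hole():
    # Drop all outbound packets to the dependency.
    subprocess.run(["iptables", "-A", *RULE], check=True)

def revert_black_hole():
    subprocess.run(["iptables", "-D", *RULE], check=True)

def steady_state_ok():
    # The hypothesis: the service stays healthy without the dependency.
    try:
        return requests.get(HEALTH_URL, timeout=5).ok
    except requests.RequestException:
        return False

if __name__ == "__main__":
    apply_black_hole()
    try:
        print("Healthy without dependency:", steady_state_ok())
    finally:
        revert_black_hole()  # always roll back, whatever the result
```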

Rachel had baked us a cake for the retrospective.

“That was all good,” said Sarah as she sliced into it. “Remember though, success and failure should both be celebrated. Failure is a learning opportunity. As long as we protect ourselves from catastrophic failure that impacts our customers and harms our business and ourselves, whatever the result, the experiment is useful.”

“Exactly,” I said. “That’s why we limited the blast radius of that experiment and were ready to switch back to the blue environment if we needed.”

“When can we do it again?” asked Rachel.

“What do you want to do next?” asked Sarah. “We could do the network again? Test latency, packet loss, or DNS? See what happens under DDoS? Or we could do something with resources like CPU, memory, IO, or disk? Or we could go really chaotic and look at state. Shutdowns, time changes, or killing processes, anyone? Now, if only there were a way to do this continuously…” said Sarah, thoughtfully.

Recommended resource: Intelligent Observability: Connecting All That Data


Want to learn more?

Register and attend the live webinar Intelligent Observability: Blamefree Retrospectives on Tue, Aug 3 at 9am PT | 12pm ET | 5pm BST.

Moogsoft is the AI-driven observability leader that provides intelligent monitoring solutions for smart DevOps. Moogsoft delivers the most advanced cloud-native, self-service platform for software engineers, developers and operators to instantly see everything, know what’s wrong and fix things faster.

About the author

Helen Beal

Helen Beal is a DevOps and Ways of Working coach, Chief Ambassador at DevOps Institute and an Ambassador for the Continuous Delivery Foundation. She provides strategic advisory services to DevOps industry leaders and is an analyst at Accelerated Strategies Group. She hosts the Day-to-Day DevOps webinar series for BrightTalk, speaks regularly on DevOps topics, is a DevOps editor for InfoQ and also writes for a number of other online platforms. Outside of DevOps she is an ecologist and novelist.
