Experiencing Macro-Impact from Micro-Changes

Tuesday April 11 2017 | Mike Silvey

The unpredictability of multiple micro-changes in cloud environments, and why fault tolerance does not mean zero incidents.

Fault tolerance is now core functionality for modern cloud architectures. The benefit is full automation in orchestrating workloads across enterprise infrastructures, as well as a reduction in careless faults, like running out of CPU.

Fault tolerance, however, does not mean zero incidents.

Cloud infrastructures are self-healing to a certain degree. Faults, like deviations in the capacity or performance of individual components, can trigger load balancing and potentially thousands of “micro-changes” per day. What is commonly misunderstood is the impact of these micro-changes.

While single faults are often imperceptible to application end-users, the compounded impact of multiple micro-changes can, and often does, impact service.

Service impact is caused by a landslide of micro-changes. Salesforce’s outage last year as well as AWS’ issue this year were really due to micro-changes without oversight.

What is a Micro-Change?

A notable difference between the past and present of IT is change. In the past, change was planned and infrequent. Single changes could introduce failure conditions, which is why we used to look for the single root cause. Because of the stickiness of IT processes, some people still do. However, in today’s software-defined world, changes are unplanned and unpredictable. These changes occur constantly and typically go unnoticed because they occur on a micro level.

A micro-change could be a VM being moved to an alternate Hypervisor in order to balance load. Alone, the impact is minimal or even non-existent in a fault-tolerant cloud architecture.

For large-scale enterprise data centers or cloud service providers, thousands of micro-changes can happen in a single day. They are unpredictable, off the radar of application and infrastructure teams, and can have serious impact.

Failure vs. Unpredictable Behavior

The A380 airliner has 4 engines, but it can take off and land with 2. The likelihood of all 4 failing is low, but the fault tolerance is there. If the plane does lose thrust capacity, it could have an impact on other functions, like flight efficiency.

Another example closer to the ground is the BMW i3 range extender. The i3 can do 100 miles on its battery. Then, to offer some fault tolerance (read: range anxiety protection), it has a range extender engine. However, when running on the range extender to charge the battery that powers the electric motor, the car runs in limp mode, traveling at a max of 45 MPH. Imagine the consequence of that happening on the highway.

The Consequence of Change is Unpredictable

I’ve been approached by hardcore cloud enthusiasts who assert that there will never be any incidents because the systems are self-healing. The argument goes something like “constant performance monitoring means that deviations in capacity are detected and the system re-orchestrates itself to cope.”

The truth is that there is always a consequence to a change. The question is, how big?

Single micro-changes may not have significant impact. For example, a micro-change where a VM is moved from one hypervisor to another in order to balance load may not be perceived by Application or Service users. However, multiple micro-changes at the same time can lead to adverse application behavior and user experience. And if these micro-changes occur at the same time as network load latency issues, for example, the application impact will be more serious.
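One way to make the "multiple micro-changes at the same time" idea concrete is to count how many change events fall inside a short trailing time window and flag the moments when that count gets high. The sketch below is only an illustration under stated assumptions: the function name, the 60-second window, the threshold of 3, and the sample event times are all hypothetical and are not taken from any particular monitoring product.

```python
from collections import deque


def find_change_bursts(timestamps, window_seconds=60.0, threshold=3):
    """Return the event times at which the number of micro-changes seen in
    the trailing window is at or above the threshold (hypothetical helper)."""
    window = deque()
    bursts = []
    for ts in sorted(timestamps):
        window.append(ts)
        # Drop events that have fallen out of the trailing window.
        while ts - window[0] > window_seconds:
            window.popleft()
        # Several changes close together are treated as a burst worth a look.
        if len(window) >= threshold:
            bursts.append(ts)
    return bursts


# Hypothetical micro-change times, in seconds since the start of observation.
events = [0, 300, 600, 610, 620, 625, 1200]
print(find_change_bursts(events))
```

Isolated events (at 0, 300, and 1200 seconds here) are ignored, while the cluster around 600 to 625 seconds is flagged. A real system would also have to correlate such bursts with other signals, such as network latency, which this sketch does not attempt.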

Consider a highway built with 4 lanes, designed to cope with peak load capacity. However, when someone decides that the lane to the left appears to be moving faster (hint: it’s not) and tries to switch over, everyone slows down.

Now imagine 100 other cars also changing lanes at that same moment. The consequence is congestion. This sudden congestion could lead to someone carelessly bumping into a merging car, creating a 29-mile tailback.

Similarly, end-user experience and quality of service constantly face impact from micro-changes.

Traffic behavior cannot be modeled. Cloud infrastructure and Application topologies change all the time with hundreds, if not thousands, of micro-changes per day. But what is constantly neglected is the fact that it’s not possible to model behavior that you haven’t experienced before. This means that Cloud operators and Applications operators are at risk for the consequences of the interactions of multiple micro-changes.

Algorithmic Intelligence gives you Real-Time Insight

Unchecked faults and micro-changes lead to degraded capacity, which leads to non-deterministic behavior. We see this occur over and over again. Fortunately, a solution does exist. Algorithmic intelligence built for IT data is a necessary solution to gain real-time insight into risk of application impact.

Moogsoft AIOps helps modern IT Operations and DevOps teams become smarter, faster, and more effective by providing technological supplementation that automates mundane tasks, enables scalability, and frees up human beings to do what they do best: ideate, create, and innovate.
