If you look around the industry today, it’s obvious that DevOps has become more than an idea. It’s now how most of us do business. This change has brought us a lot of great tools that have made developing products and deploying them to production much simpler.
Pipelines that were once available only to large companies with the budget and teams to build them are now available to everyone. The result: an industry that is capable of making changes to production on an hourly basis, where daily changes would have been risky in the past.
This shift hasn’t been all roses for everyone. One camp feels that the evolution of DevOps has focused too heavily on development, and that operations hasn’t enjoyed an equal share of the improvements. The apparent limelight on developers makes sense from a business perspective, since most companies see developers as money generators and operations as a money pit. We have enabled developers to release features faster, keep customers happier, and improve products more quickly than competitors.
When Automation Falters, Humans Save the Day
With this rapid change model, the question becomes: how quickly are we able to respond when something goes wrong? Or how do we even know something is going wrong? We have new tools that can gather data at a very fine resolution, and we can feed this data back to a dashboard. The problem is that when automation fails, it still requires a human to put the pieces back together. This requires a deep understanding of the architecture, recent changes made to this architecture, and the backend code that drives everything.
Let’s also not forget that code isn’t the only thing that is changing rapidly. With more and more companies deploying to the cloud, infrastructure changes just as quickly. During a big spike in traffic, for example, instances may spin up for less than a minute to balance load. That’s another way of saying that a server is no longer a computer sitting in some dark datacenter.
Some designs have even gone serverless, where runtime environments become more ephemeral still. Your architecture no longer needs additional servers waiting for load to spike. You can react to the load in real time, and your automation and metric collection need to be able to handle the constantly growing and shrinking nature of this design pattern.
With this shift in infrastructure comes the ability to stop treating our systems as pets and start treating them as cattle. Pets require care and feeding. When they have a problem you take the time to nurture them back to health. Cattle, on the other hand, can usually get by on their own. They have the capacity to feed themselves, and can handle themselves through most common ailments. When there is an issue we shouldn’t need to troubleshoot the problem. Rather, our systems should be able to tell us what the problem is.
Human Capabilities Augmented by AI
This is where IT anomaly detection, IT monitoring, and IT notifications come in. With these systems in place we can get up-to-date information on the health of our systems and know why they are failing. Automation for restarts and rebuilds is great for quick fixes, but this data allows us to prevent future issues. The root cause could be an issue in our code, or even in our own infrastructure design. By having our systems notify us with this information, we can get right to fixing the root problem. This saves an engineer hours of researching the root cause, as all the data is already gathered.
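To make the idea of anomaly detection concrete, here is a minimal sketch, not any particular product's method. It assumes a simple rolling z-score: each new sample is compared against the mean and standard deviation of the last few samples. The function name `detect_anomalies`, the window size, the threshold, and the latency values are all hypothetical choices for illustration.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate sharply from the recent window."""
    recent = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(samples):
        # Only score a sample once a full window of history is available.
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            # Guard against a flat window, where sigma is zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        recent.append(value)
    return anomalies


# Hypothetical request latencies in milliseconds, with one obvious spike.
latencies_ms = [102, 98, 101, 99, 100, 103, 450, 101, 97]
print(detect_anomalies(latencies_ms))  # [6]
```

A real system would likely account for seasonality and keep flagged samples out of the baseline, but even this small version shows how a machine can point at the moment something changed instead of leaving a person to spot it on a dashboard.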
In many design patterns, services rely on yet more services. This means that when something does go wrong, the impact cascades. When one service is impacted, services upstream and downstream will also see a degraded state, if not outright downtime.
All of these complications may mean that the humans we have behind the wheel aren’t capable of keeping up on their own. We need to augment their abilities with machine-learning algorithms and AI. By allowing machines to learn and recognize anomalies in our systems we can allow operations to dig into why rather than where. From there they can correlate impacted systems into a single event. This allows us to look at the root of the problem while being aware of what other impact it has caused.
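Correlating impacted systems into a single event can be sketched without any machine learning at all, assuming we already know which services depend on which. The example below is an illustration under that assumption: it groups alerting services that are connected through dependencies and treats an alerting service with no alerting dependencies of its own as a probable root. The function `correlate_alerts` and the service names are hypothetical.

```python
from collections import defaultdict


def correlate_alerts(depends_on, alerting):
    """Group alerting services linked by dependencies into incidents.

    depends_on maps a service to the services it calls.
    Returns a list of (members, probable_roots) tuples, both sorted.
    """
    alerting = set(alerting)
    # Build undirected links, restricted to pairs where both ends are alerting.
    neighbors = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            if service in alerting and dep in alerting:
                neighbors[service].add(dep)
                neighbors[dep].add(service)

    incidents = []
    seen = set()
    for start in sorted(alerting):
        if start in seen:
            continue
        # Depth-first walk collects one connected group of alerting services.
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        # Probable root: an alerting service whose own dependencies are all healthy.
        roots = [s for s in group if not (set(depends_on.get(s, ())) & alerting)]
        incidents.append((sorted(group), sorted(roots)))
    return incidents


# Hypothetical dependency map and set of currently alerting services.
depends_on = {
    "web": ["api"],
    "api": ["db", "cache"],
    "billing": ["db"],
    "search": ["index"],
}
alerting = ["web", "api", "db", "billing", "search"]

for members, roots in correlate_alerts(depends_on, alerting):
    print(members, "probable root:", roots)
```

Here five separate alerts collapse into two incidents: one centered on `db`, which drags down `api`, `web`, and `billing`, and an unrelated one on `search`. That is the shift from asking where to asking why.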
Once the dust has settled and operations has fixed the ship, this data can then be leveraged to make better architecture decisions. Why did this happen? How can we prevent this from impacting us in the future? By leveraging AI to enhance our teams we can enable them to recognize and fix problems more quickly.
And this puts operations in the limelight.
About the author Thom Duran
Thom Duran is a senior cloud engineer at Moogsoft. He fell in love with technology at a very young age after playing Wolfenstein and realizing people could make these things. This love has turned into an obsession with simplifying building and deploying tech stacks, allowing a more diverse crowd to create things others will love.