Moogsoft’s expert team has convened a DevOps post-mortem on behalf of Ingen, Inc and Jurassic Park. Sure, the park ran on “a UNIX system,” but where was their observability system? Did they really “spare no expense”? Read on for Jurassic Park's top 7 DevOps missteps.
Top 7 DevOps Antipatterns at Jurassic Park
Jurassic Park is many things: a one-of-a-kind zoo, a theme park of the future, a groundbreaking science lab, an investment vehicle, and a luxury resort. What it is most decidedly not is a pinnacle of good DevOps practices, culture, or respect. Multiple critical oversights led not only to the attempted theft of company property but also to a litany of human and dinosaur deaths. Even without the illicit intervention of its chief software engineer, Jurassic Park’s software engineering missteps would have led, sooner or later, to the demise of the park.
After careful consideration, Moogsoft’s expert team of engineers and business analysts has laid out the park’s problems and their potential solutions in the points below:
7. Lack of staffing
Hammond repeatedly claims to have “spared no expense,” but he clearly did, at least when it comes to engineering personnel. Nedry gets into an argument with Hammond over his pay. We’re led to believe that Nedry is simply greedy; however, Ingen has only one administrator with root access. To the “hit by a bus” and the more innocuous “win the lottery” scenarios that teams use to gauge their dependence on a single person, perhaps we should add “attempted to steal company property and was gruesomely killed by a Dilophosaurus.” Arnold (Samuel L. Jackson) even directly admits to Hammond: “I can’t get Jurassic Park back online without Dennis Nedry.” And it’s not for a lack of using popular technology: Lex, a preteen girl, is able to ascertain that “It’s a UNIX system!”, albeit one with a one-of-a-kind animated 3D file browser, and re-engage the locking mechanisms in the Visitor Center. InfoSec issues aside (after all, there were no park employees left alive by this point), her ease of access points to a critical lack of staffing. If the system is so easy to learn, why not have more trained engineers, or even a preteen intern or two, ready for critical situations? That way, even if one or two are eaten by velociraptors, there is still enough staff to respond to incidents.
6. Lack of resources
As a tropical storm approaches the island, we learn that Ingen hasn’t even provided Nedry the necessary “compute cycles” and “memory” to debug the tour program and maintain all of their other systems. And after “debugging the phones,” Nedry says the “system will be compiling for 18-20 minutes,” during which time “some of the minor systems might go down.” Even if his intentions were good, Nedry would have had to juggle the park’s servers, forced to choose among maintaining a functioning park, debugging the software, and compiling new code, never doing all three at once.
5. No postmortem plan
Once the park has fallen into chaos, Hammond claims, “We relied too much on automation.” Automation didn’t destroy Jurassic Park, and the question of what did should be left to a cool-headed meeting of stakeholders and engineers. Instead, the company’s founder wistfully stares off into the distance, tears welling up in his eyes as he nervously paws his cane. This isn’t just the lack of a postmortem template; it is a downright toxic way to analyze and remedy flaws within the company’s process.
4. Lack of DevOps culture
Though Arnold is understandably frustrated by Nedry’s likeness waving its finger in his face and taunting, “Ah ah ah, didn’t say the magic word,” it speaks to a lack of respect for DevOps among Ingen employees that he derides Nedry’s creative if illegal program as “hacker shit.” And although Malcolm’s frantic catchphrase “Must go faster!” is never addressed to engineering personnel, one can easily imagine it being uttered to Nedry as the project moved along.
3. No manual or automated testing
Nedry is able to push his changes directly to production without any testing whatsoever. This lets him explain away the security failures that allow him to steal the dinosaur DNA and attempt his escape. Code reviews, automated testing via continuous integration, and verifiable manual test cases would likely have caught his changes, or at the very least forced him to find a less systemically disruptive way of stealing the DNA.
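As a rough illustration of the kind of gate that code review and continuous integration provide, here is a minimal Python sketch. The `Change` record and the `can_deploy` rule are hypothetical names invented for this example, not the API of any real CI product; the only assumptions are that a change records its author, its approvers, and whether its automated tests passed.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """A proposed change to production (hypothetical record for illustration)."""
    author: str
    approvers: set = field(default_factory=set)
    tests_passed: bool = False


def can_deploy(change: Change) -> bool:
    """Allow a deploy only if CI passed and someone other than the author approved."""
    independent_reviewers = change.approvers - {change.author}
    return change.tests_passed and len(independent_reviewers) >= 1


# Nedry approving his own "debugging the phones" change is not enough.
self_approved = Change(author="nedry", approvers={"nedry"}, tests_passed=True)
reviewed = Change(author="nedry", approvers={"arnold"}, tests_passed=True)
print(can_deploy(self_approved), can_deploy(reviewed))
```

A rule like this only works if there is a second engineer to do the reviewing, which brings us back to item 7.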
2. Infrastructure is all rolled by hand, and not committed to a VCS
When Arnold finally does start to debug exactly how Nedry’s attack worked, he determines that the “keystrokes” were not captured, and therefore he will have to go through 2 million lines of code. Working with a complex architecture can be challenging, but so long as the infrastructure is defined in easy-to-understand code under version control, reverting changes should be as simple as checking out the repository at an earlier point in time and redeploying. For the business logic itself, finding the place where malicious code was inserted should be trivial if a version control system is used. Instead, the only option they see available is to “shut down the entire system.” In an emergency, sometimes drastic measures are needed, and kudos to Hammond for having the idea. However, Arnold opposes it because they have “never shut down the entire system before.” Were their entire system specified as a Kubernetes (k8s) deployment, they would have no such nervousness.
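To show why version history makes the hunt for inserted code so much cheaper than reading 2 million lines, here is a small Python sketch of the binary search idea behind tools such as `git bisect`. The revision names and the `is_bad` check are hypothetical; the sketch assumes the oldest revision is known to be good, the newest is known to be bad, and every revision after the first bad one is also bad.

```python
from typing import Callable, Sequence


def first_bad_revision(revisions: Sequence[str],
                       is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad revision, testing only O(log n) of them.

    Assumes revisions are ordered oldest to newest, revisions[0] is good,
    and revisions[-1] is bad.
    """
    lo, hi = 0, len(revisions) - 1  # invariant: revisions[lo] good, revisions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # the culprit is at mid or earlier
        else:
            lo = mid  # the culprit is after mid
    return revisions[hi]


# Hypothetical history of ten revisions in which the backdoor lands in r6.
history = [f"r{i}" for i in range(10)]
print(first_bad_revision(history, lambda rev: int(rev[1:]) >= 6))  # prints "r6"
```

With ten revisions the search needs three checks; with a thousand it needs about ten, which beats reading every line by hand while the fences are down.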
1. Limited continuous monitoring, observability, and AIOps
Arnold is looking at a “Control Room / Plan View” that probably works great for one-off problems, but for larger problems he is simply faced with a screen of red “UNLOCKED” or “CHECK” warnings that leave him totally lost as to what is going on. We know these “glitches” were correlated across time, but that’s only one small part of the puzzle. With an AIOps solution, the alerts could also have been correlated by the acting user and by whether each door was unlocked remotely or locally. Arnold’s health would have thanked him, as he would have smoked at least 10 fewer cigarettes, and possibly avoided being slashed by a velociraptor. The problem only spirals further out of control as the dinosaur fences begin to fail throughout the park. At this point, Hammond sends Arnold to “find Nedry.” A paging system should have been ringing his, eh, beeper long ago, or there should have been a clear on-call schedule to handle outages. To check a critical piece of infrastructure, the raptor fences, Arnold has to manually type a command into a terminal. If this information is so critical, a failure should bubble to the top of the pile with a CRITICAL severity, and the engineers should have faith that their system will alert them in the event of a failure.
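The correlation described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how Moogsoft or any other AIOps product works: the `Alert` fields, the five-minute window, and the two severity levels are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: int  # seconds since the start of the incident (hypothetical)
    source: str     # e.g. "raptor_fence"
    status: str     # "UNLOCKED" or "CHECK"
    user: str       # account that triggered the change
    remote: bool    # True if the change came from a remote session
    severity: str   # "WARNING" or "CRITICAL"


def correlate(alerts, window=300):
    """Group alerts by (user, remote), splitting a group when the gap
    since its previous alert exceeds `window` seconds."""
    groups = defaultdict(list)  # (user, remote) -> list of incidents
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        incidents = groups[(alert.user, alert.remote)]
        if incidents and alert.timestamp - incidents[-1][-1].timestamp <= window:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return [incident for incidents in groups.values() for incident in incidents]


SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}


def prioritize(incidents):
    """Sort incidents so that any containing a CRITICAL alert come first."""
    return sorted(incidents,
                  key=lambda inc: min(SEVERITY_ORDER[a.severity] for a in inc))


alerts = [
    Alert(0, "gift_shop_door", "CHECK", "arnold", False, "WARNING"),
    Alert(30, "visitor_center_door", "UNLOCKED", "nedry", True, "WARNING"),
    Alert(60, "raptor_fence", "CHECK", "nedry", True, "CRITICAL"),
]
for incident in prioritize(correlate(alerts)):
    print([a.source for a in incident])
```

Instead of a wall of red, Arnold would see one incident at the top of the list: two changes made remotely by the same user within a minute of each other, one of them on the raptor fence.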