As IT is asked to do more and more, faster and faster, there is no question that automation is the only way to keep up with demands from users and the business. Any situation where a human being is in the loop of delivering on a standard request is an opportunity for that process to be improved. Automating that process will get rid of friction: that dead time between when an operator is sent a request and when they receive it and act upon it, the non-zero chance that even the most conscientious operators may occasionally make a mistake, and the lack of visibility and documentation of what was done.
The problem is a lack of clarity about what to automate. The time saved through automation must at least equal the time invested in creating the automation and maintaining it over time. Premature automation can be even worse than no automation.
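To make that break-even test concrete, here is a minimal back-of-the-envelope sketch. The function name, its parameters and the figures in the example are illustrative assumptions, not measurements from any real team, and the model deliberately ignores harder-to-quantify gains such as fewer errors and better documentation.

```python
def net_hours_saved(manual_minutes_per_run, runs_per_month,
                    build_hours, upkeep_hours_per_month, months):
    """Estimate the net operator hours saved by automating a task.

    All inputs are hypothetical estimates supplied by the reader.
    A negative result means the automation has not yet paid for itself.
    """
    # Hours of manual work avoided over the whole period.
    saved = manual_minutes_per_run / 60 * runs_per_month * months
    # One-off build effort plus ongoing maintenance over the same period.
    invested = build_hours + upkeep_hours_per_month * months
    return saved - invested


# A frequent request: 30 minutes each, 40 times a month, judged over a year.
frequent = net_hours_saved(30, 40, build_hours=80,
                           upkeep_hours_per_month=2, months=12)

# The same request arriving only twice a month.
rare = net_hours_saved(30, 2, build_hours=80,
                       upkeep_hours_per_month=2, months=12)

print(frequent, rare)
```

With these assumed numbers, the frequent request comes out well ahead within the year, while the rare one is still deep in deficit: the same automation effort can be a good investment or a premature one depending on how often the task actually occurs.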
Human operators are incredibly flexible when dealing with corner cases that would stop automated systems dead in their tracks, or worse, cause them to behave in unexpected ways. I discussed some of these failure modes in “The Benefits and Pitfalls of IT Automation”.
New roles are emerging within the IT industry that are dedicated specifically to automation. The result is the same whether they are specifically called “automation architects,” or the job falls under the wider heading of “enterprise architecture,” or whether the automation function falls within a role such as Site Reliability Engineering (SRE). As more and more of the compute and network infrastructure becomes software-defined, and therefore easily software-addressable, it becomes more and more important to think carefully and deeply about how best to take advantage of these new capabilities.
Don’t Limit Your Talent Pool
These days, automation within the context of IT operations tends to be associated closely with SRE. Some definitions of SRE, including Google’s own, focus on the software engineering aspects of automation.
According to Google’s Ben Treynor, “Fundamentally, it’s what happens when you ask a software engineer to design an operations function.” The problem with that definition is that it is incomplete. Software engineering experience is going to be vital to implementing automated functions, and more importantly, that background can help when it comes to looking at a wider process and identifying opportunities to redesign it to leverage automation.
However, IT operations as a function also requires deep domain expertise, which “pure” software engineers may not have. For example, the IT support director at a huge European company recently shared with me that it had taken them nearly three years to replace a very senior person who had left the team. The problem was not about skills; while they are not that common, and therefore tend to be quite expensive, people with the right mix of skills and seniority were available on the market. Even once that person had been hired, though, it took them a significant amount of time to get to grips with a codebase that was decades old, had been worked on continuously for all of that time, had millions of users around the world, and whose infrastructure and operations had in turn evolved around those conditions.
This sort of senior IT operations person may have never held a job with “software engineer” in the title, but I can guarantee that they will have written thousands of lines of code every year as part of their IT ops job, and much of that will still be in active production.
When looking for an enterprise automation architect, don’t discount system administrators. These people are heavily incentivised to automate away routine tasks, if only so they don’t get paged at 3am to respond to the same issue yet again.
Remember the Human Factors
So if we bring together a good mix of IT ops and software engineering skills, we should be good? Perhaps not. There is another major component to the job, and it is not a technical skill but a human one.
The job of an architect is not simply to build a shelter, but to do so in a way that responds to a customer brief, taking into account the terrain and what might have been done before and might be needed in the future.
Enterprise IT architects are exactly the same: they are the link between the system and its users, able to understand the desired outcome and propose compromises or alternatives that can help achieve that outcome. In the other direction, if there are technical factors that influence capabilities accessible to the users (making new functionality available, or requiring a user-visible change), it is the IT architect’s job to communicate this, including the benefits and any pitfalls.
AIOps can help the enterprise architect to achieve their goals by automating away many routine tasks that could otherwise distract IT staff from taking the more strategic view. When the entire event management console is red, it’s hard to focus on the long term. If automated noise reduction and algorithmic correlation across different data sources can help reduce the events to a manageable number, and ensure that they are real and actionable, IT staff will be better able to respond to requirements from users. One Moogsoft customer saw a ten-fold increase in the productivity of their IT Operations.
The next step is to put these gains to good use, by taking a strategic view and focusing on users’ requirements, instead of being constantly distracted by noise from monitoring systems. The role of enterprise automation architect is key to realizing these benefits in practice.