Last month I attended the Monitorama conference in Portland, Oregon, and got to listen to several talks by some of the most DevOps-y folks in the world. A common theme across the sessions I attended was the challenge of collaborating across teams in complex environments.
Shifting from monolithic to Microservices architectures has clear benefits: change tolerance, fault tolerance, decentralization, scalability, ease of integration, and many other pleasant terms come to mind. On the flip side, Microservices make collaborating across an entire organization more complicated. Just think about the construction of a new application. Each new application typically requires a new team with a particular responsibility. As the number of apps grows and your Microservices become more sophisticated, you will likely create additional teams that work independently.
Today, you might have an entire network of applications that interact with each other in various ways, based on actions that get triggered. Issues that span multiple services are not only difficult to troubleshoot, but also require communication and collaboration across multiple teams. So how do you cope?
Something Broke, Is It You?
Merely communicating across an organization is easy enough. Most DevOps organizations use tools like Slack or PagerDuty to send messages and notify people about issues. These work quite well in small environments, where the number of suspects is manageable. But what happens as Microservices grow across an enterprise?
First, issues become more complex, and it's unclear who is responsible. PagerDuty can send notifications, but those notifications get escalated because the issues aren't understood and haven't been seen before. Second, the number of potential suspects to notify becomes so large that collaboration is essentially impossible.
Bryan Liles of Capital One explained this phenomenon best in an excellent talk at Monitorama this year. In his talk, he compares an enterprise to a complete graph: the dots are vertices, and the lines are edges.
You can calculate the number of edges with the formula edges = n(n − 1) / 2, where n is the number of vertices in the complete graph.
Liles uses the vertices in the complete graph as an analogy for the different monitoring teams in an organization, and the edges as an analogy for the various conversations that need to happen when an issue occurs. Leveraging the formula, you can figure out the number of conversations that need to happen depending on the number of teams in your organization.
- With five teams, ten conversations need to happen.
- With ten teams, 45 conversations need to happen.
- With 20 teams, 190 conversations need to happen.
- With 100 teams, 4,950 conversations need to happen.
- For the modern enterprise with perhaps 1,000 teams, there are now 499,500 possible conversations that need to happen.
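The growth above is just the complete-graph edge formula at work. A few lines of Python make it easy to verify the numbers for any team count:

```python
def conversations(n: int) -> int:
    """Edges in a complete graph with n vertices: n(n - 1) / 2."""
    return n * (n - 1) // 2

for teams in (5, 10, 20, 100, 1000):
    print(f"{teams} teams -> {conversations(teams)} possible conversations")
# 5 teams -> 10, 10 teams -> 45, 20 teams -> 190,
# 100 teams -> 4950, 1000 teams -> 499500
```

Note the quadratic shape: doubling the number of teams roughly quadruples the possible conversations, which is why adding headcount alone never keeps pace.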
Long story short: you can't communicate with everyone. Building more notification and escalation rules isn't going to help, because teams change, applications grow, and new failure scenarios arise.
Getting better context across your monitoring data will help.
Better Context Means Smarter Collaboration
You can improve context in many ways: higher-quality alerting, service mapping, or even knowledge articles. But the most powerful results I've witnessed come from two approaches:
Reduce the noise through Entropy:
I've spoken with too many Fortune 500 IT organizations to count (well, it was under 500), and they've all told me that their IT alerts are over 90% noise. Imagine trying to collaborate with the right teams across an enterprise IT organization when 90% of your context is nonsense.
By simply deduplicating alerts through an event management tool, you can increase the signal-to-noise ratio dramatically. At Moogsoft, however, we take it a big step further by introducing unsupervised machine learning through a technique known as Entropy.
This technique ranks streaming alerts in real time, in order of significance, and discards those deemed insignificant. In a recent proof of value with an enterprise healthcare provider, Entropy alone reduced alert volumes by 25% while delivering improved context through the remaining alerts.
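Moogsoft's Entropy technique itself is proprietary, but the underlying idea, ranking alerts by how much information they carry, can be sketched with plain Shannon self-information: a message that fires constantly carries little information (noise), while a rare one carries a lot. This is an illustrative stand-in under that assumption, not the actual algorithm; the function names and threshold are mine.

```python
import math
from collections import Counter

def information_score(alerts):
    """Score each alert message by its self-information, -log2(p).
    Frequent, repetitive messages score low; rare ones score high."""
    counts = Counter(alerts)
    total = len(alerts)
    return {msg: -math.log2(counts[msg] / total) for msg in counts}

def filter_noise(alerts, threshold=1.0):
    """Keep only alerts whose score clears the (hypothetical) threshold."""
    scores = information_score(alerts)
    return [a for a in alerts if scores[a] >= threshold]

stream = ["heartbeat ok"] * 8 + ["disk failure on db-01", "replica lag spike"]
print(filter_noise(stream))
# ['disk failure on db-01', 'replica lag spike']
```

Even this toy version shows the payoff: the eight repetitive heartbeats vanish, and the two alerts worth a conversation survive.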
Connect the dots through Correlation:
By leveraging algorithms that can interpret the attributes of streaming alerts (language, timestamp, location, prior occurrences, etc.), organizations gain far better context by investigating clusters of alerts rather than individual alerts. When algorithms accurately cluster a group of alerts, the scope of the issue becomes clear, and so does the set of teams that need to communicate. With effective correlation, instead of 100 people communicating, you might only need five.
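To make the idea concrete, here is a deliberately minimal sketch of attribute-based correlation: alerts that arrive within a short time window are clustered together, and each cluster's set of affected services tells you which teams to pull into the conversation. Real correlation engines weigh language, topology, and history as well; the `Alert` shape and 120-second window here are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    timestamp: float  # seconds since epoch
    service: str
    message: str

def correlate(alerts, window=120.0):
    """Cluster alerts that occur within `window` seconds of each other."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for cluster in clusters:
            if alert.timestamp - cluster[-1].timestamp <= window:
                cluster.append(alert)  # close in time: same incident
                break
        else:
            clusters.append([alert])  # too far apart: new incident
    return clusters

incident = correlate([
    Alert(0.0, "db", "connections refused"),
    Alert(30.0, "api", "500s spiking"),
    Alert(1000.0, "cache", "evictions rising"),
])
for cluster in incident:
    print({a.service for a in cluster})  # teams to bring together
```

The first two alerts land in one cluster, so only the db and api teams need to talk; the cache team is left alone, which is exactly the point.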
With Moogsoft AIOps, the correlation algorithms consistently deliver impressive results in leading enterprise IT environments. GoDaddy, for example, reduced internal call-outs by 66% within two months of using Moogsoft AIOps.
The goal of collaborating across an IT organization is to get the right information to the right people to resolve issues as quickly as possible. Doing this without bothering everyone else in the organization is now a serious challenge in an enterprise environment.
Something tells me that the complexity of managing microservices and communicating across teams is only going to increase. So it’s time to investigate how new technology can be used to automate what humans can no longer handle.
About the author: Sahil Khanna
Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.