One of the great things about DevOps methodologies in action is that they build upon the enthusiasm of smart and motivated people who are all interested in working together to minimize downtime and ensure continuous innovation. It’s all about maintaining a sticky and reliable experience for both end-users and customers.
Monitoring tool vendors recognize this, and are trying to respond to DevOps’ needs. For example, I applaud AppDynamics for their efforts to encourage collaboration around an AppDynamics-related fault or performance threshold exception alert. However, in most environments, an Application Performance Management (APM) tool like AppDynamics is only a partial source of the information needed to ensure application and service quality. Its scope is based around isolated support domains and it demands a ‘linear’ process – that is:
- Review Alert
- Assess application performance
- Review log files
- Run a diagnostic
- Can’t resolve? Then escalate!
Once escalated, the ‘single Alert’ becomes an actionable work item for a more experienced person.
Yet such linear processes are inefficient, and often ineffective, because they take too long to troubleshoot complex problems. We’re all working in isolation from each other! We have no Situation Awareness!
Adding to this long-standing inefficiency (which sparked the DevOps revolution in the first place) is the fact that ‘Dev’ teams do not have any view into the underlying infrastructure. This leads to the inefficient investigation of phantom application faults, burning valuable resources along the way. Boom! Now you have: inefficient use of highly skilled people; fire-fighting reactions to issues after they have occurred; and significant and frequently deep business impact.
The point here is clear – you can socialize a ‘single alert’ with many people, but a single alert offers neither Situation Awareness nor a reason or context to collaborate with others. Situation Awareness comes from having a ‘context’ that is pertinent to me, and of which I should be fully aware.
“Situational Awareness” Enables Collaborative Remediation and Real Efficiency Savings
Here at Moogsoft, we started from the premise that Application, Compute and Network Operations teams all need to be “Situationally” aware and have the context of any issue that relates to them. In other words, if there is an issue with some infrastructure component that has impacted an Application, the DevOps teams and the respective infrastructure operations teams are now both stakeholders of that Situation.
By providing context to the Situation (the alerts which indicate both causality and collateral impact), the appropriate parties can more quickly diagnose whether they need to respond now to guarantee business continuity. This significantly reduces time wasted on ‘spam’ diagnosis and needless forensic investigation, while at the same time enabling proactive notifications to impacted customers and end-users.
There is yet another important fact to consider here. Today, most DevOps teams support instrumentation based around the principle of detecting ‘Performance’ (or Time Series) deviations. If performance or capacity trends deviate dramatically from a historical baseline, that is taken to mean something is wrong with the Application, and support resources react accordingly.
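As a rough illustration of the baseline principle described above, here is a minimal sketch that flags a sample when it sits far from the mean of recent history. The function name `deviates_from_baseline`, the z-score test, and the threshold of 3 standard deviations are assumptions made for illustration; they are not taken from any particular APM product.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, latest, threshold=3.0):
    """Return True when `latest` is more than `threshold` standard
    deviations away from the mean of `history` (a simple z-score test)."""
    if len(history) < 2:
        return False  # not enough samples to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all counts as a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical response times in milliseconds, used as the baseline.
response_times_ms = [120, 118, 125, 122, 119, 121, 123]

print(deviates_from_baseline(response_times_ms, 124))  # within normal variation
print(deviates_from_baseline(response_times_ms, 480))  # far outside the baseline
```

A check like this is only as good as its history: it presumes that past samples describe what ‘normal’ looks like.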
At Moogsoft, we have a somewhat different perspective on how this should work. Modern infrastructures (whether Public Cloud Infrastructures of elastic compute like AWS or Google, or Disruptive Enterprise Infrastructures combining legacy, outsourced and elastic compute) cannot be modeled. Topological and performance behavior baselines are elusive and unachievable goals.
Also, performance behavior typically deviates due to ‘Black Swans’ of various kinds. Ignoring such Events is kind of like driving down the Freeway at 100MPH (not that I’d do that of course!) with your eyes closed for 5 seconds and then open for 5 seconds (or texting while you drive I guess – again, something else I do not advocate!).
With this in mind, Incident.MOOG is built around two main guiding principles:
- Anomalous behavior should be detected without a preconceived definition of the anomaly. In other words, Moogsoft believes that one needs to treat the world of modern IT Infrastructures and DevOps in a manner that evokes a Donald Rumsfeld press briefing: We should realize that we are generally blind to the dynamic behavior of modern IT and so we need to be able to detect the “Unknown Unknowns”.
If we can detect anomalies that are “unknown unknowns”, we can also detect the “known unknowns” and the “known knowns” as well. This means no models. Models need to be maintained, and they are impossible to keep current as Disruptive Enterprise Infrastructures scale.
- Anomalous IT behavior often results from complex dependencies and will concern multiple parties or domains of support. This means that all the stakeholders to an anomaly should be Situationally aware of their relationship with that anomaly as soon as its context is determined.
The ideal approach to resolving such anomalies is with a collaborative remediation environment – what we call the “Situation Room” – a virtual incident workbench with a Facebook ‘wall’ type concept for ensuring that all the stakeholders to an issue are Situationally aware and that specialized domains of support can collaborate to efficiently resolve the issue.
If you want collaboration, you need a reason to collaborate. Incident.MOOG automatically detects when forms of anomalous behavior are occurring, then ‘push notifies’ the anomaly to the appropriate stakeholders, who can then diagnose and resolve the issue in a collaborative remediation environment.
Incident.MOOG does this by using real-time “data-centric” methods that do not rely on outdated and inaccurate behavior models, topology and rules to inform the correct stakeholders that there is a service-affecting situation that they need to collaborate around. Incident.MOOG also contextualizes each stakeholder’s relationship with that situation – e.g., Am I the owner of the causal indicators, or am I the impacted party?
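To make the idea of contextualizing each stakeholder’s relationship with a situation more concrete, here is a minimal sketch. It is not Incident.MOOG’s implementation or API: the `Alert` fields, the `stakeholder_roles` helper, and the component and team names are all hypothetical, and the sketch assumes each alert has already been marked as causal or collateral (how that is determined is not covered here).

```python
from dataclasses import dataclass

CAUSAL_OWNER = "owner of causal indicators"
IMPACTED = "impacted party"


@dataclass(frozen=True)
class Alert:
    source: str   # component that raised the alert (hypothetical names below)
    team: str     # team that supports that component
    causal: bool  # True if the alert indicates the cause, False if collateral impact


def stakeholder_roles(alerts):
    """Map each team to its relationship with the situation.

    A team that owns at least one causal alert is reported as the causal
    owner, even if it also has impacted components.
    """
    roles = {}
    for alert in alerts:
        role = CAUSAL_OWNER if alert.causal else IMPACTED
        if roles.get(alert.team) != CAUSAL_OWNER:
            roles[alert.team] = role
    return roles


# One hypothetical situation: a network fault with collateral impact upstream.
situation = [
    Alert("core-switch-07", "network-ops", causal=True),
    Alert("db-cluster-2", "compute-ops", causal=False),
    Alert("checkout-service", "devops", causal=False),
]

for team, role in stakeholder_roles(situation).items():
    print(f"notify {team}: {role}")
```

The point of the sketch is that each team receives the same situation with its own role attached, instead of an isolated alert with no context.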
Incident.MOOG’s approach is inclusive, providing a top-to-bottom view across the entire stack. As a result, Moogsoft’s customers – which range from the largest Internet portals to global banks and cloud service providers – are seeing a reduction of more than 60% in actionable work items and are receiving warning signs of developing situations hours (and sometimes days!) earlier than their previous processes provided them.
More importantly, DevOps teams are now able to reduce their reactive responses inside war rooms and become proactive with their end-users and customers, rather than reacting after the customer calls to complain (when the disruption has already occurred!).
So, how good is your Situation Awareness? Let Moogsoft help you embrace Situation Awareness for greater operational collaboration across Dev, Ops, and DevOps.
About the author Mike Silvey
An expert in IT operational management and technology commercialization, Mike launched SunNet Manager in the UK for Sun Microsystems before founding an open systems service management business at Micromuse, where he brought several innovative service management tools (such as Remedy) into the European market and established key OEM relationships (Cisco, HP, Intel) that led to successful IPOs for both Micromuse and RiverSoft. Today, Mike is focused on scaling Moogsoft by overseeing strategic business relationships with key partners around the globe.