Enterprise IT Monitoring is more than just a discipline. Monitoring systems are entities that grow organically as the organizations that use them grow. At first, with only a few systems and focused resources, there is little that the supporting teams aren’t aware of. With the first system failures and their business impact, however, monitoring solutions are put in place to close the monitoring gap and prevent further issues, or at least detect them within reasonable time limits.
In addition to these IT Monitoring systems, many enterprises rely on the ITIL framework and have thus implemented complex IT Service Management systems (ITSM) to handle IT workloads, adherence to processes (Change Management, Incident Management, etc.), and asset tracking in Configuration Management Databases (CMDB).
As enterprises grow, business decisions also impact the scope and breadth of Enterprise IT Monitoring systems. Geographic expansion, diversification, mergers & acquisitions, organizational changes, and segmentation between technological silos and teams are all contributing factors to the implementation of multiple monitoring systems, for varied reasons — convenience, technical limitations of one solution versus another, lack of central procurement, failure to see benefits of using a common platform, etc.
Global corporations are in even more difficult situations, with massive replacement projects that may run for months if not years. To add further complexity, these multiple monitoring platforms are tied closely to ITSM systems, making rip & replace initiatives nearly impossible.
Despite the perils of increasing complexity and rapid change, however, it’s possible to get a firm grip on your IT environment and simplify your monitoring solution so that it stays clear even as the potential for chaos grows. Here are five tips for modernizing Enterprise IT Monitoring…
1. Clean the House
Asset Management is the discipline of managing IT assets (also called CIs, Configuration Items), assigning them SLA/SLOs (Service Level Agreements/Objectives), support contracts, responsible support teams, etc.
While this seems unrelated to Enterprise IT Monitoring, there is a strong correlation between efficient enterprise IT monitoring and a structured data model in a customer’s Configuration Management Database (CMDB).
A well-established data model should map dependencies between systems, components and subcomponents. This helps to better correlate incidents as well as changes on groups of interdependent CIs. For example, if a monitoring system reports that two network switches are down, and an entire manufacturing site’s infrastructure depends on those two switches, it’s highly likely that all the other systems at that site connected to those switches will also be unavailable. If there is a dependency between the network switches and all the other CIs (whether in monitoring or in CMDB), the likelihood of getting flooded by alerts is considerably lessened (provided that the existing monitoring systems can understand the hierarchy of the data model in the CMDB).
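The switch example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DEPENDS_ON` mapping stands in for a hypothetical CMDB extract, the CI names are invented, and the rule used (a down CI is treated as a symptom when anything it depends on is also down) is one simple way to read the dependency model, not the behavior of any particular monitoring product.

```python
from collections import deque

# Hypothetical CMDB extract: each CI maps to the CIs it directly depends on.
DEPENDS_ON = {
    "erp-app-01": ["switch-a", "switch-b"],
    "plc-gateway-01": ["switch-a", "switch-b"],
    "file-server-01": ["switch-a"],
    "switch-a": [],
    "switch-b": [],
}


def upstream(ci, depends_on):
    """Return every CI that `ci` depends on, directly or transitively."""
    seen = set()
    queue = deque(depends_on.get(ci, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(depends_on.get(current, []))
    return seen


def split_alerts(down_cis, depends_on):
    """Split down CIs into probable root causes and downstream symptoms.

    A down CI counts as a symptom when at least one CI it depends on
    is also down; otherwise it is reported as a probable root cause.
    """
    down = set(down_cis)
    roots, symptoms = [], []
    for ci in sorted(down):
        if upstream(ci, depends_on) & down:
            symptoms.append(ci)
        else:
            roots.append(ci)
    return roots, symptoms
```

With all five CIs reported down, only the two switches would be surfaced as probable root causes, and the three dependent systems could be attached to them instead of each raising its own alert.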
2. Fight Alert Burnout: Cut the Noise
Structuring dependencies between CIs is a great way to reduce noise (i.e. too many alerts), but there is still more that can be done.
Operations people know that too much monitoring equals no monitoring. Teams often configure monitoring at different levels because of organizational constraints (for example, teams must leverage an enterprise tool but do not trust it), or because there is no other solution (the enterprise tool does not support the monitoring methods they need). Multiple, overlapping alerting systems that use different protocols and output channels end up creating a lot of noise in the form of alerts, most of which end up delivered via e-mail.
The integration of infrastructure components alerting with IT Service Management systems and traditional Enterprise Monitoring platforms creates additional noise in the form of automatically generated Alerts and Incidents anytime a single threshold is breached on a single CI.
In case of bulk incidents (multiple CIs affected because of one common root cause, prolonged alert state, or thresholds violated for a given amount of time), an enterprise monitoring system may begin automatically creating many individual Incidents (one per affected CI) in the ITSM system.
Besides handling the issue(s) at stake, the operations team now must handle the tedious task of cleaning up and closing bulk and duplicate Incidents.
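One common way to limit this clean-up work is to fold repeated alerts for the same CI and check into a single open Incident instead of opening a new one per threshold breach. The sketch below assumes alerts arrive as `(timestamp, ci, check)` tuples sorted by timestamp; the `Incident` class, the `fold_alerts` function and the 15-minute window are illustrative choices, not part of any specific ITSM product.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    key: tuple          # (ci, check) pair the incident was opened for
    first_seen: float   # timestamp of the first alert, in seconds
    last_seen: float    # timestamp of the most recent alert, in seconds
    count: int = 1      # number of alerts folded into this incident


def fold_alerts(alerts, window_seconds=900):
    """Fold repeated alerts into one incident per (ci, check).

    `alerts` is an iterable of (timestamp, ci, check) tuples sorted by
    timestamp. An alert joins the latest incident for its key when it
    arrives within `window_seconds` of that incident's last alert;
    otherwise a new incident is opened.
    """
    latest = {}
    incidents = []
    for ts, ci, check in alerts:
        key = (ci, check)
        incident = latest.get(key)
        if incident is not None and ts - incident.last_seen <= window_seconds:
            incident.last_seen = ts
            incident.count += 1
        else:
            incident = Incident(key, ts, ts)
            latest[key] = incident
            incidents.append(incident)
    return incidents
```

A prolonged alert state then shows up as one Incident with a rising `count`, rather than as a queue of duplicates that someone has to close by hand.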
The permanent overexposure to alerts arriving from all sides creates alert fatigue in IT personnel: a state where all alerts, even legitimate ones, are ignored because their sheer volume makes it impossible to prioritize one over another. This in turn pushes teams into reactive mode, where incidents are handled only when business stakeholders inform IT about downtime; or worse, it creates a state where the business stops trusting IT and escalates IT issues directly to senior management, with all the consequences this may imply.
Traditional Enterprise IT Monitoring systems that do not use algorithmic IT operations (AIOps) tools are constrained by the lack of contextual awareness in alert & incident handling, and are not able to learn and improve from past situations. They are also not able to proactively detect situations that may turn into major incidents, because they do not understand the potential dependencies between seemingly unrelated systems.
3. Tear Down the Walls, Bring People Together
Because teams may have to handle bulk incidents affecting many CIs, we should also look at how enterprises manage collaboration across teams and infrastructure silos to drive incident troubleshooting and resolution.
Today’s environments create many complexities, and very few people possess knowledge covering all the tools, processes and configurations, let alone why some architectural decisions were made and how they impact operations down the line.
Different mindsets, management methods, and monitoring tools make it difficult for teams to talk to each other, especially in strongly siloed environments where other service lines may not always be inclined to collaborate for many reasons (geographic separation, different approaches to implementation, disagreements on how technology stacks are to be optimally configured, etc.).
It’s important for senior management to understand these complexities. They should nominate advocates within their teams to engage with other service lines and work together to ensure that all teams know who the subject matter experts are in specific areas. As a result, operations teams should know which resources must be brought in during incidents requiring cross-team collaboration to drive resolution.
4. Correlation and Context
Incidents and downtime are very often the consequence of long-standing, neglected issues (often because of alert fatigue): warnings that evolve into critical conditions until, unfortunately, it’s too late to intervene and remediate the issue before a system, application or manufacturing process goes down.
Proactive detection is essential. Without it, Enterprise IT Monitoring is reduced to a collection of KPIs for weekly, monthly or yearly reporting. Note that simply having alerts in place is not enough, both because of alert fatigue and because, in modern complex environments, it’s unrealistic to expect teams of engineers to sit and read email alerts.
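As a minimal illustration of what acting on a warning before it becomes critical can mean, the sketch below fits a straight line to recent samples of a metric and estimates how long remains before it reaches a critical threshold. It assumes the metric rises roughly linearly and that the threshold is an upper bound. The function name and the sample values are hypothetical, and real systems generally use more robust models.

```python
def hours_until_threshold(samples, threshold):
    """Estimate hours from the last sample until the trend hits `threshold`.

    `samples` is a list of (hour, value) pairs with at least two
    distinct hours. A least-squares line is fitted through them.
    Returns None when the fitted trend is flat or decreasing, and 0.0
    when the trend line has already reached the threshold.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = cov_xy / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_hour = (threshold - intercept) / slope
    last_hour = max(x for x, _ in samples)
    return max(crossing_hour - last_hour, 0.0)
```

For example, disk usage samples of 70, 72, 74 and 76 percent taken one hour apart, with a 90 percent critical threshold, give an estimate of about seven hours, which is time an operator can use to act before the warning turns into an outage.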
A modern Enterprise IT Monitoring system must not pride itself simply on how many things it is able to monitor (this is taken for granted), but on how it helps bring value to customers. Proactive detection must be context-based and correlated: operations teams must be presented with actionable information and with hypotheses about potential links between seemingly unrelated issues.
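A very simple form of context-based correlation is to group alerts from different CIs that share an attribute and arrive close together in time. The sketch below assumes each alert is a dictionary with `ts`, `ci` and `site` keys and uses a five-minute window; both the alert shape and the grouping rule are assumptions made for illustration, and commercial correlation engines rely on considerably richer signals.

```python
def correlate(alerts, window_seconds=300):
    """Group alerts that share a site and arrive close together in time.

    `alerts` is a list of dicts with 'ts' (seconds), 'ci' and 'site'.
    An alert joins the most recent group for its site when it arrives
    within `window_seconds` of that group's latest alert; otherwise it
    starts a new group. Groups are returned in order of creation.
    """
    groups = []
    latest_by_site = {}
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        group = latest_by_site.get(alert["site"])
        if group is not None and alert["ts"] - group[-1]["ts"] <= window_seconds:
            group.append(alert)
        else:
            group = [alert]
            latest_by_site[alert["site"]] = group
            groups.append(group)
    return groups
```

Each resulting group can be presented to operators as one candidate issue with its member alerts attached, instead of as a stream of unrelated notifications.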
5. Deploy an AIOps Solution
AIOps solutions are the next generation of Enterprise IT Monitoring systems. They were built to address the problems highlighted above and are well suited to IT Monitoring in complex enterprise environments.
Moogsoft AIOps uses complex algorithms and performs real-time monitoring over all enterprise IT systems. It integrates tightly with existing monitoring solutions and ITSM systems, overarches them and provides a single point of entry to the customer for analysis, collaboration and resolution of issues. Noise is instantly cut to present only the relevant facts.
Thanks to its powerful Algorithmic Correlation Engine (ACE), Moogsoft AIOps correlates seemingly unrelated issues and puts them into context, by presenting customers with Situations. Situations are an aggregated view of multiple monitoring events which have a common root cause, thus reducing noise and helping IT teams to address the issue immediately.
Veteran IT Operations people will not shed a tear when told that they can wave goodbye to never-ending (and often forking) email threads, multiple concurrent invites for “all hands on deck” resolution calls, finger-pointing, and the hunt for the right SMEs for any given issue. Moogsoft AIOps provides collaborative workflows that allow various teams to work together in a per-situation shared space. Instead of bringing in all team members, AIOps can intelligently invite the best-suited individuals or teams to take care of a specific situation.
Over time, AIOps builds upon the experience and steps taken during previous Situations to further refine detection, improve correlation, bring in the right people and suggest resolution steps to the resolving teams.
AIOps is not only radically transforming the Enterprise IT Monitoring industry, but also shaking the ground by steeply improving the way IT operations teams work. It bridges the gap between heterogeneous IT Monitoring systems, makes sense of seemingly unrelated issues well before human operators can establish a causal link, cuts down the noise and presents operators with relevant & actionable information.
Because monitoring is often seen as a capability and not a core competency, some decision makers tend to challenge improvements in IT Monitoring and take a hard line in assessing cost savings. When encountering this kind of resistance, it’s a good idea to analyze the IT operational history of the company (especially involving major incidents) and ask:
- What did our last X major incidents cost us? (consider the financial, reputational and regulatory impacts)
- Could we have been more proactive, seeing developing issues and acting on them before it was too late?
- Could we have prevented downtime, or at least minimized it?
- Could we have diminished or eliminated the impact to the business, including financial, reputational, and regulatory impact?
Based upon the replies to those questions, it should be clear that Moogsoft AIOps provides the necessary answers to avoid major incidents. Automated alert correlation coupled with complex algorithms cuts through the noise and presents relevant information to a company’s subject matter experts.
The correlation of seemingly isolated issues and their contextual presentation into Situations is able to proactively inform the relevant stakeholders before Alerts turn into Incidents, and before downtime occurs. Finally, AIOps helps improve MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve) scores, reduces or entirely eliminates downtime, and delivers a greater value to business.
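To track whether MTTA and MTTR actually improve, both can be computed from incident timestamps. The sketch below assumes each incident record carries `created`, `acknowledged` and `resolved` datetimes (hypothetical field names) and measures MTTR from creation to resolution, which is one common convention among several.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_field, end_field):
    """Average elapsed minutes between two timestamp fields."""
    return mean(
        (incident[end_field] - incident[start_field]).total_seconds() / 60
        for incident in incidents
    )


def mtta_minutes(incidents):
    """Mean Time to Acknowledge: creation to acknowledgement."""
    return mean_minutes(incidents, "created", "acknowledged")


def mttr_minutes(incidents):
    """Mean Time to Resolve: creation to resolution."""
    return mean_minutes(incidents, "created", "resolved")


# Illustrative records only; the values are invented for the example.
SAMPLE_INCIDENTS = [
    {
        "created": datetime(2024, 1, 8, 9, 0),
        "acknowledged": datetime(2024, 1, 8, 9, 5),
        "resolved": datetime(2024, 1, 8, 10, 0),
    },
    {
        "created": datetime(2024, 1, 9, 14, 0),
        "acknowledged": datetime(2024, 1, 9, 14, 15),
        "resolved": datetime(2024, 1, 9, 14, 40),
    },
]
```

Comparing these figures for the periods before and after a change in tooling or process gives decision makers a concrete basis for judging its effect.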
IT Operations managers who get acquainted with these profound changes will quickly see the long-term value and immediate improvements this can bring to an organization. Decision makers can see that, beyond the technical benefits, AIOps improves the responsiveness and agility of their organization, as well as its capacity to deliver quickly and with quality. These are core assets in today’s fiercely competitive world, where IT is at the center of any company’s capabilities.
About the author: Max Mortillaro