Continuing our “season” of blogs, we’re asking our readers: “Why are you still using Netcool when it’s a 25-year-old technology?”
Of course, some readers will say (and being a geriatric myself, I tend to agree with them) that “age doesn’t matter.” But since Netcool no longer delivers the benefits and value that its marketing literature claims, we thought it pertinent to note that there is an alternative for you – it’s called Incident.MOOG. Here’s what you should know:
- At Moogsoft, we understand exactly why Netcool is so sticky for you (Why? Because we invented it!)
- Moogsoft can offer a path to “salvation” which:
  - Doesn’t add to your costs (accountants rejoice!)
  - Doesn’t add risk (executives rejoice!)
  - Doesn’t require change of your existing operations and support processes or organization (employees rejoice!)
  - Gives you all the wonderful benefits of Moogsoft by providing a Single Pane of Glass across your Brownfield IT, Shadow IT and Greenfield IT (customers rejoice!)
So if Fault and Service Management salvation is what you seek, the following is our “straw model” approach to enable you to migrate from “Informationally Blind” siloes to “Situationally Aware” operations processes, all with a more efficient and collaborative workflow.
Now, let’s all rejoice!
How Netcool Works (and Most Other Event Management Systems)
So you have Netcool or BMC Event Manager (BEM) or Truepoint, HP OpenView Operations Manager or Operations Bridge, EMC-SMARTS, etc. That means you will have a specific set of Incident processes, and:
- you’ll be using “machines” (Netcool, for example) to aggregate and filter your incoming events to display only the alerts which operators should be assessing.
- you’ll be using humans, underpinned by Process Auditing tools – let’s call them “Trouble Ticketing Systems” (e.g. ServiceNow, BMC Remedy, CA-Service Manager, HP Service Manager, Jira, etc.). These tools are used to track operations performance and trigger SLA compliance.
(Figure 1. Legacy fault management steps with IBM Tivoli Netcool)
Your support operators will work to an Incident workflow – which has two common forms in use across the industry (with some slight nuances), either:
- One team assesses Alert Lists in real time, trying to see if there are multiple alerts which relate to the same issue, then creating tickets for next-level technology teams to action, or
- Many first level operations teams assess their own alerts in real-time and attempt to identify issues, which they then create tickets for.
The Trouble Ticketing system has become the tool which reports on Support operations’ performance, which is measured from the time the ticket is created. In this way, Operations staff are not deemed to have begun to Act until invited to do so from the ticket.
Note: I realize this post is supposed to be explaining a method of migration from legacy Fault Management processes to a Situation-based approach, but it’s probably worth a little deviation into the problem with the existing process first, to show you why moving away from these legacy Event Management tools is critical to improving customer experience and operations efficiency.
So this is how things work with Netcool and other systems:
- Events are received by an Event Management system.
- Rules in the Event Management system may add inertia to the presentation of those Events to operators (i.e. suppress the Alert until it has occurred 3 times in 20 minutes, only show flip-flopping Alerts if they occur three times in 10 minutes, etc.) – in turn artificially increasing the time to detect actionable issues.
- Events are eventually presented to the Operator View. Operator Views are often segmented into Filtered Views where Alerts for each technology are presented in their own discrete view. This makes it very hard for operations to infer relationships across silos. However, at this point, we do have Alerts which operators can assess.
- Operations staff assess the Alerts in their specific Filtered Views (Unix looks at Unix, Windows looks at Windows, Storage looks at Storage, SAP looks at SAP – you get it). These folks must monitor their specific Alert lists to see whether something is happening. Often, Event storms or downstream failures (of which operators of an upstream silo are unaware) make it hard for operators to work out which Alerts should be actioned, further adding to the time to detect an issue. At this stage, however, an operator will assess the Alerts and, on seeing an issue they recognize, create a ticket; thus, the issue is detected.
- The ticket will now be assigned to an operator. Now, the operator can act.
- The operator has to investigate the issue, logging into devices, looking at log files, looking at Performance Trends, running diagnostics scripts, etc. The operator will also be blissfully unaware of any issues in technologies upstream of their domain, and often equally unaware if the same issue is impacting multiple similar functions (several applications all reporting the same issue, several devices all reporting the same issue, etc.). As a result, multiple tickets relating to the same issue are often created for multiple operators, both in the same team and in other teams.
- More often than we would like, we all end up in War Rooms where we all point fingers (or the majority of the people stay silent). And in the end, we finally resolve the issue.
(Figure 2. IBM Tivoli Netcool fault management process lacks ‘Situational Awareness’)
As you can see in the above, operations teams are artificially increasing their time to detect and remediate issues through using rules-based alert management and siloed operations processes which lack “Situational Awareness.”
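To make that inertia concrete, here is a minimal sketch of the “3 times in 20 minutes” style of suppression rule described above. It is not Netcool rule syntax; the class name `ThresholdRule`, the helper `detection_delay` and the alert key are hypothetical, chosen only to show how such a rule delays (or entirely hides) an alert.

```python
from collections import defaultdict, deque


class ThresholdRule:
    """Hypothetical rule: surface an alert only after `count` events within `window_s` seconds."""

    def __init__(self, count=3, window_s=20 * 60):
        self.count = count
        self.window_s = window_s
        self._seen = defaultdict(deque)  # alert key -> timestamps of recent events

    def offer(self, key, ts):
        """Record an event; return True if the alert should now be shown to operators."""
        times = self._seen[key]
        times.append(ts)
        # Drop events that have fallen out of the sliding window.
        while times and ts - times[0] > self.window_s:
            times.popleft()
        return len(times) >= self.count


def detection_delay(rule, key, timestamps):
    """Seconds between the first event and the moment the rule surfaces the alert, or None."""
    for ts in timestamps:
        if rule.offer(key, ts):
            return ts - timestamps[0]
    return None


# Events at 0, 5 and 15 minutes: operators first see the alert 15 minutes in.
delay = detection_delay(ThresholdRule(), "link-down:router1", [0, 300, 900])
# Events 25 minutes apart never satisfy the rule, so the alert is never shown.
hidden = detection_delay(ThresholdRule(), "link-down:router2", [0, 1500, 3000])
```

In this sketch the fault exists from the first event, but the rule withholds it from the Operator View until the third one arrives, which is exactly the artificial detection delay discussed above.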
The Good News – There Are Other Options to Netcool
Imagine, for a moment, a world where:
- Your Alert Management System could tell you if there is an issue occurring that you should react to – no more assessing scrolling lists of alerts?
- Your Alert Management System could quickly show you your relationship to the issue – causal or collateral – helping you work out whether you need to remediate or enact business continuity actions (“Situation Awareness,” as we call it)?
- Your Alert Management System could help you diagnose the causality of the issue more quickly by showing you all the alerts which relate to the issue and, if you need access to other tools, make them available to you with the click of a button?
- Your Alert Management System could capture and recycle the remediation knowledge, allowing you to quickly remediate future issues using that previous knowledge?
If any of the above scenarios appeal to you, then you should read on, because this is exactly what Moogsoft offers – meaning, earlier detection of actionable issues, earlier action (by the appropriate stakeholders), faster diagnosis, and faster remediation.
(Figure 3. Moogsoft optimizes the event management process by leveraging what machines and humans do best)
The Moogsoft approach is slightly different. We still use a combination of machines and humans – however, the capabilities of the machines are increased and, in turn, the capabilities of the humans are increased as well.
(Figure 4. IBM Tivoli Netcool process vs. Moogsoft fault management process)
Through our unique technology, Incident.MOOG, Moogsoft can help you reduce your mean time-to-detect issues, reduce your mean time-to-remediate issues, and reduce the number of actionable issues to process – particularly where an issue manifests itself across many silos. In addition, according to some of our customers (who have used Moogsoft to enable their existing Major Incident Escalation Processes), we can even help remediate issues before services and applications are impacted.
Could IT Event Management Really Be This Simple?
Getting any new technology product to work is really not that difficult. Getting that product to produce value is a whole other ball game. The majority of investment in operations tools actually comes from the User Processes and not the technology itself.
As anybody knows, doing process transformation right carries a huge cost penalty. Add in multiple parties (or different companies in the case of outsourcing some levels of operations) and SIAM, and the costs – and change inertia – increase.
Being the original inventors of IBM Tivoli Netcool, we understand the investment you have made in your processes, and that’s why we have created a method of adoption which leverages your existing investment in Netcool User Processes and then helps you migrate gracefully from those processes to a more efficient, agile process which offers much earlier warning of issues. Ultimately, our method of adoption is meant to allow you to migrate off the drug that is Netcool.
The Netcool Migration Journey Starts Here
No doubt, there is some part of your service delivery fabric which is prompting the need to implement more valuable processes:
- New technology whose events cannot be parsed into Netcool because you are unable to work out which ones are important, and which would flood operators with alerts for which they have no knowledge article/runbook (so-called “Shadow IT”).
- Frequent changes to the infrastructure or applications, which make it hard to maintain the filters and rules that help operators pick out important alerts in Netcool.
- A migration to Cloud or Elastic / Software-Defined Services, where single faults do not cause impact. Instead, the performance and capacity degradations that lead to application and service failures are caused by millions of possible combinations of coincident Fault conditions, which are impossible to model and therefore impossible to monitor for.
Whatever the case, you will have an operations team which cannot leave the drug called Netcool simply because there is a risk that they will be blind to Faults.
So the Moogsoft Migration Method (shall we call it “Mmm” – sounds soothing doesn’t it!) works with your investment in Netcool processes, not against it, by taking a simple three step approach:
- Integrate Moogsoft into the Netcool Process
- Transform the Netcool Process
- Migrate from the Netcool Process
Simple, huh? Here’s exactly how it works:
How to Easily Integrate Moogsoft into the Netcool Process
At a conceptual level, here is the Netcool driven process to Detect issues:
(Figure 5. IBM Tivoli Netcool legacy approach to incident detection)
The value of Incident.MOOG lies in its ability to detect actionable issues earlier and offer “Situation Awareness” to the stakeholder operations teams (both causal and collateral parties). The premise of the product is that it ‘push notifies’ the appropriate operators that there is a situation occurring that they should be aware of.
But since your operators aren’t used to looking at Incident.MOOG, and you do not want (or need) the additional costs of having your Operations folks looking at two tools (which, let’s face it, is almost impossible to make work), we integrate the Moog functionality into the Netcool process.
(Figure 6. Moogsoft functionality can be integrated into IBM Tivoli Netcool process)
You can use Incident.MOOG to take events aggregated into Netcool, process those events, infer situations (anomalies in real-time without training or models) and then present those situations as a “Situation Alert” back into Netcool.
In Netcool, you can have a specific Filter View for Moogsoft’s Situation Alerts. When a Moogsoft Situation Alert is presented in Netcool, you can create a ticket for it. See, in this situation we have already reduced the mean time-to-detect issues with no changes to the existing Netcool processes or workflow.
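As a rough illustration of the loop described above (events in, situations inferred, one “Situation Alert” out), here is a minimal sketch. It does not reproduce Incident.MOOG’s algorithms, which this post does not describe; simple time-proximity grouping is swapped in as a stand-in, and every name here (`Alert`, `Situation`, `group_by_time`, the silo labels, the record fields) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str   # hypothetical silo label, e.g. "unix" or "storage"
    summary: str
    ts: float     # arrival time in seconds


@dataclass
class Situation:
    id: int
    alerts: list = field(default_factory=list)

    def as_situation_alert(self):
        """Flatten the situation into one alert-shaped record for a dedicated filter view."""
        return {
            "type": "SituationAlert",
            "situation_id": self.id,
            "first_seen": min(a.ts for a in self.alerts),
            "alert_count": len(self.alerts),
            "silos": sorted({a.source for a in self.alerts}),
        }


def group_by_time(alerts, gap_s=120):
    """Stand-in grouping: an alert joins the current situation if it arrives
    within gap_s seconds of the previous alert, otherwise it starts a new one."""
    situations = []
    for alert in sorted(alerts, key=lambda a: a.ts):
        if situations and alert.ts - situations[-1].alerts[-1].ts <= gap_s:
            situations[-1].alerts.append(alert)
        else:
            situations.append(Situation(id=len(situations) + 1, alerts=[alert]))
    return situations


incoming = [
    Alert("storage", "latency high", 0),
    Alert("unix", "io wait high", 30),
    Alert("sap", "response slow", 90),
    Alert("windows", "disk full", 1000),
]
situation_alerts = [s.as_situation_alert() for s in group_by_time(incoming)]
```

The point of the sketch is the shape of the output: three alerts from three separate silos collapse into a single record that can sit in its own Filter View, so an operator sees one cross-silo item to ticket instead of three unrelated rows.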
When an operator is then assigned to a ticket (using your favorite ticketing system), they click the Moogsoft Situation Room URL in the ticket and enter the Situation Room – a virtual Incident Room. Any content and remediation stage information produced in the Situation Room is then automatically synced back to the ticket, meaning your “Operations Process” auditing reports do not need to be changed at all.
Here is where it starts to get interesting: if you trust Incident.MOOG to create “Appropriate Tickets” for the operations teams which need to be aware of a Situation, the next stage is a simple, logical step – automatically create the ticket from Moogsoft!
(Figure 7. IBM Tivoli Netcool users can automatically create tickets within Moogsoft)
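The ticket handling described above (one ticket per Situation, a Situation Room URL in the ticket, and Situation Room content synced back) might be sketched as follows. This is an in-memory stand-in rather than the API of any real ticketing product or of Moogsoft; `TicketDesk`, its methods and the placeholder URL are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    id: int
    situation_id: int
    situation_room_url: str
    work_log: list = field(default_factory=list)


class TicketDesk:
    """In-memory stand-in for a ticketing system (hypothetical, not a real product API)."""

    def __init__(self, base_url="https://moog.example.invalid/situations"):
        self.base_url = base_url      # placeholder URL, an assumption
        self._by_situation = {}       # situation id -> Ticket

    def ticket_for(self, situation_id):
        """Create a ticket for the situation, or return the existing one (no duplicates)."""
        if situation_id not in self._by_situation:
            self._by_situation[situation_id] = Ticket(
                id=len(self._by_situation) + 1,
                situation_id=situation_id,
                situation_room_url=f"{self.base_url}/{situation_id}",
            )
        return self._by_situation[situation_id]

    def sync_note(self, situation_id, author, text):
        """Copy a Situation Room comment into the ticket's work log for auditing."""
        ticket = self.ticket_for(situation_id)
        ticket.work_log.append(f"{author}: {text}")
        return ticket


desk = TicketDesk()
first = desk.ticket_for(42)
again = desk.ticket_for(42)   # same situation, same ticket
desk.sync_note(42, "alice", "restarted storage controller")
```

Keying tickets on the situation identifier is one simple way to keep auto-created tickets from being duplicated, and appending Situation Room notes to the ticket’s work log is what lets existing audit reports keep working unchanged.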
You’ll have started slowly with your Incident.MOOG transformation journey, picking certain applications and services, so your operators will still be creating tickets manually for domains or applications or services which are not processed by Moogsoft – meaning a combination of Moogsoft’s auto-created tickets and Netcool manually-created tickets.
But most importantly, these tickets will not be duplicated because now you will begin to migrate all of your Netcool aggregated events into Moogsoft. So when all your technology operating domains are transitioned to Incident.MOOG, you can migrate away from the Netcool UI and into the Moog UI!
(Figure 8. Once technology operations are transitioned to Moogsoft, you can transition away from the Netcool UI to Moogsoft UI)
And then you can begin the journey to migrate from the Netcool Probes to the Incident.MOOG LAMs. The important thing here is that you are making this migration in your own timeframe, and at no point putting at risk your ability to monitor your existing fabric.
(Figure 10. Moogsoft gives you visibility into Shadow IT, Brownfield IT and Greenfield IT events to achieve 360° Situational Awareness)
And your implementation architecture will look like this as you migrate away from (and ultimately turn off) Netcool!
(Figure 11. IBM Tivoli Netcool is then turned off and users are migrated to Moogsoft to achieve a proactive, situation-aware and collaborative fault management process)
So, there you have it – a smooth transition from reactive, alert-centric, linear processes to proactive, situation-aware, collaborative processes!