During our recent webinar, “Managing Large, Complex IT Operations Projects,” Moogsoft Chief Evangelist Richard Whitehead received lots of great questions from the audience. Due to time constraints, his answers had to be brief. So we thought it would be well worth taking a second look at these questions and diving in a bit deeper. Richard has kindly let me share the microphone, so to speak – so you can hear my thoughts on these topics.
Q. Are you reliant on the SMARTS root cause to create your Situations?
A. SMARTS was conceived when a single root cause always mapped to a resolvable incident. Today that is no longer true. The inventors of Incident.MOOG pioneered Root-Cause as a concept, but recognized in 2011 that its day was done. Incident.MOOG detects abnormal behavior as a contextual narrative expressed as a cluster of Events (a Situation). The stakeholders in the Situation can quickly work out whether they are the causal or the impacted/collateral party and react accordingly. Typically Incident.MOOG detects actionable issues at least 30 minutes before existing tools and processes, all without the need to resort to models of previous behavior.
Q. And if you want as close to a raw event stream as possible, how does that work with SMARTS? Are you pulling out the complete topology and status from the SMARTS system?
A. Incident.MOOG just needs the raw timeline of the Event data. Alerts that have already been de-duplicated and refined can sometimes obscure what is really going on. We have ways of recreating the raw events post SMARTS or Netcool, but it is often easier to start from scratch.
Q. Do you need to have a normalized input in order for Incident.MOOG to perform?
A. Yes and no. Incident.MOOG performs best with consistent timestamps (i.e. UTC) along with common token and end of record delimiters per source feed, or better still with JSON encapsulating the content of the message. We use that to create our internal event objects. If pre-formatted events are available that can reduce the time to implement Incident.MOOG, but today we have a whole library of pre-created integrations.
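As an illustration of what "consistent timestamps plus JSON" can look like in practice, here is a minimal sketch that turns one delimited feed record into a JSON object with a UTC timestamp. The record layout (`timestamp|host|message`), the function name and the field names are assumptions made for this example; they are not Incident.MOOG's internal event format.

```python
import json
from datetime import datetime, timezone


def normalize_line(line, delimiter="|"):
    """Turn a 'timestamp|host|message' record into a JSON string in UTC."""
    # Split on the first two delimiters only, so the message may contain more.
    raw_ts, host, message = line.strip().split(delimiter, 2)
    ts = datetime.fromisoformat(raw_ts)
    if ts.tzinfo is None:
        # Assumption for this sketch: timestamps without an offset are UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    event = {
        "timestamp": ts.astimezone(timezone.utc).isoformat(),
        "source": host,
        "description": message,
    }
    return json.dumps(event)


record = normalize_line("2015-03-01T14:05:00+02:00|web01|Disk usage at 95%")
print(record)
```

Converting at the edge like this means every source feed arrives on the same timeline, whatever local time zone the sending system uses.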
Q. With an incomplete inventory, how can MOOG determine the impact of a problem accurately?
A. It’s all about the data that’s clustered. For example, if the cluster of Events includes a VM or server and the incomplete CMDB has a hostname-to-application mapping for it, then Incident.MOOG can decorate the Situation with the impacted application. The same is true for Telco CPEs etc. What we do not do is rely upon a topology to generate rules that create our clusters. And because we capture all nodes that have participated in a Situation, there is a greater chance that we can find the impact in the CMDB than if we just had a SMARTS-like root cause with a single node.
Q. You’ve mentioned natural language processing (NLP) a few times. Can you provide an example?
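The decoration idea can be sketched in a few lines. In the example below, the CMDB is a plain dictionary and a Situation is a list of event dictionaries with a `host` field. Both shapes, and the function name, are assumptions for illustration only.

```python
def impacted_applications(situation_events, cmdb):
    """Return the sorted applications for every clustered host the CMDB knows."""
    hosts = {event["host"] for event in situation_events}
    # Hosts missing from the (incomplete) CMDB are simply skipped.
    return sorted({cmdb[host] for host in hosts if host in cmdb})


# Hypothetical, deliberately incomplete hostname-to-application mapping.
cmdb = {"web01": "online-store", "db01": "online-store", "mail01": "email"}

situation = [
    {"host": "web01", "description": "HTTP 503 rate high"},
    {"host": "db01", "description": "Connection pool exhausted"},
    {"host": "sw-17", "description": "Port flapping"},  # not in the CMDB
]

apps = impacted_applications(situation, cmdb)
print(apps)
```

Because every node in the cluster is checked, one known host is enough to name the impacted application even when other participants are missing from the inventory.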
A. Incident.MOOG uses techniques of unsupervised and supervised machine learning. How it performs that is the subject of our numerous patent applications, but essentially we leverage techniques of semantic analysis and linear algebra to understand the “meaning” of the event messages. No Rules are used to describe the type and usage of the language.
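To make "semantic analysis and linear algebra" a little more concrete, here is a generic textbook illustration: event messages are turned into word-count vectors and compared by cosine similarity. This is not Incident.MOOG's algorithm, which is not described here. It only shows how message text can be compared numerically without hand-written rules. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(message):
    """Represent a message as a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", message.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


m1 = vectorize("Link down on interface eth0")
m2 = vectorize("Interface eth0 link is down")
m3 = vectorize("User login succeeded")

# Messages about the same fault score higher than unrelated ones.
print(cosine_similarity(m1, m2), cosine_similarity(m1, m3))
```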
Q. Let’s say there is a situation where the lines of business (LOB) don’t know how they actually use IT/IS and it leaves the IT department to guess. There are no CMDBs so everyone guesses what a critical service is to the LOB. You said Incident.MOOG can figure critical events from 8 down to 1. How can the tool do this without human dataflow analysis/understanding and documentation?
A. Let’s separate what we mean by Critical. There are Events which have Severities and there are Applications which have Severities/Priorities. Traditional Event Management systems rely on the Severity of an Event as designated by the Agent or Element Management System. However, in many cases these Event Severities are very badly categorized, especially when it comes to Syslog, Application Logs, Microsoft, etc. As you will know, this leads to organizations missing very important Event information, because their Netcool operators only look at Critical Severity Alerts. So with Incident.MOOG, unless you explicitly tell the system otherwise, it ignores Severity. There are several layers of algorithms within the Incident.MOOG platform. The first layer is the “Cleaning” algorithms. These assess each new Event in near real time, as it is received, in order to work out its level of ‘significance’. In this way, Incident.MOOG is able to filter out noise or irrelevant Events automatically, with no rules, so Incident.MOOG does automatically what all your thousands of rules in Probe Rules Files do. Then we come to Application Priorities. If (as discussed in an earlier response) Incident.MOOG has decorated a Situation with the impacted Application, then that Application name can be queried against a Service Catalog or other database in real time and the impact importance of the Situation can be set automatically. Even better, individual ‘count-down’ SLAs can be triggered, indicating to support staff how many minutes they have before they will exceed their SLA for the given application or service.
Q. When I was in charge of a Netcool deployment, the SYSLOG & MTTRD probe rules I maintained were quite lengthy (over 3000 lines). Does MOOG help streamline these?
A. The reason that the rules files are so lengthy is that Netcool maintains a detailed model of the system, which it uses to encapsulate alert and event escalation in that set of rules. Incident.MOOG still needs to map the source events to our internal event format, but after that all of the logic in the rules files is dispensed with. Typically a configuration file for our syslog LAM is measured in tens of lines, not thousands.
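The ‘count-down’ SLA idea amounts to a catalog lookup followed by simple date arithmetic. The sketch below assumes a hypothetical in-memory service catalog keyed by application name. The catalog contents, field names and function name are invented for this example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical service catalog: application -> priority and SLA in minutes.
SERVICE_CATALOG = {
    "online-store": {"priority": 1, "sla_minutes": 30},
    "email": {"priority": 3, "sla_minutes": 240},
}


def minutes_remaining(application, opened_at, now):
    """Whole minutes left before the SLA is exceeded (negative once exceeded).

    Returns None when the application is not in the catalog.
    """
    entry = SERVICE_CATALOG.get(application)
    if entry is None:
        return None
    deadline = opened_at + timedelta(minutes=entry["sla_minutes"])
    return int((deadline - now).total_seconds() // 60)


opened = datetime(2015, 3, 1, 12, 0, tzinfo=timezone.utc)
now = datetime(2015, 3, 1, 12, 12, tzinfo=timezone.utc)
print(minutes_remaining("online-store", opened, now))
```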
Q. Does Incident.MOOG have an OID2 rules converter?
A. Incident.MOOG can read any standard SNMP MIB, and converts OIDs automatically.
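For readers unfamiliar with OID conversion, the sketch below shows the general idea using a tiny hand-written table and a longest-prefix match. The handful of entries are names from the standard SNMPv2-MIB and IF-MIB. A real converter would load them by parsing MIB files, which is not shown, and the function name is hypothetical.

```python
# A few well-known OIDs from the standard SNMPv2-MIB and IF-MIB.
OID_NAMES = {
    "1.3.6.1.2.1.1.3": "sysUpTime",
    "1.3.6.1.2.1.2.2.1.1": "ifIndex",
    "1.3.6.1.2.1.2.2.1.2": "ifDescr",
    "1.3.6.1.6.3.1.1.5.1": "coldStart",
    "1.3.6.1.6.3.1.1.5.3": "linkDown",
    "1.3.6.1.6.3.1.1.5.4": "linkUp",
}


def resolve_oid(oid):
    """Replace the longest known OID prefix with its name, keeping any suffix."""
    parts = oid.strip(".").split(".")
    for end in range(len(parts), 0, -1):
        prefix = ".".join(parts[:end])
        if prefix in OID_NAMES:
            suffix = parts[end:]
            name = OID_NAMES[prefix]
            return name + ("." + ".".join(suffix) if suffix else "")
    # Unknown OIDs are returned unchanged.
    return oid


print(resolve_oid("1.3.6.1.2.1.2.2.1.2.7"))
```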
Q. Does Incident.MOOG perform SMTP to event conversion?
A. Yes, we have a Mail LAM. LAMs (or Link Access Modules) are our ingestion components, and the Mail LAM can process mail as a source of events.
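To show what mail-to-event conversion involves in general, here is a minimal sketch using Python's standard `email` package. It is not the Mail LAM itself. The sample message, the event field names and the function name are assumptions for illustration.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical alert mail from a monitoring tool.
RAW_MAIL = """\
From: monitor@example.com
To: noc@example.com
Subject: [web01] Disk usage at 95%
Date: Sun, 01 Mar 2015 14:05:00 +0200

/var is 95% full on web01.
"""


def mail_to_event(raw):
    """Map the headers and body of a plain-text mail to an event dictionary."""
    msg = message_from_string(raw)
    # Normalize the Date header to UTC so all sources share one timeline.
    sent = parsedate_to_datetime(msg["Date"]).astimezone(timezone.utc)
    return {
        "timestamp": sent.isoformat(),
        "source": msg["From"],
        "summary": msg["Subject"],
        "description": msg.get_payload().strip(),
    }


event = mail_to_event(RAW_MAIL)
print(event)
```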
Q. How are the services defined/identified? And how is customer impact determined? Events don’t always mean impact.
A. Incident.MOOG is designed to detect ‘abnormal behavior’ that you should take a look at, without a model or previous history. These are what we call Situations. If that Situation includes Events which contain identifying content that is matched to applications or service names in a CMDB or other database then the impact can be determined automatically without the need for a traditional Business Service Model.
Q. What happens when there is no Netcool?
A. We aggregate directly or work with your other tools (syslog, Splunk, SolarWinds, homegrown tools, Log4j, etc.).
Q. If Netcool exists in the infrastructure, is there a difference in the integration between pre-IBM Netcool and post-IBM Netcool?
A. We have a standard integration into the Netcool Gateway (with a special filter to re-duplicate the Alerts). This will work with any version of Netcool.
Q. What can you tell us about the Incident.MOOG installation process in a triple play operator with 3 million customers?
A. Well, we have a great example of that in a PTT in Europe where they tried to bring together three Netcool systems covering Mobile, Transmission and IP using a multi-tier Netcool. Didn’t work! We put Incident.MOOG across all sources of the triple play. Deployment is not complex with Incident.MOOG, however when it comes to integrating mobile networks, it’s useful to have some location-based data within the content 🙂
Q. Is Incident.MOOG’s position always above every other event source, or are there configurations where it sits below?
A. The beauty of Incident.MOOG is that it needs no understanding of the managed entities and no model of the infrastructure. That means you are free to change and be as agile as you like without the need to maintain models or codebooks or topologies or Event content. Just add new content into Incident.MOOG.
Q. In the installation process is there full autodiscovery?
A. Incident.MOOG is not dependent on Topology.
Q. Are there preset problem detections based on experience gained from the network?
A. No. That’s the point: Incident.MOOG uses unsupervised and supervised machine learning to determine Situations. That lets Incident.MOOG adapt to changing infrastructures such as elastic compute, cloud, software defined networks (SDN) and virtual infrastructures.
Q. Can Incident.MOOG fully replace Netcool in a given environment, or is the logical thing always to place it above Netcool?
A. Incident.MOOG can replace Netcool. However, if you keep Netcool, the logical placement is above Netcool. If you are a managed service provider who needs to notify your customers (who have Netcool or another tool), then Incident.MOOG can create Situation Alerts which can be input to Netcool, BMC Remedy, ServiceNow or other tools. Then you can bring your customers into the specific Incident.MOOG Situation Room, where you can keep them informed of the status of the Situation that impacts them.
Q. Is there a customer-end sort of data gathering tool?
A. Yes, you can involve customers within the Incident.MOOG Situation Room, but you can keep them separate from the support/resolution dialog and if multiple customers are impacted, keep those customers separate from each other.
Q. Why did the Micromuse Netcool use the double-decker bus icon?
A. Netcool used a Double Decker Bus icon because the company was originally “Omnibus Transport Technologies Limited”…then became Micromuse! That company was founded by Moogsoft’s founders – 21 years ago.