IT is pretty complex, and getting more so every year as we add more and more layers of abstraction between users and machinery. Each of those layers (hardware, operating systems, hypervisors, container platforms, cloud platforms, middleware, PaaS, applications) needs to be monitored and managed to detect and resolve any faults that occur. Increasing numbers of operations teams are coming to the realization that the old management tools and techniques simply can’t keep up with this increased complexity, or with the accelerated rate of change that comes with it.

To give you an idea of the event volumes a modern IT infrastructure can generate, one of our customers was trying to process more than one hundred million events per day.

The only way to deal with those sorts of numbers is to use intelligent software to help with assembling a picture of what is actually relevant, instead of subjecting human operators to the firehose of event and alert data. Instead of trying to use static rules to filter individual alerts to understand whether they are useful or not, this approach leverages mathematical and machine-learning techniques to sift those events and put them into context, identifying the small number of problems that all those alerts are the symptoms of.

The result for that customer was a straight 99%+ reduction in event volume by filtering noise, and then correlation of the remaining, significant alerts into a couple of hundred incidents per day. Operations specialists who had just thrown up their hands in horror when faced with tens of millions of events were able to manage a couple of hundred actionable situations easily.
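To make the filter-then-correlate idea concrete, here is a deliberately naive sketch in Python. It is not how Incident.MOOG works: it swaps the mathematical and machine-learning techniques for two simple stand-ins (exact-match deduplication and a fixed time window), and the Event fields, function names, and the 60-second window are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since an arbitrary start (assumed unit)
    source: str       # hypothetical field: host or service that raised the event
    check: str        # hypothetical field: what was being checked


def deduplicate(events):
    """Collapse repeats of the same (source, check) pair into one alert.

    Returns a dict mapping (source, check) -> (first_seen, repeat_count).
    """
    alerts = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.source, ev.check)
        if key in alerts:
            first_seen, count = alerts[key]
            alerts[key] = (first_seen, count + 1)
        else:
            alerts[key] = (ev.timestamp, 1)
    return alerts


def correlate(alerts, window=60.0):
    """Group alerts into incidents by time proximity.

    An alert joins the current incident if it was first seen within
    `window` seconds of the previous alert in that incident; otherwise
    it starts a new incident. This is a stand-in heuristic only.
    """
    incidents = []
    ordered = sorted(alerts.items(), key=lambda kv: kv[1][0])
    for key, (first_seen, _count) in ordered:
        if incidents and first_seen - incidents[-1]["last"] <= window:
            incidents[-1]["alerts"].append(key)
            incidents[-1]["last"] = first_seen
        else:
            incidents.append({"start": first_seen, "last": first_seen, "alerts": [key]})
    return incidents


if __name__ == "__main__":
    # Invented sample data: six raw events from three sources.
    sample = [
        Event(0.0, "db01", "disk"),
        Event(1.0, "db01", "disk"),
        Event(2.0, "db01", "disk"),
        Event(5.0, "web01", "http"),
        Event(500.0, "mail01", "queue"),
        Event(505.0, "mail01", "queue"),
    ]
    alerts = deduplicate(sample)
    incidents = correlate(alerts)
    print(len(sample), "events ->", len(alerts), "alerts ->", len(incidents), "incidents")
```

Even this toy shows the shape of the pipeline: many raw events collapse into fewer alerts, and alerts that occur close together are presented as one situation. Everything it leaves out (alerts that repeat across incidents, relationships other than time, out-of-order data, scale) is where the real effort goes.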

Going it Alone (Bad Idea)

We at Moogsoft have been pioneers in this space, but other companies are beginning to talk the same talk (although it has yet to be proven whether they can actually walk the walk). However, the field of machine learning is still young, which means that definitions and boundaries can be vague and changeable. As a result, from time to time, we talk to people who are thinking of trying to build something like Incident.MOOG with generalist tools such as Hadoop.

There is a reductionist tendency in IT which, at its extreme, manifests as “bah, I could do that with a shell script!” This mindset focuses on the overlap between different areas, rather than the specialization. To take an example outside our own field, why would the same organisation use Salesforce, ServiceNow, and JIRA, when there will be substantial overlap between the data stored in each tool? The reason is that different groups use different tools for different purposes, and trying to force all of them into a single tool will lead to a hodge-podge that does not really suit anybody’s needs.

In the same vein, I always advise people to resist the temptation to roll their own event management solution. This is not just self-interest, trying to sell one more license of Incident.MOOG! In any case, our prices are extremely reasonable — you can find basic pricing right on our website, but if you would like a more exact quote, call us up and we’ll go through the options together.

Rather, the reason not to try to build an in-house custom solution is that, for almost all companies, this would be a terrible idea. First of all, those generalist platforms are just that — general. The difference between buying a turnkey solution and trying to build your own is sometimes seen as being akin to buying pre-built furniture versus assembling it from a flat pack. There are so many excellent open-source components available that it can seem very simple on paper to put together an architecture.

The reality is more like being handed a couple of logs and a set of carpentry tools. Sure, the wood is high-quality and well-seasoned, and the tools are ergonomic and made from fine steel — but there is a lot of skill required to turn out anything halfway fit for purpose. Making it look nice often ends up as a very secondary consideration.

We ourselves take the same approach to the ingredients, using a number of open-source components such as RabbitMQ or Sphinx rather than reinventing the wheel. We focus our development efforts on the areas where we can make a difference, and we would advise operations departments considering a development project to do the same. Let Moogsoft build the algorithms and the platform, and you focus on the process that is specific to your organisation.

The second factor to consider is the technical debt that is incurred by choosing to build a tailor-made solution. It can be fun to put together a fantasy architecture on a whiteboard or a slide, but the reality is that building a usable tool is a long hard slog, with lots of unglamorous messing around with different representations of strings or dates or whatever. All of the hours required to do that will need to be “stolen” from other projects and activities, which brings opportunity costs — what else could the team be doing if they weren’t building a machine-learning tool?

The drain on resources doesn’t stop once the tool is built and deployed; rather, it has barely begun. There is always a new version of this, a change in the format of that, or a new feature required for the other. There will be performance bottlenecks, race conditions, and the sorts of weird unforeseen bugs that you only find in real-world conditions, and never in testing. By the way, you didn’t forget to include testing and QA in your project plan, did you?

All of this care and feeding is ongoing and never ending; you can never call the tool done and stop development — not if you expect to continue using it, anyway. This is exacting and not particularly exciting work, but it absolutely has to be done.

Leave it to the Professionals (Solid Strategy)

The bottom line is that this is hard graft, and it takes a decent-sized team with good levels of expertise. Getting Incident.MOOG to where it is today took a team of dozens of world-class developers four years and counting. They don’t ever stop, either; we do a release every three weeks or so, and each one includes roadmap features, enhancement requests from users, and, yes, bug fixes (occasionally one does get through, despite our best efforts!).

There are vanishingly few companies for whom this sort of internal effort would make sense — or even be feasible. You are almost certainly better off taking advantage of the hard work we have done over the years, and the work we continue to do on top of that.

This doesn’t mean that you are stuck in a one-size-fits-all world; in fact, the early stages of a commercial engagement with us will look very similar to a well-run development project, with extensive requirements gathering and validation of key performance indicators. If it turns out that you do need something very specific, one of those features that we build and maintain is our SDK, which allows users to customize our product and access our functionality programmatically however they want.

If you want to get a feel for what Incident.MOOG can do for you, you can request an evaluation here, or contact us to arrange a consultation about your specific requirements. Just please don’t jump into developing your own solution; we’ll both regret it.
