A selection of live questions and answers from the audience of our recent webinar on how IT operators can best leverage intelligent observability to reduce noise, alleviate pager fatigue and switch from a reactive to a proactive workflow.
IT Operations teams are often the bedrock of the digital business, ensuring that processes and services keep humming smoothly as developers continue to evolve and increase customer value. But increasingly complex systems can flood them with alerts that keep operators from doing their best work and paving the way for new, innovative services.
The second installment of Coffee Break with Helen Beal featured a chat between the host and Dave Casper, Moogsoft’s CTO, that covered how intelligent observability helps in a myriad of ways, including minimizing unplanned work, reducing noise, collaborating effectively and finding time for innovation.
The full webinar discussion is available on demand. Below is a selection of live Q&A from the episode.
Q: We talk a lot about being able to build better trust with the users. But what steps can we take to build better trust in IT operators?
Casper: That’s a really great question because we talk a lot about IT operators being people who are constantly putting out fires. These are always the people who care a lot, who put a lot of time and effort into wanting to make sure that they are doing their job very well, that they are not missing things, and that their own end users —often users of the business apps running within the organization— are getting the service they want. If the monitoring misses something, it’s a real psychological hit. They feel like they’ve let someone down.
Conversely, if we start getting more contemporary systems in place that can take away some of that firefighting by letting the machines actually fight some of those fires, you allow these operators to do some of the alerting cleanup, automation or refining they’ve been wanting to do for years. This allows IT operators the time to build better trust by letting the apps team see that they’re catching more stuff, proactively. This goes a long way.
This is also really about the metaphor of being able to fight machines with machines. When you invest in modern tools you give your people the power to do so, and avoid fighting machines with people, because people will just burn out.
Q: What are some of the recommendations in limiting what to measure and at what frequency?
Casper: I’ll answer this question with regard to time-series metrics. The answer really depends on the nature of the system. There’s no wrong or right answer for time series data because what the AI is doing is just learning what is normal behavior for that metric. So if a metric comes in every 60 seconds or every one second, or some other time frequency, the AI by design is agnostic to what that thing is and why it’s coming in at a certain time period or not. It is just simply measuring what it has seen and what it is working out to be normal. Then, when something deviates from that, the AI calls that out.
So, the recommendation is to worry less about what the frequency needs to be, and just know that the system can work out on its own when something is anomalous.
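To make the idea of "learning what is normal" concrete, here is a minimal Python sketch of a baseline detector. It is not Moogsoft's algorithm; it assumes a simple rolling mean and standard deviation, and the class name, window size and threshold are hypothetical choices for illustration. The baseline is built from the last N samples rather than from wall-clock time, so the same code behaves the same way whether a metric arrives every second or every 60 seconds.

```python
from collections import deque
from statistics import mean, stdev


class BaselineDetector:
    """Flags values that deviate from a rolling baseline (illustrative only)."""

    def __init__(self, window=30, threshold=3.0, min_samples=10):
        # The window counts samples, not seconds, so the detector does not
        # care how often the metric is reported.
        self.history = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value):
        """Return True if value is unusual compared with recent history."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = stdev(self.history)
            if sigma > 0:
                anomalous = abs(value - mu) > self.threshold * sigma
            else:
                # A perfectly flat history: any change at all is unusual.
                anomalous = value != mu
        if not anomalous:
            # Assumption of this sketch: anomalies are kept out of the
            # baseline so a single spike does not widen the normal range.
            self.history.append(value)
        return anomalous


detector = BaselineDetector()
samples = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50, 51, 50, 95, 50]
flags = [detector.observe(v) for v in samples]
anomalous_values = [v for v, flagged in zip(samples, flags) if flagged]
print(anomalous_values)
```

A production system would likely also handle seasonality and gradual level shifts, which this sketch deliberately ignores.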
Q: How would you design the ‘deep dive’ of mapping alerts to a common business app, focus on the greater ‘view’ and address the typical drag to MTTR? For example: Internally? Using a CMDB? Or an app discovery product?
Casper: All of the above, especially in larger enterprises. I remember clear as day (circa 2007) when ITIL first really hit big and every organization was building its CMDB. Those CMDBs are at best partially complete and partially accurate. That doesn’t mean they’re not usable. Some organizations have teams that still spend a lot of effort keeping those up to date because it’s their system of record for all devices they have in the estate.
Most important to this question is the authoritative mapping of which CIs support one or more applications, and that doesn’t go away. If you have an outage being correlated and 20 different CIs are part of that outage, a CMDB is the thing that tells you what those CIs map up to.
As we get towards more automated orchestration systems such as an operational management database, or OMDB (a system that is not really entered by humans or updated by automated discovery tools, but is in place for the IT or for the networks to actually work, such as a DNS system), those systems can actually be tied into your correlation as well.
So the combination of a CMDB and an OMDB is still the best way of knowing which one or more services are impacted when an outage happens.
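As a rough illustration of that mapping step, the Python sketch below looks up the CIs involved in a correlated outage against two sources and reports which services are impacted. Everything here is hypothetical: the `CMDB` and `OMDB` dictionaries, the CI names and the `impacted_services` function are stand-ins, since real CMDBs and operational systems expose this data through their own interfaces.

```python
from collections import defaultdict

# Hypothetical CMDB extract: configuration item -> applications it supports.
CMDB = {
    "db-prod-01": ["billing", "crm"],
    "web-prod-07": ["billing"],
    "lb-edge-02": ["billing", "crm", "intranet"],
}

# Hypothetical OMDB-style data derived from operational systems such as DNS.
OMDB = {
    "cache-prod-03": ["crm"],
}


def impacted_services(outage_cis, *sources):
    """Map the CIs in a correlated outage to the services they support.

    Returns (service -> set of CIs, set of CIs found in no source).
    """
    impact = defaultdict(set)
    unmapped = set()
    for ci in outage_cis:
        services = [s for source in sources for s in source.get(ci, [])]
        if not services:
            # Partially complete records are expected, so track the gaps.
            unmapped.add(ci)
        for service in services:
            impact[service].add(ci)
    return dict(impact), unmapped


impact, unmapped = impacted_services(
    ["db-prod-01", "cache-prod-03", "switch-lab-09"], CMDB, OMDB
)
for service in sorted(impact):
    print(service, sorted(impact[service]))
print("unmapped:", sorted(unmapped))
```

Tracking the unmapped CIs separately reflects the point that a CMDB can be incomplete and still useful.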
Q: How are large organizations making the transition to the world of DevOps and ITOps? How are they managing the change? How successful have they been and what are some of the recommendations that you might have on this topic?
Beal: At the risk of getting a little bit lofty, I’ll start by saying that it’s no different than any other large paradigm shift. When Cloud first happened, there were teams that did something different, there were also teams that just lifted and shifted, and the reality is that large organizations are always going to have a mix of all of the above. Most large organizations have hundreds if not thousands of app teams, and large orgs have five to ten thousand applications with individual teams supporting them. What was normally done was a recognition that there is a new way of doing things, like moving away from traditional waterfall methods, to keep up with the competition. If the competitor is releasing things faster or with more stability, they’re going to make the money and we’re going to lose the money.
So, it would be taking a six-month waterfall of developer teams that go live on a Friday night with a monitoring team looking after that, and implementing DevOps, because with the advent of Cloud it’s really easy to have the ops layer automated. But even with code-as-code and infrastructure-as-code, someone still has to monitor it.
What we’ve found is that developers first writing an app are pretty good about doing some level of instrumentation. And that’s helpful, but in practice, they get bored of it pretty quickly because they want to spend time developing code, and move onto the next feature. Then we see people saying “let’s add an SRE team to look after measurements, improvements and monitoring.”
The biggest piece of advice, based on seeing this in lots of large organizations, is that in some ways, agile happens really quickly and all the lines of business see that they’re finally free from central IT slowing them down. They spin everything up, but then find out that all of the real-world problems still exist. Things still break even if they’re in the cloud. Applications still have mistakes no matter how good a developer you are. Outages happen, so they start reinventing all of the same support processes. At some point, some smart CIO realizes that all the lines of business are basically reinventing what was previously a shared model.
If you are a large organization, you’ve probably had at least seven-plus years of teams shifting into DevOps. What’s really important to remember is you need to let teams get value out of tools that are simple to use, so they can get the app out the door fast enough. Remember, at the end of the day, getting the business app out faster than the competitors is really all that matters.
Q: What are the best frameworks for developers to utilize to provide key monitoring data for DevOps? Would you recommend Grafana, Kibana, Prometheus or others?
Casper: Those are the well-known ones for getting the actual metrics, and while I’m kind of biased here, you also want a smart tool that sits across or atop this. Prometheus, for example, has got thousands of metrics, which are wonderful, but they’re often local to that Kubernetes instance.
When you have multiples of those in a large organization, you need both something that can tell you when there is an issue with more than one of them and something to compensate for the fact that nobody has time to sit and read through screen after screen of charts of thousands of Prometheus issues. This is why I think that, while those are the right tools to get the instrumentation and to get the data, we recommend also leveraging an AIOps tool that can sift through that data and surface the important stuff.
Q: In describing the AI, you say that it can understand performance over time and alert you to what is abnormal. How is this different or better than what predictive analytics vendors told us 15 years ago?
Casper: There is a slight difference between forecasting, which is something that might be done with linear regression, and real-time detection, which flags something unusual that is happening right now. Like everything from years ago compared to now, any system that’s doing forecasting has improved and the information is better.
Specifically, what we’re talking about here isn’t even related to forecasting or prediction. It has to do with looking at live streaming data, measuring it in real-time, and the second that it goes out of the ‘normal’ range, the tool notifies or alerts the operator. Rather than a tool that says “in three hours something is going to be unusual,” we’re talking about a tool that says “right now this is unusual, and by the way, these 10 other things are also unusual,” and will then correlate them.
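The "these other things are also unusual" step can be pictured with a small Python sketch that groups anomalies occurring close together in time. This is only one simple, assumed form of correlation (time proximity); it is not a description of how Moogsoft or any other product correlates events, and the function name, the 60-second gap and the sample data are all hypothetical.

```python
def correlate_by_time(anomalies, max_gap=60):
    """Group (timestamp_seconds, source) anomalies into clusters.

    An anomaly joins the current cluster if it occurs within max_gap
    seconds of the previous anomaly; otherwise it starts a new cluster.
    """
    groups = []
    for ts, source in sorted(anomalies):
        if groups and ts - groups[-1][-1][0] <= max_gap:
            groups[-1].append((ts, source))
        else:
            groups.append([(ts, source)])
    return groups


# Hypothetical anomalies already flagged by a real-time detector.
anomalies = [
    (1045, "queue-depth"),
    (1000, "db-latency"),
    (5000, "disk-usage"),
    (1020, "api-error-rate"),
]
groups = correlate_by_time(anomalies)
for group in groups:
    print([source for _, source in group])
```

With this sample data, the three anomalies around the 1000-second mark end up in one group and the much later disk-usage anomaly stands alone, so an operator would see two situations rather than four separate alerts.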
Some additional nuance that doesn’t come up a lot: when you have systems that can surface, from real-time streams of data, the stuff that is more deserving of attention rather than letting it get missed because there’s too much data, that doesn’t mean only fault data. It doesn’t always have to be something that has broken.
We find in practice that often surfacing unusual things early means surfacing the warnings. Those warnings would very often have been missed because the IT shop is busy always putting out fires. What happens is management teams tend to ask their NOC L1/L2 operators to go into “manage by severity” mode, which means to do all the criticals first, then if you have time, come back and look at the warnings. That’s always reactive. That’s always putting out fires.
So while it’s not rocket science to figure out that if you can catch a warning early, you can be proactive and prevent things, it’s important. It really comes down to the quality of surfacing important information in real-time.
Q: With AI offerings becoming an integral aspect of many ITOM and APM solutions out there, what are the things that make Moogsoft stand out?
Casper: First and foremost, I’m familiar with all of the well-known tools out there and I can say that the math at Moogsoft is our wheelhouse. It is something that we stand behind and that we have pioneered since before the term AIOps was even coined by Gartner. We have multiple algorithms built into our solutions.
Sometimes we come across prospects or deployments where someone might think that there’s just one algorithm —by the way, this is not just true of Moogsoft, but of any tool— especially with the prevalence of things like TensorFlow or Amazon Kinesis. People get used to the idea that AI is one thing: I’m going to train a model and a model is going to do something.
For any strong AIOps solution, including Moogsoft, the math has to be solid, the algorithms have to be solid, and you need more than one algorithm across the entire lifecycle of the data coming in. Some algorithms might be doing unsupervised machine learning to surface things that are unusual, some might be doing supervised machine learning where the system learns from its operators, and others might be doing completely different things. So, I think it’s the quality of the math and the multitude of algorithms that differentiate Moogsoft.
Coffee Break with Helen Beal will continue throughout April. The next session will focus on more ways intelligent observability impacts the lives of DevOps teams (April 8). RSVP now to save your seat.
About the author
David is Moogsoft's Director, PR and Corporate Communications. He's been helping technology companies tell their stories for 15 years. A former journalist with the Sacramento Bee, David began his career helping the Bee's technology desk understand the rising tide of dot-com PR pitches clouding journalists' view of how the Internet was to transform business. An enterprise technology PR practitioner since his first day in the business, David started his media relations career introducing Oracle's early application servers and developer network to the enterprise market. His experience includes client work with PayPal, Taleo, Nokia, Juniper Networks, Brocade, Trend Micro and VA Linux/OSDN.