We wanted to show a bit of how Create Your Own Integration (CYOI) can be used to ingest events or metrics from sources that do not have an official integration in Moogsoft. Here we highlight the use case that our SRE team had for Moogsoft, with details on how we configured the system for our use case. Also, don't miss the official release post for our Prometheus Alertmanager integration that was built based on our SRE team's usage of CYOI.
Moogsoft supports several event sources natively, making it trivial to start ingesting events for enrichment and deduplication. Here at Moogsoft, we use the CloudWatch integration to ingest our CloudWatch alarms, but as a Kubernetes shop, most of our alerts come from Prometheus. We’ve seen great benefit from routing our Prometheus alerts through Moogsoft. Alertmanager is capable of some level of deduplication, but it’s not nearly as powerful or easy to configure as what Moogsoft offers.
By integrating Alertmanager with Moogsoft, we can leverage correlation as well as deduplication. Deduplication only detects the exact same event and folds it into an existing alert. Correlation, on the other hand, can take multiple distinct alerts and bucket them into a single incident. This way we can reduce the number of pages we send, regardless of how many namespaces or clusters are impacted.
While Alertmanager is not a native event source in Moogsoft, it was still an easy process to start routing those alerts into Moogsoft using the Create Your Own Integration feature.
It should also be noted that the official Alertmanager integration has been released, which should make this process even easier. We wrote our CYOI to inform the development of that integration, and felt that sharing our CYOI journey is still worthwhile. If you’re just interested in getting Alertmanager flowing into Moogsoft, check out our release post!
Configuring the Integration
To start ingesting events, we created an integration in Moogsoft using the Create Your Own Integration (CYOI) feature. Then we added the integration as a webhook to our receivers in Alertmanager. Without tweaking any settings, we were already seeing alerts show up as they happened.
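On the Alertmanager side, this is just a standard webhook receiver. A minimal sketch of the config we're describing, where the URL is a placeholder for the endpoint Moogsoft generates when you create the integration:

```yaml
# alertmanager.yml (excerpt) -- route alerts to the CYOI endpoint.
# The URL below is a placeholder: copy the real endpoint (and any
# required API key) from the integration you created in the Moogsoft UI.
route:
  receiver: moogsoft
receivers:
  - name: moogsoft
    webhook_configs:
      - url: "https://example.moogsoft.ai/integrations/custom/<integration-id>"
        send_resolved: true   # also forward resolve notifications
```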
By default, Alertmanager will send multiple alerts in a single payload, so we enabled batch processing on the “alerts” field. If you’ve configured your Alertmanager to send one alert at a time, this setting is not necessary.
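For reference, an Alertmanager webhook payload looks roughly like the following (label names and values here are illustrative). The batched alerts live under the top-level `alerts` array, which is the field we enabled batch processing on:

```json
{
  "version": "4",
  "status": "firing",
  "receiver": "moogsoft",
  "groupLabels": { "alertname": "HighErrorRate" },
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "HighErrorRate", "cluster": "prod-eks", "namespace": "payments" },
      "annotations": { "summary": "Error rate above threshold for 10m" },
      "startsAt": "2022-03-01T12:00:00Z"
    },
    {
      "status": "firing",
      "labels": { "alertname": "HighErrorRate", "cluster": "prod-eks", "namespace": "billing" },
      "annotations": { "summary": "Error rate above threshold for 10m" },
      "startsAt": "2022-03-01T12:01:00Z"
    }
  ]
}
```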
Configuring Field Mappings
The next step was to map fields in the Alertmanager events to Moogsoft fields that are unified across event sources. Moogsoft requires a set of standard data fields, which it uses to correlate and compare events received from all of your sources. In addition to the standard fields, you can configure custom fields as tags for any additional information that is relevant to your environment. To map a field, we specify both the path in the JSON event that leads to the source data and the destination field we want that data to go to. When mapping fields in batches, the source path is prefixed with “alerts[*]”.
Some field mappings were straightforward. When a label is applied consistently to every metric, we can map that label 1:1 with a Moogsoft field.
Complexity came from the fact that our Prometheus setup scrapes metrics from many different sources: Kubernetes state metrics, Kafka clusters run through an external provider, custom exporters for software we run, and more. Other than our external labels, there is no consistent labeling scheme. And there can’t be: the metrics are just too different.
The solution here was to map multiple labels to the same Moogsoft field, such that at least one label would exist for every metric, and to apply a default value to anything that still didn’t have any of those labels. For example, some of our alerts trigger for a specific namespace in Kubernetes, but all of our alerts have a “cluster” label describing the EKS cluster of the Prometheus installation. So we assigned both the “namespace” and “cluster” labels to the “Source” field, with “namespace” getting priority.
For the more specific “Service” field, we wound up mapping five different labels to it, as “service” can mean very different things for different systems. This process takes some experimentation, but robust field mappings allow Moogsoft to boil down a storm of Prometheus alerts into just the necessary information, so it’s worth it!
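As a sketch, our mappings looked roughly like this. Note this is not Moogsoft's mapping file format (the mappings are set through the CYOI UI), and the label names other than “namespace” and “cluster” are hypothetical stand-ins:

```yaml
# Illustrative sketch of our label-to-field mappings (configured in the
# CYOI UI, not written as a file like this). Earlier entries win.
Source:
  - alerts[*].labels.namespace    # preferred when the alert is namespaced
  - alerts[*].labels.cluster      # fallback: always present via external labels
Service:                          # "service" means different things per system,
  - alerts[*].labels.service      # so several labels map to the same field
  - alerts[*].labels.job          # (these label names are hypothetical examples)
  - alerts[*].labels.deployment
```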
Setting a Deduplication Key
The deduplication key is the first place Moogsoft generates value for your Alertmanager setup. Moogsoft automatically deduplicates any events with the same deduplication key, without any additional configuration. Depending on your setup, this may not change much, since Alertmanager can also deduplicate with configuration on its end. But with how many different alerts we had, using Moogsoft’s deduplication saved us a lot of effort, and as you’ll soon see, it extends beyond what Alertmanager is capable of on its own.
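Roughly, the key is a composite of the mapped fields you select. A sketch under our mappings (again, this is chosen in the UI, not written as a file, and the exact field set here is illustrative):

```yaml
# Illustrative: the deduplication key is a composite of mapped fields.
# Events whose composite key matches are merged into one alert.
deduplication_key:
  - source    # namespace, falling back to cluster
  - check     # mapped from the Prometheus alertname
  - class
```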
Normalizing Events with Workflow Engine
Once an event is ingested, the data is subject to a few alterations based on rules we set in Workflow Engine. We mostly use this to normalize fields that came from different sources or authors. For example, we want both “critical” and “page_sre_kafka” alerts to page the SRE team, so we set the severity of each to “Critical”. This barely scratches the surface of what Workflow Engine can do; our workflows will evolve as we get more experience with it.
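In pseudo-config form, the severity normalization above amounts to something like this (illustrative only; Workflow Engine rules are built in the Moogsoft UI, not in this syntax):

```yaml
# Sketch of our severity-normalization rule (not actual Workflow
# Engine syntax -- a plain-language rendering of what the rule does):
- name: normalize-paging-severities
  when: event arrives from the Alertmanager integration
  if: severity is one of ["critical", "page_sre_kafka"]
  then: set severity to "Critical"   # both should page the SRE team
```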
Configuring Correlation
This is where Moogsoft really shines. On its own, Alertmanager can group together alerts with matching labels, but what if the labels have slightly different values? What if a bad deployment causes a service to go down everywhere it’s deployed? What if an external dependency goes down, causing an alert to fire from every Prometheus you’re running? What if a node in your Kubernetes cluster goes bad and suddenly every pod running on it creates an alert? Maybe you have an Alertmanager configuration that covers some of these scenarios, but you probably don’t have them all covered. And if something isn’t deduplicating these alerts, you’re spending your time acknowledging pages and trying to figure out what they all have in common.
By setting up correlation in Moogsoft, you let it handle the hard part of turning a flood of alerts into a single incident. By setting similarity thresholds, we were able to find a combination of fields that can really determine whether alerts are related enough to group together. This is another process that will require some experimentation, but it’s very rewarding when you get it right.
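A correlation definition boils down to a set of fields, a similarity threshold per field, and a time window. A sketch with illustrative values (configured in the Moogsoft UI, not a file):

```yaml
# Illustrative correlation definition (all values are examples):
scope: alerts arriving within a rolling 15m window
correlate_on:
  - field: service
    similarity: 0.75   # fuzzy match, so "kafka-prod-a" and "kafka-prod-b" group
  - field: source
    similarity: 1.0    # exact match on namespace/cluster
```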
For more information on correlation see our docs for a deeper dive!
Routing to PagerDuty
The last step of configuration was another straightforward one. Whenever we get an alert that requires a response, we send it to PagerDuty to notify the on-call team member. We connected our PagerDuty account to Moogsoft using the native integration and created different outbound integrations for different PagerDuty services. Based on labels attached to the Prometheus alerts, we route different alerts to different PagerDuty services (and therefore different on-call rotations). This required no additional work on the Prometheus side, because we already needed that label for Alertmanager to do the same routing.
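The routing label is attached in the Prometheus alerting rule itself. A sketch, where the metric, thresholds, and the “team” label name are illustrative:

```yaml
# Prometheus alerting rule (excerpt). The "team" label already drives
# Alertmanager routing; the Moogsoft outbound integrations reuse it to
# pick the PagerDuty service. Names and values here are illustrative.
groups:
  - name: kafka-alerts
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_topic_partition_under_replicated_partition > 0
        for: 10m
        labels:
          severity: critical
          team: sre-kafka       # routes to the Kafka on-call rotation
        annotations:
          summary: "Kafka has under-replicated partitions"
```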
By adding Moogsoft to our alerting workflow, we greatly reduced the duplicate alerts we get when incidents happen, and the alerts we get now are more meaningful. Additionally, we now only have to look in one place to see Alertmanager alerts grouped together with related alerts from logs, CloudWatch alarms, Pingdom, and other sources. To see what Moogsoft Observability Cloud can do for your environment, sign up for a free trial today.
About the author
Marshall Lang is an SRE at Moogsoft, specializing in AWS architecture and observability for applications running with modern infrastructure patterns. He works to keep everything running smoothly for the Moogsoft platform, and to ensure that concise, informative alerts happen as soon as they aren't. On the weekends, he'll usually be hiking or skiing one of Minnesota's many trails, depending on the season.