If GitHub stars are any indication, Prometheus has been doubling in usage year over year since its inception. While at Moogsoft we love Prometheus as the metrics foundation of our observability platform, there were some challenges to overcome to make it the rock-solid piece of our stack it is today.
Prometheus is a fantastic time-series database (TSDB) for gathering and alerting on telemetry data. However, it is not a one-size-fits-all solution, and even in the areas where it works well, there are some difficulties to be aware of.
In this post, I want to highlight some areas where Prometheus may add friction to your workflow. We at Moogsoft have been using Prometheus for over two years as our primary tool for gathering metrics. We’ve made it work through a mix of our own tweaks, lessons learned from others, and improvements from the open-source community, which keeps making the project better with age. Additionally, we’ve built on top of that strong foundation with our own product, Moogsoft, for a true incident platform. Read on for more!
Discovery Outside of Kubernetes
Discovery in Prometheus (Prom) is both a blessing and a curse. For 99% of our use case at Moogsoft, which is Kubernetes, it works extremely well. Within Kubernetes, Prometheus can automatically scrape any pod or service you annotate to be scraped. This makes it extremely easy to monitor and gather details about ephemeral instances. All Kubernetes metrics are also gathered easily.
The difficulty comes when you are attempting to gather metrics from external sources. This is an area where I feel Prometheus may not be the right tool for everyone.
We also leverage a managed service provider for Kafka rather than run it in our clusters. This provider exposes a Prometheus metrics endpoint for us to scrape, but the IP address is not known before spin-up. After we spin up the cluster, we need to update the scrape configuration on Prometheus to pull the metrics. Since our clusters are fairly static, this hasn’t proven overbearing, so we have accepted the toil here.
This is an area where discovery tools have gotten better. With EC2, for example, you can leverage EC2 service discovery for instances that come and go in an Auto Scaling group (ASG). So while this area is improving, it’s not yet as clean as push-based systems that allow for caching. This should be taken into consideration when deciding if Prom is right for your stack.
Note: Yes, Prometheus has a push-based option, via the Pushgateway. However, it does change the construct of Prometheus into a fully push-based system. It can work, but I would highly recommend using this for edge cases on short-lived services, rather than as a standard.
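For an external endpoint like the managed Kafka cluster above, one way to reduce the manual step is Prometheus's file-based service discovery (`file_sd_configs`), where Prometheus re-reads a JSON file of targets when it changes. Below is a minimal sketch of a script that writes such a file once the address is known. The function name, file path, address, and labels are all hypothetical, and it assumes your scrape job already points at the file.

```python
import json
import os
import tempfile


def write_file_sd_targets(path, targets, labels):
    """Write a target file in the JSON shape file-based discovery expects.

    The file is written to a temporary name and then renamed, so Prometheus
    never reads a half-written file.
    """
    payload = [{"targets": sorted(targets), "labels": dict(labels)}]
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle, indent=2)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem
    return payload


if __name__ == "__main__":
    # Hypothetical address and labels, for illustration only.
    with tempfile.TemporaryDirectory() as workdir:
        target_file = os.path.join(workdir, "kafka_targets.json")
        write_file_sd_targets(
            target_file,
            ["10.0.0.12:9404"],
            {"service": "managed-kafka", "env": "prod"},
        )
        with open(target_file) as handle:
            print(handle.read())
```

A script like this could run as the last step of whatever automation spins up the cluster, which turns the manual update into part of the provisioning flow.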
Managing Alert Rules
Managing Alertmanager rules has become far simpler thanks to two things.
Prometheus Operator adds the ability to divide your rules into multiple files, rather than one large file. Having read through a rules file with 100 rules, I find this an extremely welcome change. By breaking out your rules, you can group them by service, by intention, or by severity. The world is your oyster. Check out the docs on merging in new rules.
Mixins are another great addition, and now that we have implemented them, we’ll never go back. They cover many of the general use cases for alerting in your system. They are also designed to be system agnostic, which is another feather in their cap.
We have also integrated Alertmanager with our own product, Moogsoft, to help with deduplication and correlation. This helps us reduce the number of callouts during an incident, while also helping us make more sense of the alerts we are receiving.
Metric Naming/Labeling Schemes
To be fair, this concern exists regardless of the TSDB that you choose. It’s still worth calling out, because getting a naming scheme documented early can prevent many headaches later. This is especially important for companies that have many teams working on many independent services.
By ensuring a strong adherence to naming and labeling schemes, you can simplify finding custom metrics generated by your applications.
This being said, there are libraries for many languages that will automatically generate metrics. These of course come with their own naming and labeling. Either be prepared to only enforce your scheme on custom metrics, or be prepared to manage relabel configurations.
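To make a naming scheme enforceable rather than just documented, a small check can run in CI against the custom metrics a service declares. The sketch below is only an illustration: the character-set pattern follows the classic Prometheus metric name rules, while the required service prefix and the list of allowed suffixes are a hypothetical house convention, not something Prometheus mandates.

```python
import re

# Classic Prometheus metric name character set.
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")

# Hypothetical house convention: names end in a unit or type suffix.
ALLOWED_SUFFIXES = ("_total", "_seconds", "_bytes", "_ratio", "_info")


def check_metric_name(name, service_prefix):
    """Return a list of problems with a custom metric name (empty if none)."""
    problems = []
    if not METRIC_NAME_RE.match(name):
        problems.append("not a valid Prometheus metric name")
    if name != name.lower():
        problems.append("use lowercase snake_case")
    if not name.startswith(service_prefix + "_"):
        problems.append("missing service prefix '%s_'" % service_prefix)
    if not name.endswith(ALLOWED_SUFFIXES):
        problems.append("missing unit or type suffix")
    return problems


if __name__ == "__main__":
    for candidate in ("payments_http_requests_total", "HTTPRequests"):
        print(candidate, check_metric_name(candidate, "payments"))
```

Metrics generated automatically by client libraries would not pass a check like this, which is why it makes sense to run it only against metrics your own code defines.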
PromQL
PromQL is a fantastic query language once you become familiar with it. The problem is ensuring that all of your engineers have the same basic understanding of how to use it. Grafana dashboards can cover much of this, but there will always come a time when a metric needs to be pulled manually. Personally, PromQL didn’t click for me until I read PromQL for Humans - I would highly recommend giving it a read!
Once you have wrapped your head around that, you then need to be prepared for questions on how Prom handles things like rate. This isn’t strictly about PromQL, but I figured I’d slip it into this section. Not everyone will need this, but it’s important that somebody knows how Prom evaluates queries, so they can answer questions when results come back with decimals even though no decimals are stored. PromLabs has written a great blog post on how rates work, and why you don’t always get what you’d expect.
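To show where those decimals come from, here is a simplified sketch of the kind of extrapolation Prometheus applies when computing an increase or rate over a window. It is loosely modeled on the documented behavior, not a copy of the real implementation: it leaves out details such as limiting extrapolation so a counter never goes below zero, and exact behavior varies between Prometheus versions. The function names are hypothetical.

```python
def extrapolated_increase(samples, range_start, range_end):
    """Estimate a counter's increase over [range_start, range_end].

    samples: list of (timestamp_seconds, counter_value), sorted by time and
    all inside the window. Returns None with fewer than two samples.
    """
    if len(samples) < 2:
        return None

    # Sum the increases, treating any drop as a counter reset to zero.
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        increase += value if value < previous else value - previous
        previous = value

    first_t, last_t = samples[0][0], samples[-1][0]
    sampled_interval = last_t - first_t
    average_gap = sampled_interval / (len(samples) - 1)

    def extension(gap):
        # Extend to the window edge if it is close, otherwise by half a gap.
        return gap if gap < average_gap * 1.1 else average_gap / 2

    covered = (sampled_interval
               + extension(first_t - range_start)
               + extension(range_end - last_t))
    return increase * covered / sampled_interval


def simple_rate(samples, range_start, range_end):
    """Per-second rate: the extrapolated increase divided by the window."""
    increase = extrapolated_increase(samples, range_start, range_end)
    if increase is None:
        return None
    return increase / (range_end - range_start)


if __name__ == "__main__":
    # Counter scraped every 15 s; it only ever holds whole numbers.
    scraped = [(5, 10), (20, 10), (35, 11), (50, 12)]
    print(extrapolated_increase(scraped, 0, 60))
    print(simple_rate(scraped, 0, 60))
```

In this example the counter only moved from 10 to 12, yet the estimated increase over the 60-second window is roughly 2.67, because the 45 seconds actually covered by samples are scaled up to the full window.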
Federation and Long-Term Storage
Running Prom was easy when we had a couple of clusters. That reality didn’t last long once we took Moogsoft out of beta. Once that happened, we needed federation, or we’d face the wrath of devs hopping between multiple Grafana instances to find what they needed.
Rather than federate back to a single centralized Prometheus, we decided to also look ahead to long-term retention. Enter Thanos. We have been federating our metrics using Thanos since October of 2020, and it has handled the past year of metrics without an issue. Storing in S3 has also greatly reduced our costs compared to a gp2 persistent volume (PV). It’s also a little more portable if the availability zone (AZ) your PV is in dies.
Thanos also introduces a simple solution for HA. Following in the footsteps of its namesake, Thanos will deduplicate metrics that come in from your Prometheus instances. This allows you to run two Prometheus instances per cluster while ensuring you aren’t storing the same metric twice for the long term. By doing this, you can make changes to Prom1 while Prom2 keeps on ticking, or take a full-blown outage in that AZ without worrying about metric loss.
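The idea behind that deduplication can be sketched in a few lines. This is a deliberately simplified illustration and not Thanos's actual algorithm: it assumes each Prometheus instance adds a label named `replica` (in Thanos the label name is configurable on the query side), groups series that are identical apart from that label, keeps the first replica's samples, and only fills in from the other replica where the first has a gap. The function names and the `max_gap` threshold are hypothetical.

```python
import bisect


def strip_replica(labels, replica_label):
    """Series identity: every label except the replica label."""
    return tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))


def fill_gaps(primary, secondary, max_gap):
    """Keep primary samples; add secondary ones only where primary is missing."""
    primary_ts = [ts for ts, _ in primary]
    merged = list(primary)
    for ts, value in secondary:
        i = bisect.bisect_left(primary_ts, ts)
        neighbours = primary_ts[max(i - 1, 0):i + 1]
        # Use the secondary sample only if no primary sample is nearby.
        if all(abs(ts - p) > max_gap for p in neighbours):
            merged.append((ts, value))
    return sorted(merged)


def deduplicate(series_list, replica_label="replica", max_gap=10.0):
    """series_list: list of (labels dict, [(timestamp, value), ...])."""
    merged = {}
    for labels, samples in series_list:
        key = strip_replica(labels, replica_label)
        if key in merged:
            merged[key] = fill_gaps(merged[key], samples, max_gap)
        else:
            merged[key] = sorted(samples)
    return merged


if __name__ == "__main__":
    # prom-1 misses scrapes between t=30 and t=90; prom-2 keeps going.
    prom1 = ({"__name__": "up", "job": "api", "replica": "prom-1"},
             [(t, 1.0) for t in (0, 15, 30, 90, 105)])
    prom2 = ({"__name__": "up", "job": "api", "replica": "prom-2"},
             [(t, 1.0) for t in range(5, 111, 15)])
    for key, samples in deduplicate([prom1, prom2]).items():
        print(key, [ts for ts, _ in samples])
```

The result is a single series per metric, with the second replica's samples used only to cover the stretch where the first replica was down.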
An Inevitable Conclusion
At the end of the day, Prometheus is an incredible tool that we lean on every day for understanding the internals of our systems. So while it’s important to be aware of the potential friction above, don’t let it scare you away. Again, any tool will have its idiosyncrasies. They are not typically a reason to avoid the tool, but rather something to understand so you can get the most out of it.
Hopefully the above helps you start or continue your journey with Prometheus. As I said, we have been using it for over two years, and we still learn more every day. Soon we will also be working on leveraging remote write to ship our metrics directly into Moogsoft in order to take advantage of anomaly detection. We’ll post more on that as the work gets completed, but if you want to try it yourself, sign up for a trial! You can start shipping your own metrics from Prometheus into Moogsoft using remote write directly to our API.
I hope that you learned something new today, or at least gained some insight into some areas of Prometheus. Anything else you’d like to hear about? Questions? Or want to share your own experience? Reach out on Twitter @moogengineering. We’d love to hear from you!
About the author
Thom Duran is a Director of SRE at Moogsoft, where he leads a team of SREs that are focused on building the platform for Moogsoft Observability Cloud, as well as spreading best practices to enable a DevOps culture. Focusing on stability and automation has been part of Thom’s life for the past decade. He got his start in the trenches of a traditional NOC. From there he moved into more traditional SRE roles focused on availability and monitoring at GoDaddy. At Moogsoft Thom’s goals are always about driving efficiency, making sure people love what they do, and that people have a place where they can excel.