Removing the Prometheus from kube-prometheus
Where we started
I’m sure that, like many, I started my Kubernetes observability journey by rolling out the kube-prometheus[0] stack. It’s a giant collection of preconfigured resources that just work. No need to learn the intricacies of label rewrites, recording rules, or configuring the necessary RBAC. On top of that, you get an amazing collection of alerts and dashboards - it’s great, it just works!
A good while goes by and you’re chugging along, building and deploying your applications. Alerts are firing, you deal with them. You’re using those great dashboards to evaluate the resource usage of your workloads and are able to optimize your limits and requests.
And then, it doesn’t work anymore. Your cluster got busy, and the cardinality of even basic metrics is starting to go through the roof. You’re faced with feeding your Prometheus instances more and more memory, or turning down the retention to a measly day or two. It’s a balancing act of maintaining visibility and keeping resource usage under control. You’ve figured out that Prometheus doesn’t quite scale horizontally.
I’m grossly oversimplifying the operational process that a lot of teams go through over the course of a couple months, if not longer. All things considered though, these are pretty common problems with Prometheus at scale.
A look around reveals you’re not the only person running into this problem, and solutions such as Thanos[1] and Grafana Mimir[2] offer some relief. At this point you’re able to supplement your kube-prometheus stack with a long-term storage solution to off-load your Prometheus instances to a large degree.
In essence, all long-term storage solutions fix a bloated Prometheus instance by enabling you to turn the retention way down and move the data elsewhere. Some amount of the resources allocated to Prometheus is transferred to, say, the Mimir pods, but at least these components are able to scale out horizontally.
These scalable solutions typically consist of a read and a write path, and there’s plenty of information out there on them, but if you’re unfamiliar with them at a high level, check out the Mimir architecture page[3] as an example.
What’s all in Prometheus
I explicitly mention the read and the write path because I think it’s time to have a look at what Prometheus does.
In essence, a Prometheus instance contains:
- The WAL, to store samples that haven’t made it into a completed TSDB block yet,
- the TSDB for complete blocks,
- a query engine,
- a nice GUI that lets you plot some basic graphs.
It’s a great all-in-one package. The WAL and TSDB are our write path, and the query engine is our read path that’s accessible from the GUI. I assume most of you mainly access Prometheus data through Grafana though.
Here’s the big question: considering we’re using Grafana to visualize, and have deployed Mimir to flush metrics to object storage, how many of those Prometheus features do we really still need?
Let’s see:
- The WAL,
- the TSDB,
- a query engine,
- a nice GUI that lets you plot some basic graphs.
Oh.
The Grafana Agent
After a good while of using Prometheus, I think a lot of its value was the intuitive metrics format, combined with the simplicity of the HTTP pull mechanism. This combination has subsequently become the de facto metrics monitoring standard for Kubernetes and plenty of other things.
Killer feature number two, at least over in Kubernetes land, was the all-in-one, community-driven kube-prometheus collection of resources. But at this point, how can we replace our Prometheus instance with a simpler, more condensed solution whilst keeping that nice all-in-one package relatively intact?
The guys over at Grafana Labs released the Grafana Agent[4] a good while ago now, and it fits the bill perfectly. It has two deployment modes, dubbed static and flow mode, and I think deploying flow mode is the way to go at this point. Flow mode lets you configure so-called components[5] in a Terraform-like configuration language to create metrics, logging or tracing pipelines.
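To give you an idea of the syntax, here’s a minimal sketch of a metrics pipeline. The target address and remote write URL are made-up placeholders rather than anything from a real setup:

```river
// Scrape a single hard-coded target and forward the samples on.
// The address below is a made-up placeholder.
prometheus.scrape "example" {
  targets    = [{"__address__" = "127.0.0.1:12345"}]
  forward_to = [prometheus.remote_write.example.receiver]
}

// Ship whatever the scrape component forwards to a remote write endpoint.
// The URL is a placeholder as well.
prometheus.remote_write "example" {
  endpoint {
    url = "http://metrics.example.local/api/v1/push"
  }
}
```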
The component we’re particularly interested in is the prometheus.operator.servicemonitors one. It does exactly what it says on the tin, and pretty much implements Prometheus’ service discovery and scraping based on the ServiceMonitor format. There’s a pod monitor component as well.
These pod and service monitor components scrape targets exactly like Prometheus does, making them a drop-in replacement for a Prometheus instance. The agent constructs a WAL, and just a WAL: no more local TSDB. Finalizing the setup requires configuring a Prometheus remote write endpoint, for example a Mimir instance, and voilà: you’ve now gotten rid of the Prometheus in kube-prometheus!
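As a rough sketch of what that could look like in flow mode (the Mimir URL here is a placeholder and depends entirely on how you’ve deployed it):

```river
// Discover ServiceMonitor objects the way the Prometheus Operator would,
// scrape the matching targets and forward the samples onwards.
prometheus.operator.servicemonitors "default" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

// The remote write endpoint; this URL is a placeholder for wherever your
// Mimir (or other Prometheus-compatible) write path lives.
prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir-distributor.mimir.svc:8080/api/v1/push"
  }
}
```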
On top of being a nice drop-in replacement, the agent has some additional features such as sharding. If you deploy more than one agent, they form a hash ring and divide the scrape targets among themselves. This lets you spread your observability load over additional nodes in your cluster, whilst protecting yourself from losing all of your scraping at once.
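In flow mode this sharding goes by the name of clustering. As far as I’m aware, you enable it when running the agent (the Helm chart exposes a toggle for this) and then opt the scraping component in, roughly like so:

```river
// Same ServiceMonitor pipeline as in the previous sketch, but with
// clustering enabled so that multiple agents spread the discovered
// targets among themselves.
prometheus.operator.servicemonitors "default" {
  forward_to = [prometheus.remote_write.mimir.receiver]

  clustering {
    enabled = true
  }
}

// Placeholder remote write endpoint, as before.
prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir-distributor.mimir.svc:8080/api/v1/push"
  }
}
```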
Additionally, once implemented, the sky is the limit. The number of available Grafana Agent components allows you to start rolling new observability features into your monitoring stack without having to go back to the drawing board. The base workload is already there, after all.
[0] https://github.com/prometheus-operator/kube-prometheus
[1] https://github.com/thanos-io/thanos
[2] https://github.com/grafana/mimir
[3] https://grafana.com/docs/mimir/latest/get-started/about-grafana-mimir-architecture/
[4] https://github.com/grafana/agent
[5] https://grafana.com/docs/agent/latest/flow/reference/components/