PAVEL GLUKHIKH

Engineering Note

Prometheus cardinality control: find it, kill it

Hunting high-cardinality Prometheus series with tsdb analyze and topk queries, then killing them with relabel drops, limits, and recording rules.


TL;DR

Prometheus memory and query cost scale with active series, not samples — one label carrying unbounded values (pod hash, request ID, user ID) can double a server's footprint overnight. This is the loop I run: measure with promtool tsdb analyze and the TSDB status API, confirm offenders with count-by queries, then kill them with metric_relabel_configs drops, sample limits as a guardrail, and recording rules so dashboards stop needing the raw series at all.

Measure before you delete

Prometheus dies by series count, not sample rate, so the first job is finding out where the series actually come from — not guessing. Start with what the TSDB itself reports. On the server:

promtool tsdb analyze /var/lib/prometheus/data
# or a specific block:
promtool tsdb analyze /var/lib/prometheus/data 01JKJ5Z0Q8XYZ...

The sections that matter: label names with highest cumulative label value length, highest cardinality labels, and highest cardinality metric names. The same data lives in the UI under Status → TSDB Stats, and in the API:

curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'
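
To see each metric's share of the head rather than a bare count, the same response can be ranked in a few lines of Python. This is a minimal sketch, assuming the payload carries data.headStats.numSeries and data.seriesCountByMetricName as the TSDB status endpoint documents; the top_metrics helper and the sample numbers are made up for illustration.

```python
import json


def top_metrics(payload, n=10):
    """Return (metric, series, share of head series) for the n largest metrics."""
    data = payload["data"]
    total = data["headStats"]["numSeries"]
    rows = sorted(
        data["seriesCountByMetricName"],
        key=lambda row: row["value"],
        reverse=True,
    )
    return [
        (row["name"], row["value"], row["value"] / total if total else 0.0)
        for row in rows[:n]
    ]


# Made-up payload in the shape of /api/v1/status/tsdb; the numbers are illustrative.
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "headStats": {"numSeries": 1000},
    "seriesCountByMetricName": [
      {"name": "container_tasks_state", "value": 150},
      {"name": "apiserver_request_duration_seconds_bucket", "value": 400},
      {"name": "up", "value": 50}
    ]
  }
}
""")

for name, series, share in top_metrics(SAMPLE, n=3):
    print(f"{name:<45} {series:>6} {share:6.1%}")
```

Against a live server, the curl output can be piped in and read with json.load(sys.stdin) in place of SAMPLE.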

Then confirm with PromQL, but carefully: queries about cardinality are themselves expensive. Run them ad hoc, not in dashboards:

# Top metrics by series count
topk(20, count by (__name__)({__name__!=""}))

# How many values does one suspect label have on one metric?
count(count by (id) (container_memory_usage_bytes))

# Churn: series created recently (restarted/redeployed workloads)
sum(rate(prometheus_tsdb_head_series_created_total[10m]))

The usual convicts: histogram _bucket families, kube-state-metrics with every label passed through, and anything where a developer put a request ID, user ID, or path-with-parameters into a label.

Kill it at ingestion

metric_relabel_configs runs after the scrape, before storage — this is where you drop what you’ll never query:

scrape_configs:
  - job_name: kubelet-cadvisor
    sample_limit: 50000        # guardrail: scrape fails loudly instead of flooding
    metric_relabel_configs:
      # Drop whole metric families you don't use (example names; a rule only matches metrics this job exposes)
      - source_labels: [__name__]
        regex: 'apiserver_request_duration_seconds_bucket|container_tasks_state'
        action: drop
      # Strip a noisy label (only when it can't collapse series into duplicates)
      - regex: 'pod_template_hash'
        action: labeldrop

Two rules I hold to:

  1. Drop metrics, not labels, when in doubt. A labeldrop that collapses two series into one produces out-of-order/duplicate-sample errors that are worse than the cardinality.
  2. Set sample_limit on every job you don’t own. A failed scrape with prometheus_target_scrapes_exceeded_sample_limit_total incrementing is an alert; ten million surprise series is an outage. label_limit and label_value_length_limit are the same idea for label abuse.
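
The first rule can be checked before a labeldrop ships. The sketch below is a hypothetical helper, not part of Prometheus: it takes label sets copied from a scrape (the sample data here is invented) and reports which series would become identical once a given label is removed.

```python
from collections import defaultdict


def labeldrop_collisions(series, label):
    """Return groups of series that become identical once `label` is removed."""
    groups = defaultdict(list)
    for labels in series:
        remaining = frozenset((k, v) for k, v in labels.items() if k != label)
        groups[remaining].append(labels)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical label sets from one scrape, all for the same metric name.
SCRAPED = [
    {"__name__": "app_requests_total", "deployment": "web", "pod_template_hash": "5d9c7"},
    {"__name__": "app_requests_total", "deployment": "web", "pod_template_hash": "7f4b2"},
    {"__name__": "app_requests_total", "deployment": "api", "pod_template_hash": "c81e0"},
]

for group in labeldrop_collisions(SCRAPED, "pod_template_hash"):
    print("would collide:", group)
```

An empty result means the label is noise on top of an otherwise unique identity; any non-empty group is a duplicate waiting to happen. Include target labels such as job and instance in each label set if they differ between the series being compared.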

Recording rules: make the raw series unnecessary

Dashboards are what keep teams addicted to high-cardinality series. Recording rules pre-aggregate to the granularity people actually look at:

groups:
  - name: cardinality-relief
    interval: 30s
    rules:
      - record: namespace:container_cpu_usage_seconds:rate5m
        expr: sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))

Point the dashboards at the recorded series, then watch for remaining queries against the raw metric for a couple of weeks (the query log, enabled with query_log_file in the global config, records every query the server runs) before dropping it at ingestion. That sequencing (substitute, verify, drop) is the difference between cardinality control and a Friday incident, and it belongs in the design of the observability stack rather than in firefights.
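
Why the recorded series is so much cheaper comes down to counting distinct values of the labels that survive the aggregation. A rough sketch, using invented label sets and a hypothetical series_after_sum_by helper that mimics how sum by groups its input:

```python
def series_after_sum_by(series, keep):
    """Count the output series of `sum by (keep...)` over the given label sets.

    A label missing from a series is treated as the empty string, as in PromQL.
    """
    return len({tuple(labels.get(name, "") for name in keep) for labels in series})


# Invented raw series: 3 namespaces x 4 pods x 2 containers.
RAW = [
    {"namespace": ns, "pod": f"{ns}-pod-{i}", "container": c}
    for ns in ("checkout", "search", "billing")
    for i in range(4)
    for c in ("app", "sidecar")
]

print(len(RAW), "raw series ->", series_after_sum_by(RAW, ["namespace"]), "recorded series")
```

The raw count grows with every pod and restart; the recorded count only grows when a namespace is added.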

What I write down

Per Prometheus server: the active-series budget (the threshold for an alert on prometheus_tsdb_head_series), every drop rule with the reason and date, and who owns each scrape job. In my lab and production platforms the alert fires at 80% of the budget: cardinality problems are cheap at 80% and expensive at OOM.
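
The 80% rule is simple enough to state as code. This is only an illustration of the threshold logic: the budget_state function and the two-million budget are hypothetical, and in practice the comparison would live in an alerting rule on prometheus_tsdb_head_series.

```python
def budget_state(active_series, budget, warn_percent=80):
    """Classify head-series usage against a budget: 'ok', 'warn', or 'over'."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    if active_series >= budget:
        return "over"
    # Integer arithmetic keeps the boundary exact (no float rounding at 80%).
    if active_series * 100 >= budget * warn_percent:
        return "warn"
    return "ok"


BUDGET = 2_000_000  # hypothetical active-series budget for one server

for active in (1_200_000, 1_600_000, 2_100_000):
    print(active, budget_state(active, BUDGET))
```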

Frequently asked questions

What counts as high cardinality in Prometheus?
There's no absolute number; the problem is unbounded label values. A label with 10 stable values is fine; a label carrying pod template hashes, request IDs, or client IPs grows forever and churns on every deploy. A single-server Prometheus is typically comfortable in the low millions of active series; one bad label can add that alone.

Is it safe to use labeldrop in metric_relabel_configs?
Only if removing the label cannot make two series identical. If pod_template_hash is the only differing label between series, dropping it creates duplicate series and ingestion errors. Prefer dropping whole metrics you don't query, and use labeldrop for labels that are pure noise on top of an otherwise-unique series identity.

Why are histogram buckets the usual offender?
Every histogram multiplies its label combinations by the bucket count: a metric with 10 label sets and 15 buckets is 150+ series before you add _sum and _count. Classic examples like apiserver_request_duration_seconds_bucket generate tens of thousands of series. Drop buckets you don't query, or move to native histograms.
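
The bucket arithmetic, written out as a small sketch. The classic_histogram_series helper is hypothetical, and it assumes the bucket count already includes the +Inf bucket; if it doesn't, add one to buckets.

```python
def classic_histogram_series(label_sets, buckets):
    """Series produced by one classic histogram, split by kind."""
    bucket_series = label_sets * buckets  # one _bucket series per (label set, le)
    sum_and_count = 2 * label_sets        # one _sum and one _count per label set
    return {
        "bucket": bucket_series,
        "sum_count": sum_and_count,
        "total": bucket_series + sum_and_count,
    }


print(classic_histogram_series(10, 15))
# {'bucket': 150, 'sum_count': 20, 'total': 170}
```

Adding one more label with five values to the same histogram multiplies every number above by five, which is why histograms are the first place to look.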
