PAVEL GLUKHIKH

Infrastructure

Observability Stack Design: Metrics, Logs, Traces, and Cost

How to design an observability stack that engineers trust: Prometheus and Loki-class architecture, retention and cost engineering, and symptom-based alerting.

Executive summary

An observability stack is the combined metrics, logs, and traces pipeline that lets engineers answer 'is it broken?' and 'why?' without SSH-ing into servers. The design problem is not tool selection — Prometheus, Grafana, and a Loki-class log store are a defensible default for most teams. The design problem is the three disciplines that decide whether the stack still works in two years: bounding cardinality, engineering retention as a cost decision, and alerting on symptoms instead of causes. This article covers the architecture, the cost levers, and an alerting philosophy that keeps on-call humane.

The stack is easy; the discipline is not

The tool selection question that starts most observability projects is the least important decision in them. Prometheus for metrics, Grafana for dashboards, a Loki-class store for logs, OpenTelemetry for traces — that combination, or a vendor equivalent, is a fine answer for the large majority of teams. Arguing about it for another month buys nothing.

What decides whether the stack still works two years from now is discipline, in three specific places: bounding cardinality, engineering retention as a cost decision, and alerting on symptoms. I have watched stacks built from excellent tools rot into something nobody trusts because no one enforced any of the three, and I have watched modest stacks stay sharp for years because someone did.

Observability is the clearest case of a general truth: everything becomes operations. The deployment took a sprint. The next five years are the actual project.

Architecture: three signals, one correlation key

Each signal answers a different question, and the design goal is moving between them without re-deriving context mid-incident:

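Signal  | Question it answers                 | Cost profile                                | Alert on it?
--------+-------------------------------------+---------------------------------------------+--------------------------------------------
Metrics | Is it broken? How much? Since when? | Cheap per point, explodes with cardinality  | Yes (primary)
Logs    | What exactly happened?              | Expensive at volume, cheap to ship          | Rarely (security events, dead-man switches)
Traces  | Where in the request path?          | Controlled by sampling                      | No (investigate with them)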

The reference shape I deploy:

 workloads ──► Prometheus (scrape) ──► rules/alerts ──► Alertmanager ──► page/ticket
     │                 │
     │                 └──► remote_write ──► long-term store (Thanos/Mimir-class, object storage)
     ├──► log agent ──► Loki-class store ──► object storage
     └──► OTel SDK ──► OTel Collector ──► trace backend (sampled)

                        Grafana reads all three

Design notes that matter in practice:

  • Prometheus stays close to what it scrapes. One Prometheus per cluster or site, short local retention — days to weeks — with remote_write to a long-term store on object storage if you need more. Do not build one giant Prometheus scraping across WAN links. I have seen that design fail at a plant network boundary and in cloud regions alike; the WAN does not care which one you meant.
  • The correlation key is labels. Service name, environment, and cluster must be identical across metrics, logs, and traces, enforced at the collection layer. If metrics say svc=checkout and logs say app=checkout-v2, every engineer pays a translation tax during every incident — the exact moment they can least afford one.
  • Loki-class stores index labels, not content. That is why they cost a fraction of a full-text index, and why the same cardinality discipline applies: labels identify the stream — service, level, cluster — and everything else stays in the log line, where it is cheap.
  • Traces are sampled or they are unaffordable. Head sampling at a few percent is fine to start. Tail sampling — keep the slow and failed requests — is the upgrade once a collector tier exists to do it.
  • The monitoring stack does not monitor itself. Run a tiny independent watcher — a cloud probe, even a cron on a box outside the failure domain — that alerts when Alertmanager goes silent. Dead-man switches are cheap. The alternative is discovering your outage from a customer.
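
The dead-man switch in the last note can be as small as a timestamp comparison. Below is a minimal Python sketch, assuming the alerting pipeline sends a periodic heartbeat (for example, an always-firing alert routed to a webhook) to a watcher that runs outside the failure domain. The class name, the 120-second window, and the heartbeat mechanism are illustrative assumptions, not part of any specific tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeadMansSwitch:
    """Hypothetical watcher state: trips when heartbeats stop arriving."""

    max_silence_seconds: float
    last_heartbeat: Optional[float] = None  # Unix timestamp of the last heartbeat

    def record_heartbeat(self, now: float) -> None:
        # Called each time the alerting pipeline proves it is still alive.
        self.last_heartbeat = now

    def is_tripped(self, now: float) -> bool:
        # No heartbeat ever received counts as silence too.
        if self.last_heartbeat is None:
            return True
        return (now - self.last_heartbeat) > self.max_silence_seconds


switch = DeadMansSwitch(max_silence_seconds=120)
switch.record_heartbeat(now=1000.0)
print(switch.is_tripped(now=1060.0))  # False: 60 s of silence is within the window
print(switch.is_tripped(now=1200.0))  # True: 200 s of silence exceeds the window
```

The point of the design is that silence is the alarm: the watcher needs no knowledge of the stack's internals, only the time of the last sign of life, and it must notify through a channel that does not depend on the stack it watches.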

Cardinality: the failure mode that kills Prometheus

Prometheus stores one time series per unique label combination, and cost scales with series count, not with traffic. That means one engineer adding a user_id or request_path label can multiply your series count by orders of magnitude overnight: a label with 100,000 distinct values can turn each existing series into as many as 100,000.

Every Prometheus horror story I have been called into was a cardinality story wearing a costume.

The controls, in order of leverage: never label with unbounded values — user IDs, full URLs, container hashes; review label additions the way you review schema changes, because that is what they are; and set sample_limit per scrape job so one bad exporter degrades itself instead of the platform. The specific diagnostic queries and enforcement configs live in my working note on Prometheus cardinality control.
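
The arithmetic behind that explosion is a product of label cardinalities. Here is a rough Python sketch; the label names and the value counts are made-up illustration numbers, and the result is an upper bound, since not every combination of label values necessarily occurs in practice.

```python
from math import prod


def series_upper_bound(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical metric with three bounded labels.
baseline = {"service": 20, "status_code": 5, "instance": 10}

# The same metric after someone adds an unbounded label.
with_user_id = {**baseline, "user_id": 100_000}

print(series_upper_bound(baseline))      # 1000
print(series_upper_bound(with_user_id))  # 100000000
```

Running this kind of estimate during review of a label addition is one way to treat the change like the schema change it is.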

Retention and cost engineering

Retention is where observability budgets die quietly. The engineering move is to stop treating retention as one number and start matching storage tiers to the question each signal actually answers:

  • Metrics, high resolution (15–30s): 2–4 weeks locally. This covers incident investigation and short-term trends — which is nearly every query anyone runs.
  • Metrics, long term: recording rules aggregate what capacity planning really needs — per service, per day — and only those series reach the long-term store. A year of downsampled aggregates costs a rounding error. A year of raw series costs a headcount.
  • Logs: 1–4 weeks hot and searchable. Security-relevant logs — auth, audit, network — archive to object storage for whatever your compliance clock requires; object storage is cheap and hot indexes are not. And cut volume at the source: sample health-check and debug logs at the agent, because the cheapest log is the one never shipped.
  • Traces: days to a week, sampled.

Then make cost visible per team and per signal. The first time a service owner sees their own debug-logging line item, the volume problem largely solves itself.

Pricing beats policy.
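
The gap between a year of raw series and a year of aggregates can be put in rough numbers. The sketch below is a back-of-the-envelope estimate only: the series counts, the 15-second scrape interval, and the 1.5 bytes per compressed sample are assumptions chosen for illustration, and real storage depends on the backend, compression, and index overhead.

```python
SECONDS_PER_DAY = 86_400


def storage_bytes(series: int, interval_seconds: float,
                  retention_days: int, bytes_per_sample: float) -> float:
    """Estimate sample storage: series x samples per series x bytes per sample."""
    samples_per_series = retention_days * SECONDS_PER_DAY / interval_seconds
    return series * samples_per_series * bytes_per_sample


# Assumed: one million raw series scraped every 15 s, kept for a year.
raw = storage_bytes(series=1_000_000, interval_seconds=15,
                    retention_days=365, bytes_per_sample=1.5)

# Assumed: 2,000 recording-rule aggregates, one point per series per day.
aggregated = storage_bytes(series=2_000, interval_seconds=SECONDS_PER_DAY,
                           retention_days=365, bytes_per_sample=1.5)

print(f"raw:        {raw / 1e12:.2f} TB")         # raw:        3.15 TB
print(f"aggregated: {aggregated / 1e6:.1f} MB")   # aggregated: 1.1 MB
print(f"ratio:      {raw / aggregated:,.0f}x")    # ratio:      2,880,000x
```

Multiplying either figure by a per-gigabyte price and grouping by owning team gives the per-team, per-signal line item described above.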

Alerting philosophy: page on symptoms

The rule that keeps on-call humane: a page means a human must act now to protect users. Everything else is a ticket or a dashboard.

Page on symptoms: error rate, latency against SLO, availability of a critical flow. Do not page on causes: CPU high, one replica down, disk at 70%. Cause alerts carry two fatal defects. They fire when nothing is actually wrong: a node down in a redundant pool is Tuesday, not an emergency. And they stay silent for the failure modes you failed to predict, which are most of them. A symptom alert catches every cause that reaches users, including the novel ones.

The practical rules that follow:

  1. Every page maps to a runbook and an action. If the honest response to a page is “watch it,” delete the page.
  2. Multi-window burn rates over static thresholds for SLO alerts — fast burn pages, slow burn tickets. This kills flapping and slow-motion misses in one move.
  3. Cause signals become dashboard context, pre-linked from the symptom alert, so the on-call lands on causes ten seconds after the page instead of being woken by them at 2 a.m. when nothing user-facing is wrong. That handoff from alert to investigation is where the stack plugs into a systematic troubleshooting method.
  4. Review page volume monthly. More than a handful of pages per shift is not vigilance. It is denial-of-sleep against your own staff, and alert fatigue is how real incidents get acknowledged and ignored.
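
Rule 2 can be sketched in a few lines of Python. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and a multi-window rule fires only when both a long and a short window exceed the threshold. The window names and the thresholds here (14.4 over 1h and 5m for a page, 1.0 over 3d and 6h for a ticket) follow commonly cited defaults from Google's SRE Workbook for a 30-day SLO; treat them as assumptions to tune, not as the author's configuration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def decide(error_ratios: dict[str, float], slo_target: float) -> str:
    """Return 'page', 'ticket', or 'none' from per-window error ratios."""
    rates = {window: burn_rate(ratio, slo_target)
             for window, ratio in error_ratios.items()}
    # Fast burn: the long window shows it is significant, the short one that it is ongoing.
    if rates["1h"] >= 14.4 and rates["5m"] >= 14.4:
        return "page"
    # Slow burn: budget is leaking steadily, which deserves a ticket, not a wake-up.
    if rates["3d"] >= 1.0 and rates["6h"] >= 1.0:
        return "ticket"
    return "none"


slo = 0.999  # assumed SLO target: 99.9% of requests succeed
print(decide({"5m": 0.03, "1h": 0.02, "6h": 0.005, "3d": 0.002}, slo))       # page
print(decide({"5m": 0.001, "1h": 0.002, "6h": 0.002, "3d": 0.0015}, slo))    # ticket
print(decide({"5m": 0.0002, "1h": 0.0002, "6h": 0.0002, "3d": 0.0002}, slo)) # none
```

Requiring the short window as well is what stops the alert from lingering after the problem has ended, while the long window keeps a brief blip from paging anyone.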

What to write down

  • The label schema — the exact keys every signal must carry — and who approves changes to it.
  • Retention per signal tier, with the monthly cost next to each line, revisited twice a year.
  • The paging bar: the written definition of what deserves a page, so the argument happens once instead of once per alert rule.
  • Ownership: who responds when the observability stack itself is down, and what watches the watcher.
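
The label schema from the first item is easier to enforce when it is written down as data rather than prose. A minimal Python sketch follows, assuming the three required keys are named service, environment, and cluster; the key names, the function, and the sample label sets are hypothetical.

```python
REQUIRED_LABELS = ("service", "environment", "cluster")  # assumed schema keys


def schema_violations(signals: dict[str, dict[str, str]]) -> list[str]:
    """List missing required labels per signal and values that disagree across signals."""
    problems = []
    for signal, labels in signals.items():
        for key in REQUIRED_LABELS:
            if key not in labels:
                problems.append(f"{signal}: missing label '{key}'")
    for key in REQUIRED_LABELS:
        values = {labels[key] for labels in signals.values() if key in labels}
        if len(values) > 1:
            problems.append(f"label '{key}' disagrees across signals: {sorted(values)}")
    return problems


# The mismatch described earlier: metrics say svc=checkout, logs say app=checkout-v2.
signals = {
    "metrics": {"svc": "checkout", "environment": "prod", "cluster": "eu-1"},
    "logs": {"app": "checkout-v2", "environment": "prod", "cluster": "eu-1"},
    "traces": {"service": "checkout", "environment": "prod", "cluster": "eu-1"},
}
for problem in schema_violations(signals):
    print(problem)
# metrics: missing label 'service'
# logs: missing label 'service'
```

A check like this fits naturally at the collection layer or in CI for collector configuration, so drift is caught before an incident rather than during one.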

A good observability stack is not the one with the most dashboards. It is the one where the on-call engineer trusts every page, finds the cause in minutes, and the bill surprises no one. All three are design outcomes, and all three are decided long before the first incident — which is the pattern with infrastructure generally: when it is designed well, it disappears into the background and simply works. For the full component layout I run, see the observability platform reference architecture.

Frequently asked questions

What are the three pillars of observability?
Metrics, logs, and traces. Metrics are cheap numeric time series, good for alerting and trends. Logs are discrete events with detail, good for investigation. Traces follow one request across services, good for finding where latency and errors originate. They answer different questions, and a usable stack links them: an alert on a metric should land you in the logs and traces for the same service and time window.
Is Prometheus enough for observability?
Prometheus covers metrics and alerting extremely well, and it is deliberately not a log store or a tracing system. A single server also gives you neither long-term nor highly available storage. Most teams pair it with Grafana for dashboards and a log aggregator like Loki, then add tracing through OpenTelemetry once they run enough services for cross-service latency questions to actually matter. Add pieces when the question arrives, not before.
How long should I retain metrics and logs?
Decide per signal and per question, never one number for everything. A working pattern: high-resolution metrics for two to four weeks, recording-rule aggregates for a year of capacity planning, logs hot and searchable for one to four weeks with security-relevant logs archived to object storage for your compliance clock, and traces sampled and kept days to a week. Retention is a cost decision — make it deliberately.
What does 'page on symptoms, not causes' mean?
Page a human only when users or the business are affected (error rate, latency, availability of a critical flow), never on internal conditions like high CPU or one instance down. Cause-based pages fire when nothing is wrong and stay silent for the failures you did not predict, which are most of them. Symptom alerts catch every cause that reaches users, including the ones you never imagined, so every page is worth waking for.
