Reference Architecture

Observability Platform Architecture

Reference architecture for observability: metrics, logs, and trace pipelines, tiered storage, retention economics, alert routing, and dashboard governance.

By Pavel Glukhikh April 15, 2026 6 min read

Design summary

An observability platform is the shared pipeline, storage, and alerting infrastructure that turns telemetry from every system into signals engineers can act on. This architecture uses OpenTelemetry collection at the edge, Prometheus-compatible metrics with a long-term store, Loki-class log aggregation on object storage, and tail-sampled tracing — with retention tiers chosen by query pattern rather than habit. It covers the pipeline topology, why storage economics should drive retention policy, how alert routing stays sane as teams multiply, and the governance that keeps dashboards from becoming a graveyard.

Component stack

OpenTelemetry Collector (agent + gateway tiers)
Prometheus + Thanos or Mimir (long-term metrics)
Loki (log aggregation) on S3-compatible object storage
Tempo or Jaeger (traces, tail sampling)
Grafana (dashboards, provisioned as code)
Alertmanager (routing, inhibition, silences)
Kafka or in-collector buffering for log surge protection
PagerDuty/Opsgenie-class on-call integration

Purpose and requirements

Observability is a platform, not a per-team tool choice.

Every organization I have worked with eventually rediscovers this the same way: each team runs its own Prometheus and its own log stack, and incident response becomes a tour of seven UIs with seven retention policies. This architecture is the shared platform that replaces that — sized for a mid-size estate (tens of services, 50–500 nodes) run by a single platform team.

Requirements:

One pipeline in, three signals stored. Applications instrument once (OpenTelemetry); the platform routes metrics, logs, and traces to the right backends.
Costs scale sub-linearly with telemetry volume. Storage tiers and sampling are designed in, not retrofitted after the first invoice shock.
Alerts reach owners, and only owners. Routing is driven by labels, not by a human dispatcher reading a shared channel.
The platform outlives its dashboards. Governance keeps content trusted; everything is provisioned from Git.

Topology

  SERVICES / NODES                       (per-host or sidecar)
  app SDKs (OTel) ---> [ OTel Collector: AGENT tier ]
  node exporters            |  batching, k8s metadata,
  log files (tail)          |  local buffering
                            v
             [ OTel Collector: GATEWAY tier (x2+) ]
             sampling decisions, redaction, routing,
             surge buffer (memory/Kafka)
              |            |             |
       metrics|        logs|       traces|
              v            v             v
   +----------------+ +-----------+ +-------------+
   | Prometheus     | | Loki      | | Tempo       |
   | (15-30d local) | | ingesters | | tail-sample |
   +-------+--------+ +-----+-----+ +------+------+
           |                |              |
           v                v              v
   +-----------------------------------------------+
   |        OBJECT STORAGE (S3-compatible)          |
   |  Thanos/Mimir blocks   Loki chunks   traces    |
   |  downsampled rollups   (14-30d hot,  (7-14d)   |
   |  (13 months)           archive after)          |
   +-----------------------------------------------+
           |
           v
   [ Grafana ] <--- dashboards provisioned from Git
   [ Alertmanager ] --routes by team label--> on-call
                     (inhibition, silences)   (PagerDuty-class)

Component roles

Collector agent tier. An OpenTelemetry Collector on every node (DaemonSet in Kubernetes) tails logs, receives spans and metrics from app SDKs, scrapes exporters, and enriches everything with infrastructure metadata — namespace, node, owning team. Enrichment happens here because this is the last place the context is cheap to know.

Collector gateway tier. Two or more central collectors where policy lives: tail-sampling decisions, PII redaction, per-tenant rate limits, and routing to backends. Splitting agent from gateway means policy changes are a gateway rollout, not a fleet-wide restart. A surge buffer (in-memory queues, or Kafka once log volume justifies it) sits here so a backend outage causes delay, not data loss — an outage is precisely when you most need the logs that were just generated.
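
To make the "delay, not data loss" claim concrete, here is a minimal sketch of an in-memory surge buffer. The class name `SurgeBuffer`, the capacity of 3, and the `send` callback are all hypothetical; a real gateway would use the collector's own queueing or Kafka rather than code like this. The sketch also shows the limit of the in-memory option: once the queue is full, the oldest records are lost, which is the argument for Kafka at higher log volume.

```python
from collections import deque


class SurgeBuffer:
    """Bounded FIFO that holds records while the backend is unavailable."""

    def __init__(self, capacity, send):
        self.queue = deque()
        self.capacity = capacity
        self.send = send      # callable: returns True if the backend accepted the record
        self.dropped = 0

    def offer(self, record):
        if len(self.queue) >= self.capacity:
            self.queue.popleft()   # buffer full: the oldest record is lost
            self.dropped += 1
        self.queue.append(record)
        self.flush()

    def flush(self):
        # Deliver in arrival order; stop at the first failure and retry later.
        while self.queue and self.send(self.queue[0]):
            self.queue.popleft()


delivered = []
backend_up = False


def send(record):
    if backend_up:
        delivered.append(record)
    return backend_up


buf = SurgeBuffer(capacity=3, send=send)
for i in range(5):
    buf.offer(f"log-{i}")      # backend is down: records queue up

backend_up = True              # backend recovers
buf.flush()
print(delivered, buf.dropped)  # ['log-2', 'log-3', 'log-4'] 2
```

With a capacity that covers the outage, nothing is dropped and the backend simply receives the records late.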

Metrics store. Prometheus (or an agent-mode scraper) holds 15–30 days of raw resolution locally; Thanos or Mimir compacts and downsamples into object storage for 13-month rollups, long enough to compare any month with the same month a year earlier for capacity planning. Cardinality is the failure mode to engineer against: per-request labels, user IDs, and unbounded pod-name explosions will take the store down long before disk fills. My cardinality control note covers the enforcement tactics; the architectural point is that the gateway tier is where you can drop offending labels for everyone at once.
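
A small sketch of why dropping one label at the gateway matters. The deny-list `UNBOUNDED_LABELS` and the sample data are made up for illustration, and this is a model of the effect, not collector configuration.

```python
# Hypothetical deny-list: label names assumed to be unbounded in this estate.
UNBOUNDED_LABELS = frozenset({"user_id", "request_id", "pod"})


def drop_unbounded_labels(labels: dict[str, str]) -> dict[str, str]:
    """Return a copy of a sample's label set without deny-listed labels."""
    return {k: v for k, v in labels.items() if k not in UNBOUNDED_LABELS}


def series_count(samples: list[dict[str, str]]) -> int:
    """Count distinct series: one per unique label set."""
    return len({frozenset(s.items()) for s in samples})


# One metric from one service, labeled with 1,000 different user IDs.
samples = [
    {"service": "checkout", "team": "payments", "user_id": str(uid)}
    for uid in range(1000)
]
before = series_count(samples)
after = series_count([drop_unbounded_labels(s) for s in samples])
print(before, after)  # 1000 1
```

Dropping a label makes previously distinct series collide, so a real pipeline also has to aggregate the colliding samples; the sketch only counts series.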

Log store. Loki-class: index only labels, keep log bodies as compressed chunks in object storage. This is what makes 30-day retention affordable — you are paying object-storage prices for the bulk and index prices only for the label set. The discipline it demands: keep the label set small and bounded (service, level, environment), and push everything else to structured fields queried at read time.
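
The label-only indexing tradeoff can be modeled in a few lines. This is a toy, not Loki's implementation: the class `LabelIndexedLogStore` and its methods are hypothetical names. The index knows only label sets, and anything inside the log body is found by scanning the matching streams at read time.

```python
from collections import defaultdict


class LabelIndexedLogStore:
    """Toy model: one chunk list per label set; bodies are never indexed."""

    def __init__(self) -> None:
        self.chunks: dict[frozenset, list[str]] = defaultdict(list)

    def append(self, labels: dict[str, str], line: str) -> None:
        self.chunks[frozenset(labels.items())].append(line)

    def query(self, selector: dict[str, str], contains: str) -> list[str]:
        wanted = set(selector.items())
        hits = []
        for stream, lines in self.chunks.items():
            if wanted <= stream:  # cheap step: match on labels only
                # expensive step: brute-force scan of the selected bodies
                hits.extend(line for line in lines if contains in line)
        return hits


store = LabelIndexedLogStore()
prod = {"environment": "prod"}
store.append({"service": "checkout", "level": "error", **prod}, 'order_id=42 msg="card declined"')
store.append({"service": "checkout", "level": "info", **prod}, 'order_id=42 msg="cart viewed"')
store.append({"service": "search", "level": "error", **prod}, 'msg="timeout"')

result = store.query({"service": "checkout", "level": "error"}, "order_id=42")
print(len(store.chunks), result)  # 3 ['order_id=42 msg="card declined"']
```

Had `order_id` been a label, every order would have created its own stream, which is exactly the unbounded label set the paragraph above warns against.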

Trace store. Tempo or Jaeger with tail sampling at the gateway: keep 100% of traces with errors, 100% over a latency threshold, and 1–5% of routine successes. Traces are the most voluminous and least-queried signal; sampling is not a compromise, it is the design.
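
The sampling policy above can be sketched as a single decision function. In practice the policy would live in the gateway collector's configuration; the Python below only models the decision. The 500 ms threshold is an assumed value, the 5% success rate is the top of the 1–5% range above, and the longest span duration stands in for end-to-end trace latency.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool


def keep_trace(spans, latency_threshold_ms=500.0, success_rate=0.05, rng=random.random):
    """Tail-sampling decision, made once all spans of the trace have arrived."""
    if not spans:
        return False
    if any(s.is_error for s in spans):
        return True                              # keep 100% of error traces
    if max(s.duration_ms for s in spans) > latency_threshold_ms:
        return True                              # keep 100% of slow traces
    return rng() < success_rate                  # keep a small share of routine successes


failed = [Span("t1", 12.0, False), Span("t1", 30.0, True)]
slow = [Span("t2", 900.0, False)]
routine = [Span("t3", 20.0, False)]

print(keep_trace(failed), keep_trace(slow))      # True True
print(keep_trace(routine, rng=lambda: 0.99))     # False
```

The `rng` parameter exists only so the probabilistic branch can be tested deterministically.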

Grafana + Alertmanager. Grafana is provisioned from Git — dashboards are JSON in a repo, deployed like any other artifact. Alertmanager routes on a mandatory team label, applies inhibition (node down suppresses the twenty service alerts on that node), and everything pages through the on-call tool, never a chat channel.
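
Routing and inhibition are easier to reason about with a worked case. The sketch below models the behavior described above in plain Python, not Alertmanager's configuration format; the alert names, team names, and the `Alert` type are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    team: str   # mandatory ownership label
    node: str


def apply_inhibition(alerts):
    """Suppress every other alert on a node that has a firing NodeDown alert."""
    down = {a.node for a in alerts if a.name == "NodeDown"}
    return [a for a in alerts if a.name == "NodeDown" or a.node not in down]


def route(alerts):
    """Group surviving alerts by team label; each group goes to that team's on-call."""
    pages = {}
    for a in apply_inhibition(alerts):
        pages.setdefault(a.team, []).append(a.name)
    return pages


firing = [
    Alert("NodeDown", team="platform", node="node-7"),
    Alert("HighLatency", team="payments", node="node-7"),
    Alert("ErrorRate", team="search", node="node-7"),
    Alert("DiskFull", team="data", node="node-9"),
]
pages = route(firing)
print(pages)  # {'platform': ['NodeDown'], 'data': ['DiskFull']}
```

The node-7 outage produces one page for the platform team instead of three pages across three teams, while the unrelated node-9 alert still reaches its owner.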

Retention economics

The budget conversation goes better with a table. Approximate relative cost per GB-month, normalized to object storage = 1:

Tier	Media	Relative cost	What belongs there
Hot	Local NVMe/SSD	8–15x	Last 15–30d metrics, active log index
Warm	Object storage	1x	Log chunks, trace blocks, metric blocks
Rollup	Object storage (downsampled)	~0.1x of raw	5m/1h metric aggregates, 13 months
Archive	Cold object tier	~0.4x of standard	Compliance logs, restore-only

Two consequences worth internalizing. First, downsampling is the only way long metric retention is rational: a 1-hour rollup holds 240 times fewer points per series than 15-second raw data (3,600 s / 15 s), roughly two orders of magnitude, and capacity planning does not need 15-second resolution from last March. Second, because hot retention is already short, the expensive tier is sized mainly by ingest rate: the fastest way to cut cost is to stop collecting debug logs from production and drop high-cardinality labels at the gateway, not to shave retention days.
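
The arithmetic behind both points, as a sketch. The 395-day stand-in for 13 months and the 1,000 GB volume are assumptions chosen for round numbers; the cost multipliers come from the table above.

```python
RAW_STEP_S = 15        # raw scrape interval quoted in the text
ROLLUP_STEP_S = 3600   # 1-hour rollup
DAYS_13_MONTHS = 395   # assumption: 13 months rounded to 395 days


def points_per_series(days: int, step_s: int) -> int:
    """Number of stored points for one series over the retention window."""
    return days * 86_400 // step_s


raw_points = points_per_series(DAYS_13_MONTHS, RAW_STEP_S)
rollup_points = points_per_series(DAYS_13_MONTHS, ROLLUP_STEP_S)
print(raw_points, rollup_points, raw_points // rollup_points)
# 2275200 9480 240

# Relative cost per GB-month from the table (object storage = 1).
HOT_RANGE = (8, 15)
WARM = 1


def relative_monthly_cost(gb: float, multiplier: float) -> float:
    return gb * multiplier


hot_low, hot_high = (relative_monthly_cost(1000, m) for m in HOT_RANGE)
warm = relative_monthly_cost(1000, WARM)
print(hot_low, hot_high, warm)  # 8000 15000 1000
```

The point ratio overstates the byte savings somewhat, since a rollup point typically stores several aggregates (for example min, max, sum, and count) rather than one value.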

Security model

Collectors authenticate to gateways (mTLS); backends accept writes only from gateways. Nothing pushes straight to storage.
Redaction at the gateway: tokens, secrets, and PII patterns are scrubbed before storage, because deleting from immutable chunks later is somewhere between painful and impossible.
Read access is scoped: Grafana teams map to label-based data access where multi-tenancy matters; audit-relevant logs get a bucket with object lock.
The platform monitors itself from a separate minimal stack, so that a dead observability platform is reported by something other than itself instead of being discovered late.
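
A minimal sketch of the gateway redaction step from the list above. The three patterns are illustrative assumptions, not a complete PII rule set, and a real deployment would express them in the collector's own processors rather than in Python.

```python
import re

# Hypothetical patterns; each pairs a compiled regex with its replacement.
PATTERNS = [
    (re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    (re.compile(r"(?i)\b(password|token|api_key)=\S+"), r"\1=[REDACTED]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
]


def redact(line: str) -> str:
    """Scrub known secret and PII shapes before the line reaches storage."""
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line


print(redact("login user=ann@example.com password=hunter2"))
# login user=[EMAIL] password=[REDACTED]
print(redact("Authorization: Bearer abc.def-123"))
# Authorization: Bearer [REDACTED]
```

Pattern-based scrubbing only catches shapes it knows about, which is one more reason to keep secrets out of log statements at the source.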

Tradeoffs

Decision	What you gain	What it costs
Three specialized backends vs one vendor platform	Right storage per signal, no per-GB vendor pricing	Three systems to operate and upgrade
Tail sampling vs keep-everything traces	~90%+ storage reduction, keeps what you investigate	Gateway buffering complexity; no complete record
Label-only log indexing (Loki)	Cheap long retention	Full-text needle searches are slower than ELK-class
Gateway tier in the path	One place for policy, redaction, buffering	Extra hop; a component that can itself fail
Dashboards as code	Trusted, reviewable, rebuildable	Friction for casual dashboard edits
13-month downsampled retention	Year-over-year planning data	Rollups cannot answer fine-grained historical questions

Scaling and variations

Lab / small team: single-binary Loki, one Prometheus with local 30-day retention, Grafana, no gateway tier — but keep OpenTelemetry SDKs and the Git provisioning, so growth is a topology change rather than a re-instrumentation.

Growing past 500 nodes: metrics move fully to Mimir with horizontally scaled ingesters; Kafka becomes mandatory in front of log ingestion; tracing gets its own gateway pool because tail-sampling memory scales with span throughput.

Multi-region: collectors per region, region-local hot storage, one query federation layer. Ship rollups, not raw data, across regions — WAN bills for telemetry replication are a self-inflicted wound.

Regulated environments: add the archive tier with object lock and a documented retrieval runbook. Retrieval matters: an archive you cannot restore from within the audit window is a compliance finding with extra storage costs.

Operations notes

Monthly alert review: every page from the last month gets one of three verdicts — actioned (keep), tuned (adjust threshold/routing), or deleted. Alert count going down while coverage holds is the platform improving.
Dashboard governance: each dashboard has an owner and a purpose line; anything unviewed for 90 days moves to an attic/ folder, and the attic gets emptied quarterly. Golden dashboards (per-service overview, platform health) are protected and provisioned; sandboxes are personal.
Cardinality budget per team, enforced at the gateway, reported weekly. It converts an invisible shared-resource problem into a visible quota one.
Ingest SLOs: the platform team publishes its own SLOs — scrape success, ingest lag, query latency — on the independent meta-monitoring stack.
Cost per signal per team goes on a dashboard everyone can see. Nothing reduces debug-log volume faster than a chart with team names on it.

The broader philosophy behind these choices is in observability stack design; this entry is the deployable shape of it.
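
The 90-day attic rule from the dashboard governance note is simple enough to automate. The sketch below assumes a hypothetical `Dashboard` record with a last-viewed date; where that date comes from (usage statistics, access logs) is left open, and exempting golden dashboards is my reading of "protected" rather than something the rule states.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Dashboard:
    path: str
    owner: str
    last_viewed: date
    golden: bool = False


def attic_candidates(dashboards, today, max_idle_days=90):
    """Paths of dashboards unviewed for more than max_idle_days."""
    cutoff = today - timedelta(days=max_idle_days)
    return [
        d.path
        for d in dashboards
        if not d.golden                          # assumption: golden dashboards are exempt
        and not d.path.startswith("attic/")      # already moved
        and d.last_viewed < cutoff
    ]


inventory = [
    Dashboard("payments/old-migration.json", "payments", date(2025, 12, 1)),
    Dashboard("platform/health.json", "platform", date(2025, 1, 1), golden=True),
    Dashboard("search/latency.json", "search", date(2026, 4, 1)),
]
to_move = attic_candidates(inventory, today=date(2026, 4, 15))
print(to_move)  # ['payments/old-migration.json']
```

Run on a schedule, the output becomes a pull request that moves the listed files, which keeps the cleanup reviewable like any other dashboard change.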

The thread running through every decision here is restraint. Telemetry is trivially easy to produce and quietly expensive to keep, and an observability platform is really a machine for deciding — once, centrally, and on purpose — what is worth storing and who gets woken up. The backends will be swapped out in five years. The discipline of paying only for signals someone acts on is the part worth keeping.

Frequently asked questions

Should metrics, logs, and traces live in one system or three?

Three specialized backends behind one query layer beat one do-everything database for most teams. Metrics engines, log stores, and trace stores make different indexing tradeoffs on purpose. Unify at the collection layer (OpenTelemetry) and the viewing layer (Grafana), and let each signal type use storage built for its access pattern.

How long should telemetry be retained?

Match retention to query pattern, not compliance reflex: high-resolution metrics 15–30 days, downsampled rollups 13 months for capacity planning, logs 14–30 days hot with archives to object storage if audit requires, traces 7–14 days sampled. Most queries hit the last few hours; paying SSD prices for data nobody queries is the most common observability budget leak.

What is tail-based trace sampling and why use it?

Head sampling decides at request start whether to keep a trace; tail sampling decides after completion, so it can keep every error and slow request while discarding routine successes. You retain the traces you actually investigate at a fraction of the storage. The cost is a buffering tier that must see all spans of a trace before deciding.

How do you stop alert fatigue as teams grow?

Route by ownership label so alerts reach the team that can act, page only on symptoms tied to SLOs, demote cause-based alerts to tickets or dashboards, and use inhibition rules so one outage produces one page. Then hold a monthly review: any alert that fired without producing action gets tuned or deleted.