Reference Architecture
Observability Platform Architecture
Reference architecture for observability: metrics, logs, and trace pipelines, tiered storage, retention economics, alert routing, and dashboard governance.
Design summary
An observability platform is the shared pipeline, storage, and alerting infrastructure that turns telemetry from every system into signals engineers can act on. This architecture uses OpenTelemetry collection at the edge, Prometheus-compatible metrics with a long-term store, Loki-class log aggregation on object storage, and tail-sampled tracing — with retention tiers chosen by query pattern rather than habit. It covers the pipeline topology, why storage economics should drive retention policy, how alert routing stays sane as teams multiply, and the governance that keeps dashboards from becoming a graveyard.
Component stack
- OpenTelemetry Collector (agent + gateway tiers)
- Prometheus + Thanos or Mimir (long-term metrics)
- Loki (log aggregation) on S3-compatible object storage
- Tempo or Jaeger (traces, tail sampling)
- Grafana (dashboards, provisioned as code)
- Alertmanager (routing, inhibition, silences)
- Kafka or in-collector buffering for log surge protection
- PagerDuty/Opsgenie-class on-call integration
Purpose and requirements
Observability is a platform, not a per-team tool choice.
Every organization I have worked with eventually rediscovers this the same way: each team runs its own Prometheus and its own log stack, and incident response becomes a tour of seven UIs with seven retention policies. This architecture is the shared platform that replaces that — sized for a mid-size estate (tens of services, 50–500 nodes) run by a single platform team.
Requirements:
- One pipeline in, three signals stored. Applications instrument once (OpenTelemetry); the platform routes metrics, logs, and traces to the right backends.
- Costs scale sub-linearly with telemetry volume. Storage tiers and sampling are designed in, not retrofitted after the first invoice shock.
- Alerts reach owners, and only owners. Routing is driven by labels, not by a human dispatcher reading a shared channel.
- The platform outlives its dashboards. Governance keeps content trusted; everything is provisioned from Git.
Topology
SERVICES / NODES          (per-host or sidecar)
app SDKs (OTel)   --->  [ OTel Collector: AGENT tier ]
node exporters          |  batching, k8s metadata,
log files (tail)        |  local buffering
                        v
     [ OTel Collector: GATEWAY tier (x2+) ]
       sampling decisions, redaction, routing,
       surge buffer (memory/Kafka)
        |                 |              |
 metrics|             logs|        traces|
        v                 v              v
+----------------+  +-----------+ +-------------+
| Prometheus     |  | Loki      | | Tempo       |
| (15-30d local) |  | ingesters | | tail-sample |
+-------+--------+  +-----+-----+ +------+------+
        |                 |              |
        v                 v              v
+-----------------------------------------------+
|        OBJECT STORAGE (S3-compatible)         |
| Thanos/Mimir blocks  Loki chunks    traces    |
| downsampled rollups  (14-30d hot,   (7-14d)   |
| (13 months)          archive after)           |
+-----------------------------------------------+
                        |
                        v
[ Grafana ]      <--- dashboards provisioned from Git
[ Alertmanager ] --routes by team label--> on-call
  (inhibition, silences)                   (PagerDuty-class)
Component roles
Collector agent tier. An OpenTelemetry Collector on every node (DaemonSet in Kubernetes) tails logs, receives spans and metrics from app SDKs, scrapes exporters, and enriches everything with infrastructure metadata — namespace, node, owning team. Enrichment happens here because this is the last place the context is cheap to know.
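As a rough illustration of what agent-side enrichment amounts to, here is a sketch in Python rather than collector configuration. The function name, the shape of the metadata dictionaries, the `team` key, and the `"unowned"` fallback are all assumptions made for the example; only the `k8s.*` attribute keys follow OpenTelemetry's Kubernetes naming.

```python
def enrich(attributes, node_meta, pod_meta):
    """Return a copy of a record's attributes with infrastructure context added.

    Values the application already set are kept: setdefault never overwrites.
    """
    enriched = dict(attributes)
    enriched.setdefault("k8s.node.name", node_meta["name"])
    enriched.setdefault("k8s.namespace.name", pod_meta["namespace"])
    # Hypothetical convention: the owning team is read from a pod label.
    enriched.setdefault("team", pod_meta.get("labels", {}).get("team", "unowned"))
    return enriched


record = enrich(
    {"http.route": "/checkout"},
    node_meta={"name": "node-17"},
    pod_meta={"namespace": "payments", "labels": {"team": "payments-core"}},
)
```

Every later stage (routing, cost attribution, alert ownership) can then key off `team` without looking anything up again.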
Collector gateway tier. Two or more central collectors where policy lives: tail-sampling decisions, PII redaction, per-tenant rate limits, and routing to backends. Splitting agent from gateway means policy changes are a gateway rollout, not a fleet-wide restart. A surge buffer (in-memory queues, or Kafka once log volume justifies it) sits here so a backend outage causes delay, not data loss — an outage is precisely when you most need the logs that were just generated.
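The "delay, not data loss" behaviour can be sketched as a bounded in-memory queue that holds batches while the backend is down and drains in order once it recovers. This is a simplified model, not the collector's actual queue implementation; the `SurgeBuffer` class, its `send` callback, and the choice to shed the oldest batch when full are assumptions for the example.

```python
from collections import deque


class SurgeBuffer:
    """Bounded FIFO that holds batches while the backend is unavailable."""

    def __init__(self, send, max_batches=1000):
        self._send = send          # callable(batch); raises ConnectionError on failure
        self._queue = deque()
        self._max = max_batches
        self.dropped = 0           # batches shed because the buffer was full

    def submit(self, batch):
        if len(self._queue) >= self._max:
            self._queue.popleft()  # assumption: shed the oldest batch first
            self.dropped += 1
        self._queue.append(batch)
        self.flush()

    def flush(self):
        """Send queued batches in order; stop at the first failure."""
        while self._queue:
            try:
                self._send(self._queue[0])
            except ConnectionError:
                return False       # backend still down; keep everything queued
            self._queue.popleft()
        return True
```

The `dropped` counter is the number to watch: once it is non-zero during realistic outages, memory is no longer enough and a durable buffer such as Kafka starts to earn its operational cost.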
Metrics store. Prometheus (or an agent-mode scraper) holds 15–30 days of raw resolution locally; Thanos or Mimir compacts and downsamples into object storage for 13-month rollups, long enough to compare any month against the same month a year earlier for capacity planning. Cardinality is the failure mode to engineer against: per-request labels, user IDs, and unbounded pod-name explosions will take the store down long before disk fills. My cardinality control note covers the enforcement tactics; the architectural point is that the gateway tier is where you can drop offending labels for everyone at once.
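To make the cardinality point concrete, the sketch below counts distinct series before and after dropping an unbounded label. The denylist contents and the helper names are hypothetical, and a real pipeline must also aggregate the samples that collapse into one series, which this example does not do.

```python
# Hypothetical denylist of labels known to be unbounded.
DENIED_LABELS = frozenset({"user_id", "request_id"})


def drop_denied(labels, denied=DENIED_LABELS):
    """Return the label set without any denied label names."""
    return {k: v for k, v in labels.items() if k not in denied}


def series_count(label_sets):
    """A series is a unique label set, so count the distinct ones."""
    return len({frozenset(labels.items()) for labels in label_sets})


# One metric, one service, a thousand users: a thousand series.
samples = [
    {"__name__": "http_requests_total", "service": "checkout", "user_id": str(uid)}
    for uid in range(1000)
]

before = series_count(samples)                        # 1000
after = series_count(drop_denied(s) for s in samples)  # 1
```

One label turns a single series into a thousand, and the multiplication compounds with every other label on the metric.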
Log store. Loki-class: index only labels, keep log bodies as compressed chunks in object storage. This is what makes 30-day retention affordable — you are paying object-storage prices for the bulk and index prices only for the label set. The discipline it demands: keep the label set small and bounded (service, level, environment), and push everything else to structured fields queried at read time.
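The label discipline can be sketched as a split applied to every structured log record: a fixed, small set of keys becomes index labels and everything else stays in the log line. The function name and record shape are assumptions for illustration; the three label names are the ones suggested above.

```python
import json

# The bounded label set from the text; everything else stays in the body.
INDEX_LABELS = ("service", "level", "environment")


def to_stream_entry(record):
    """Split a structured record into (index labels, JSON log line)."""
    labels = {k: str(record[k]) for k in INDEX_LABELS if k in record}
    fields = {k: v for k, v in record.items() if k not in INDEX_LABELS}
    return labels, json.dumps(fields, sort_keys=True)


labels, line = to_stream_entry({
    "service": "checkout",
    "level": "error",
    "environment": "prod",
    "user_id": "u-48213",        # high cardinality: stays out of the index
    "msg": "payment declined",
})
```

High-cardinality values such as `user_id` are still searchable at read time by parsing the line; they just never create a new stream.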
Trace store. Tempo or Jaeger with tail sampling at the gateway: keep 100% of traces with errors, 100% over a latency threshold, and 1–5% of routine successes. Traces are the most voluminous and least-queried signal; sampling is not a compromise, it is the design.
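The keep/drop policy described here fits in a few lines. The sketch below is a model of the decision only, not the gateway's sampling processor; `TraceSummary`, the 1000 ms threshold, and the 5% default are illustrative assumptions, with the baseline taken from the 1–5% range above.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str
    has_error: bool
    duration_ms: float


def keep_trace(trace, latency_threshold_ms=1000.0, baseline_rate=0.05,
               rng=random.random):
    """Decide after the trace has completed whether to store it."""
    if trace.has_error:
        return True                      # keep 100% of errors
    if trace.duration_ms > latency_threshold_ms:
        return True                      # keep 100% of slow requests
    return rng() < baseline_rate         # keep a small share of routine successes
```

Passing `rng` in makes the probabilistic branch testable; a production policy would more likely derive the decision from the trace ID so that every gateway replica agrees.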
Grafana + Alertmanager. Grafana is provisioned from Git: dashboards are JSON in a repo, deployed like any other artifact. Alertmanager routes on a mandatory team label, applies inhibition (node down suppresses the twenty service alerts on that node), and everything pages through the on-call tool, never a chat channel.
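Label-driven routing and inhibition can be modelled in plain Python to show the logic, independent of Alertmanager's own configuration syntax. The alert shape, the `NodeDown` alert name, the receiver names, and the fallback receiver are all hypothetical.

```python
def route(alert, receivers, fallback="platform-oncall"):
    """Pick a receiver from the alert's team label (fallback is an assumption)."""
    team = alert["labels"].get("team")
    return receivers.get(team, fallback)


def inhibit(alerts):
    """Suppress per-service alerts on any node that has a NodeDown alert firing."""
    down_nodes = {
        a["labels"].get("node")
        for a in alerts
        if a["labels"].get("alertname") == "NodeDown"
    }
    down_nodes.discard(None)
    return [
        a for a in alerts
        if a["labels"].get("alertname") == "NodeDown"
        or a["labels"].get("node") not in down_nodes
    ]


alerts = [
    {"labels": {"alertname": "NodeDown", "node": "n1", "team": "platform"}},
    {"labels": {"alertname": "HighLatency", "node": "n1", "team": "checkout"}},
    {"labels": {"alertname": "HighLatency", "node": "n2", "team": "search"}},
]
receivers = {"platform": "pd-platform", "search": "pd-search"}
pages = [route(a, receivers) for a in inhibit(alerts)]
```

The checkout latency alert on `n1` never pages, because the node outage already explains it; the platform team gets one page for the node and the search team gets theirs for an unrelated problem.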
Retention economics
The budget conversation goes better with a table. Approximate relative cost per GB-month, normalized to object storage = 1:
| Tier | Media | Relative cost | What belongs there |
|---|---|---|---|
| Hot | Local NVMe/SSD | 8–15x | Last 15–30d metrics, active log index |
| Warm | Object storage | 1x | Log chunks, trace blocks, metric blocks |
| Rollup | Object storage (downsampled) | ~0.1x of raw | 5m/1h metric aggregates, 13 months |
| Archive | Cold object tier | ~0.4x of standard | Compliance logs, restore-only |
Two consequences worth internalizing. First, downsampling is the only way long metric retention is rational: 1-hour rollups are roughly two orders of magnitude smaller than 15-second raw data, and capacity planning does not need 15-second resolution from last March. Second, the expensive tier is sized by ingest rate, not retention — the fastest way to cut cost is to stop collecting debug logs from production and drop high-cardinality labels at the gateway, not to shave retention days.
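The downsampling arithmetic is worth doing once by hand. The sketch below counts samples per series and applies the tier multipliers from the table; the function names are hypothetical, the 500 GB volume and the 10x hot multiplier are illustrative picks (10x sits inside the 8–15x range), and `aggregates_per_window` is included because stores that keep several aggregates per window save less than the raw sample ratio suggests.

```python
def reduction_factor(raw_interval_s, rollup_interval_s, aggregates_per_window=1):
    """How many times fewer values a rollup stores than the raw series."""
    return rollup_interval_s / (raw_interval_s * aggregates_per_window)


def relative_monthly_cost(gb, tier_multiplier):
    """Cost in units where 1 GB-month of standard object storage = 1."""
    return gb * tier_multiplier


hourly = reduction_factor(15, 3600)       # 240.0: roughly two orders of magnitude
five_minute = reduction_factor(15, 300)   # 20.0

hot = relative_monthly_cost(500, 10)      # 5000
warm = relative_monthly_cost(500, 1)      # 500
```

The same 500 GB costs ten times as much on local SSD as in object storage, which is why the hot tier should hold only what is queried at raw resolution.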
Security model
- Collectors authenticate to gateways (mTLS); backends accept writes only from gateways. Nothing pushes straight to storage.
- Redaction at the gateway: tokens, secrets, and PII patterns are scrubbed before storage, because deleting from immutable chunks later is somewhere between painful and impossible.
- Read access is scoped: Grafana teams map to label-based data access where multi-tenancy matters; audit-relevant logs get a bucket with object lock.
- The platform monitors itself from a separate minimal stack — a dead observability platform must not be the reason you find out late.
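Gateway redaction is, at its simplest, an ordered list of pattern substitutions applied to every log line before it is written. The patterns below are deliberately small illustrations and an assumption of this example, not a complete PII ruleset; real deployments need patterns reviewed against their own data.

```python
import re

# Illustrative patterns only: bearer tokens, email addresses, key=value secrets.
REDACTIONS = [
    (re.compile(r"\bBearer\s+[A-Za-z0-9\-._~+/]+=*", re.IGNORECASE),
     "Bearer [REDACTED]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
     "[REDACTED_EMAIL]"),
    (re.compile(r"\b(password|secret|api_key)=\S+", re.IGNORECASE),
     r"\1=[REDACTED]"),
]


def redact(line):
    """Apply every redaction pattern in order and return the scrubbed line."""
    for pattern, replacement in REDACTIONS:
        line = pattern.sub(replacement, line)
    return line


clean = redact(
    "Authorization: Bearer abc.def-123 user=alice@example.com password=hunter2"
)
```

Because this runs before storage, nothing sensitive ever lands in an immutable chunk that would later need rewriting.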
Tradeoffs
| Decision | What you gain | What it costs |
|---|---|---|
| Three specialized backends vs one vendor platform | Right storage per signal, no per-GB vendor pricing | Three systems to operate and upgrade |
| Tail sampling vs keep-everything traces | ~90%+ storage reduction, keeps what you investigate | Gateway buffering complexity; no complete record |
| Label-only log indexing (Loki) | Cheap long retention | Full-text needle searches are slower than ELK-class |
| Gateway tier in the path | One place for policy, redaction, buffering | Extra hop; a component that can itself fail |
| Dashboards as code | Trusted, reviewable, rebuildable | Friction for casual dashboard edits |
| 13-month downsampled retention | Year-over-year planning data | Rollups cannot answer fine-grained historical questions |
Scaling and variations
Lab / small team: single-binary Loki, one Prometheus with local 30-day retention, Grafana, no gateway tier — but keep OpenTelemetry SDKs and the Git provisioning, so growth is a topology change rather than a re-instrumentation.
Growing past 500 nodes: metrics move fully to Mimir with horizontally scaled ingesters; Kafka becomes non-optional in front of log ingestion; tracing gets its own gateway pool because tail-sampling memory scales with span throughput.
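A back-of-envelope estimate shows why tail-sampling memory tracks span throughput: every span has to be held until its trace's decision window closes. The function name and all three input figures below are hypothetical, and the result ignores per-span bookkeeping overhead.

```python
def tail_sampling_buffer_bytes(spans_per_second, decision_wait_s, avg_span_bytes):
    """Approximate bytes held in memory while traces await a sampling decision."""
    return spans_per_second * decision_wait_s * avg_span_bytes


# Assumed workload: 20k spans/s, 30 s decision window, ~1 kB per span.
estimate = tail_sampling_buffer_bytes(20_000, 30, 1_000)
gib = round(estimate / 2**30, 2)   # 0.56
```

Doubling either throughput or the decision window doubles the footprint, which is the argument for giving tracing a gateway pool that scales independently of logs and metrics.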
Multi-region: collectors per region, region-local hot storage, one query federation layer. Ship rollups, not raw data, across regions — WAN bills for telemetry replication are a self-inflicted wound.
Regulated environments: add the archive tier with object lock and a documented retrieval runbook. Retrieval matters: an archive you cannot restore from within the audit window is a compliance finding with extra storage costs.
Operations notes
- Monthly alert review: every page from the last month gets one of three verdicts — actioned (keep), tuned (adjust threshold/routing), or deleted. Alert count going down while coverage holds is the platform improving.
- Dashboard governance: each dashboard has an owner and a purpose line; anything unviewed for 90 days moves to an attic folder, and the attic gets emptied quarterly. Golden dashboards (per-service overview, platform health) are protected and provisioned; sandboxes are personal.
- Cardinality budget per team, enforced at the gateway, reported weekly. It converts an invisible shared-resource problem into a visible quota one.
- Ingest SLOs: the platform team publishes its own SLOs — scrape success, ingest lag, query latency — on the independent meta-monitoring stack.
- Cost per signal per team goes on a dashboard everyone can see. Nothing reduces debug-log volume faster than a chart with team names on it.
The broader philosophy behind these choices is in observability stack design; this entry is the deployable shape of it.
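The 90-day attic rule from the operations notes is easy to automate. A minimal sketch, assuming each dashboard is a dictionary with hypothetical `uid` and `last_viewed` fields and that golden dashboards are passed in as a protected set; how view timestamps are obtained depends on the dashboard tool and is not shown.

```python
from datetime import datetime, timedelta, timezone


def attic_candidates(dashboards, now, max_idle_days=90, protected=frozenset()):
    """Return uids of unprotected dashboards not viewed within the idle window."""
    cutoff = now - timedelta(days=max_idle_days)
    return [
        d["uid"]
        for d in dashboards
        if d["uid"] not in protected and d["last_viewed"] < cutoff
    ]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
dashboards = [
    {"uid": "svc-overview", "last_viewed": now - timedelta(days=200)},
    {"uid": "old-debug", "last_viewed": now - timedelta(days=120)},
    {"uid": "checkout-latency", "last_viewed": now - timedelta(days=3)},
]
to_attic = attic_candidates(dashboards, now, protected=frozenset({"svc-overview"}))
```

Protected dashboards are skipped even when stale, matching the rule that golden dashboards are provisioned rather than garbage-collected.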
The thread running through every decision here is restraint. Telemetry is trivially easy to produce and quietly expensive to keep, and an observability platform is really a machine for deciding — once, centrally, and on purpose — what is worth storing and who gets woken up. The backends will be swapped out in five years. The discipline of paying only for signals someone acts on is the part worth keeping.
Frequently asked questions
- Should metrics, logs, and traces live in one system or three?
- Three specialized backends behind one query layer beats one do-everything database for most teams. Metrics engines, log stores, and trace stores make different indexing tradeoffs on purpose. Unify at the collection layer (OpenTelemetry) and the viewing layer (Grafana), and let each signal type use storage built for its access pattern.
- How long should telemetry be retained?
- Match retention to query pattern, not compliance reflex: high-resolution metrics 15–30 days, downsampled rollups 13 months for capacity planning, logs 14–30 days queryable with archives to a cold object tier if audit requires, traces 7–14 days sampled. Most queries hit the last few hours; paying SSD prices for data nobody queries is the most common observability budget leak.
- What is tail-based trace sampling and why use it?
- Head sampling decides at request start whether to keep a trace; tail sampling decides after completion, so it can keep every error and slow request while discarding routine successes. You retain the traces you actually investigate at a fraction of the storage. The cost is a buffering tier that must see all spans of a trace before deciding.
- How do you stop alert fatigue as teams grow?
- Route by ownership label so alerts reach the team that can act, page only on symptoms tied to SLOs, demote cause-based alerts to tickets or dashboards, and use inhibition rules so one outage produces one page. Then review: any alert that fires without producing action gets tuned or deleted at a monthly review.