PAVEL GLUKHIKH

Infrastructure as Code Is an Operating Model, Not a Tool

Infrastructure as code beyond the tooling: repo structure, review gates, drift management, state security, and CI/CD pipelines that make IaC trustworthy.

Executive summary

Infrastructure as code is the practice of defining infrastructure in version-controlled, reviewable, machine-applied definitions — and the tool is the smallest part of it. What decides whether IaC delivers is the operating model around it: how repositories are structured, what a review must prove before apply, how drift is detected and triaged, how state files are secured, and whether a pipeline or a laptop is the thing allowed to change production. After years of running infrastructure teams and estates that grew faster than their habits, this is the operating model I have seen hold up.

The tool is the easy 20%

Every infrastructure-as-code initiative I have walked into mid-flight had already picked a tool, and the tool was rarely the problem. Terraform or OpenTofu, CloudFormation, Pulumi, Ansible — all of them describe infrastructure competently. What the struggling initiatives were missing was everything around the tool: nobody could say which repo owned a given subnet, reviews were rubber stamps because the plan output never appeared in the pull request, production had drifted far enough that applies were feared, and three engineers still held admin credentials “for emergencies” that got used weekly.

Adopting the tool is deployment. The operating model is the operations — and as always, the operations are the part that decides.

Infrastructure as code succeeds as an operating model: repository structure, review gates, drift discipline, state security, and a pipeline that is the only thing allowed to touch production. Across the teams and estates I have run, the model below is the one that held up as both grew.

Repo structure: partition by blast radius

The structural question is not monorepo versus polyrepo. Both work. The question is what a single bad merge can destroy, and the structure should make the answer “one environment, one domain” — never “everything.”

infra-modules/               # versioned, reusable building blocks
  network/  vpc/  k8s-cluster/  ...

infra-live/                  # environment roots (compositions)
  prod/
    us-east/
      network/               # own state
      platform/              # own state
  staging/
    ...

The rules that matter:

  • Modules are versioned and pinned. Environment roots reference network v3.2.1, not main. Upgrading a module in staging before prod becomes a one-line diff instead of a leap of faith.
  • State splits along blast-radius lines — per environment, per region, per domain (network vs. platform vs. data). A monolithic state means every plan risks everything and every lock blocks everyone. Over-fragmented state turns one change into eight PRs. Split where teams and failure domains split, and stop there.
  • Environments differ by variables, not by structure. The moment staging and prod have structurally different code, staging stops validating anything. It is the same reasoning that makes a production-shaped lab valuable, applied to every environment tier.
  • Somewhere, an ownership map: which root owns which real-world thing. Two roots convinced they own the same firewall rule set is a slow-motion collision, and a README table prevents it.
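The pinning rule is easy to enforce mechanically. Below is a minimal sketch in Python, assuming modules are referenced through Git-style source strings that carry a ?ref= query. The example.com URL and the function names are hypothetical, and a real check would also need to cover registry modules pinned with a version argument.

```python
import re
from typing import Optional
from urllib.parse import parse_qs

# Assumption: release tags look like v3.2.1 or 3.2.1.
SEMVER_TAG = re.compile(r"^v?\d+\.\d+\.\d+$")


def pinned_ref(source: str) -> Optional[str]:
    """Return the ?ref= value of a module source string, or None if absent."""
    _, _, query = source.partition("?")
    refs = parse_qs(query).get("ref", [])
    return refs[0] if refs else None


def is_pinned(source: str) -> bool:
    """True only when the source pins a release tag, not a branch like main."""
    ref = pinned_ref(source)
    return ref is not None and SEMVER_TAG.match(ref) is not None


if __name__ == "__main__":
    # Hypothetical module sources for illustration.
    base = "git::https://example.com/infra-modules.git//network"
    for src in (base + "?ref=v3.2.1", base + "?ref=main", base):
        print(src, "->", "pinned" if is_pinned(src) else "NOT pinned")
```

Run in CI against every environment root, a check like this turns "pinned, not main" from a convention into a gate.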

Review gates: the plan is the diff that counts

A pull request that shows HCL changes but not the resulting plan is asking reviewers to compile the change in their heads. Some will claim they can.

The gate that changes everything is mechanical: CI posts the plan output into the PR, and that plan — not the code — is what gets reviewed.

A review must prove three things:

  1. The change does what the description claims. The plan’s create/update/destroy lines match the stated intent, and nothing else appears. Anything unexplained in the plan is a question, not a footnote.
  2. Destroys are named and justified. Any destroy or replace of a stateful resource — databases, volumes, anything with data gravity — gets called out explicitly in the PR, not discovered in the apply log.
  3. Policy holds. Automated checks — policy-as-code (OPA/Sentinel-class), linting, cost estimation, secret scanning — run before a human spends attention. Machines review rules; humans review intent. Security-relevant changes (IAM, network exposure, encryption settings) route to the same scrutiny you would apply in a security architecture review.
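The first two checks can be partly mechanized before a human reads anything. The sketch below assumes the plan has been exported in Terraform's JSON plan representation (a resource_changes list whose entries carry change.actions). The STATEFUL_TYPES set is an illustrative assumption rather than a complete list, and summarize_plan is a hypothetical helper name.

```python
from collections import Counter

# Assumption: an illustrative, incomplete set of resource types with data gravity.
STATEFUL_TYPES = {"aws_db_instance", "aws_ebs_volume", "aws_s3_bucket"}


def summarize_plan(plan: dict) -> dict:
    """Count planned actions and list stateful resources that would be destroyed."""
    counts = Counter()
    stateful_destroys = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing changes in the real world
        if "delete" in actions and "create" in actions:
            kind = "replace"  # a replace is a destroy followed by a create
        else:
            kind = actions[0]
        counts[kind] += 1
        if "delete" in actions and rc["type"] in STATEFUL_TYPES:
            stateful_destroys.append(rc["address"])
    return {"counts": dict(counts), "stateful_destroys": stateful_destroys}


if __name__ == "__main__":
    # Hypothetical plan fragment for illustration.
    example = {"resource_changes": [
        {"address": "aws_db_instance.main", "type": "aws_db_instance",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_instance.web", "type": "aws_instance",
         "change": {"actions": ["update"]}},
    ]}
    print(summarize_plan(example))
```

Posting this summary at the top of the PR comment puts every stateful destroy in front of the reviewer instead of leaving it buried in a long plan.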

Right-size the ceremony. A tag change and a VPC peering change do not deserve the same gate; auto-approve the no-op and low-risk classes and spend human review where the blast radius is. Review fatigue is how real mistakes get approved.
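One way to right-size the gate is to classify the plan before routing it. This is a rough sketch, assuming the same JSON plan representation with before and after attribute maps on each change. The TAG_KEYS names and the two gate labels are assumptions. A production version would also have to account for values that are unknown until apply, which this sketch ignores.

```python
# Assumption: tag attributes are named like this in the provider schema.
TAG_KEYS = {"tags", "tags_all"}


def _without_tags(attrs):
    """Copy of an attribute map with tag attributes removed."""
    return {k: v for k, v in (attrs or {}).items() if k not in TAG_KEYS}


def is_tag_only(change: dict) -> bool:
    """True when an in-place update touches nothing except tags."""
    return (
        change["actions"] == ["update"]
        and _without_tags(change.get("before")) == _without_tags(change.get("after"))
    )


def gate_for(plan: dict) -> str:
    """Return 'auto-approve' for no-op or tag-only plans, else 'human-review'."""
    for rc in plan.get("resource_changes", []):
        change = rc["change"]
        if change["actions"] == ["no-op"]:
            continue
        if not is_tag_only(change):
            return "human-review"
    return "auto-approve"
```

The default is deliberately conservative: anything the classifier does not recognize as low-risk falls through to human review.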

Drift: detect continuously, treat as a signal

Drift — reality diverging from code — is inevitable. Emergency console fixes, well-meaning clickops, provider-side changes: it all accumulates. The operating model question is not whether drift happens. It is whether you notice in hours or in a year.

Run a scheduled plan against every environment — nightly is fine — that alerts on any diff. Then triage every finding into exactly one of three outcomes:

  • Revert — unauthorized or accidental; the pipeline restores declared state.
  • Codify — the 2 a.m. fix was correct, so it becomes a PR within a defined window (48 hours is a workable number) and the repo reabsorbs reality.
  • Investigate — nobody claims it. An unexplained infrastructure change is a security event until proven otherwise, not a curiosity.
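The nightly check and the three-way triage can be sketched as follows. This assumes Terraform's plan command with the -detailed-exitcode flag, where exit code 0 means no changes, 2 means a non-empty diff, and 1 means an error. The function names, the triage inputs, and the 48-hour constant are illustrative assumptions.

```python
import subprocess
from datetime import datetime, timedelta, timezone

# The article's "workable number"; tune per team.
CODIFY_WINDOW = timedelta(hours=48)


def drift_status(exit_code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a drift status."""
    if exit_code == 0:
        return "clean"
    if exit_code == 2:
        return "drift"
    return "error"


def check_root(root_dir: str) -> str:
    """Plan one environment root without applying (needs terraform on PATH)."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=root_dir, capture_output=True, text=True,
    )
    return drift_status(proc.returncode)


def triage(claimed: bool, change_was_correct: bool) -> str:
    """Pick exactly one outcome for a drift finding."""
    if not claimed:
        return "investigate"  # unexplained change: treat as a security event
    return "codify" if change_was_correct else "revert"


def codify_overdue(fixed_at: datetime, now: datetime) -> bool:
    """True when an emergency fix has outlived its codification window."""
    return now - fixed_at > CODIFY_WINDOW
```

An error status deserves its own alert: a drift check that silently fails to run is indistinguishable from a clean estate.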

The cultural rule underneath matters more than the tooling: emergency console access exists, is logged, and creates an obligation to codify. Punish the emergency fix and people learn to hide drift. Make codification cheap and they learn to return it.

State is secret material

Terraform-class state files contain resource attributes, historically including credentials and connection strings in plaintext, plus a complete, accurate map of your estate. Recent versions handle secrets better, but the safe operating assumption has not changed: treat every state file as if it holds secrets.

Whoever reads state can attack you. Whoever writes state can redefine your infrastructure.

The minimum bar is an encrypted remote backend with locking and versioning enabled; access scoped per environment, so that the CI role that plans staging cannot read prod state; state access logged; and state files banned from git, laptops, and chat by policy and by .gitignore. Versioned state is also your recovery path, both for corruption and for the classic incident where someone ran terraform state rm with conviction.
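The "banned from git" part of that policy can be backed by a check as well as a .gitignore entry. Here is a minimal sketch that scans a list of tracked paths for state files. The two filename patterns are an assumption based on the conventional terraform.tfstate and terraform.tfstate.backup names, and leaked_state_files is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Assumption: state files follow the conventional *.tfstate naming.
STATE_PATTERNS = ("*.tfstate", "*.tfstate.*")


def leaked_state_files(tracked_paths):
    """Return the tracked paths whose filename looks like a state file."""
    return [
        path for path in tracked_paths
        if any(fnmatchcase(PurePosixPath(path).name, pat) for pat in STATE_PATTERNS)
    ]


if __name__ == "__main__":
    # Hypothetical listing, e.g. the output of `git ls-files` split into lines.
    tracked = [
        "prod/network/main.tf",
        "prod/network/terraform.tfstate",
        "staging/terraform.tfstate.backup",
    ]
    for path in leaked_state_files(tracked):
        print("state file tracked in git:", path)
```

Failing the build on any hit catches the case where someone force-adds a state file past the ignore rule.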

The pipeline is the only operator

The end state of the operating model is simple to say: humans write and review; only the pipeline applies. Merge to the environment branch or root, and CI plans, gates, applies, and records — using per-environment credentials that individual engineers do not hold. This is the GitOps principle applied honestly: the repository is the interface to production, and everything else is break-glass.

For the pipeline to deserve that role, it needs: short-lived cloud credentials via OIDC federation rather than long-lived keys parked in CI; plan and apply as separate stages, with the apply gated on the reviewed plan artifact — not a fresh plan that might quietly differ; environment promotion in order; and an audit trail linking every production change to a commit, a reviewer, and a pipeline run. When an auditor asks who changed the firewall and why, the answer is a URL. Having sat on the receiving end of enough audits and enough incidents, I can report that “the answer is a URL” is worth every hour the model costs. The same trail is what lets a hybrid estate stay coherent when workloads span datacenters and cloud, as covered in the cloud vs on-premises framework.
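Gating the apply on the reviewed plan artifact can be as simple as comparing digests. The sketch below assumes the plan stage saves a plan file and records its SHA-256 alongside the review, and that the apply stage refuses to proceed if the file it is handed does not match. PlanMismatch and both function names are hypothetical.

```python
import hashlib
import hmac


class PlanMismatch(Exception):
    """Raised when the plan offered for apply is not the one that was reviewed."""


def plan_digest(plan_bytes: bytes) -> str:
    """SHA-256 hex digest of a saved plan artifact."""
    return hashlib.sha256(plan_bytes).hexdigest()


def verify_reviewed_plan(plan_bytes: bytes, reviewed_digest: str) -> None:
    """Raise PlanMismatch unless the artifact matches the reviewed digest."""
    if not hmac.compare_digest(plan_digest(plan_bytes), reviewed_digest):
        raise PlanMismatch("plan artifact differs from the reviewed plan")


if __name__ == "__main__":
    # Hypothetical artifact contents for illustration.
    reviewed = plan_digest(b"reviewed plan bytes")
    verify_reviewed_plan(b"reviewed plan bytes", reviewed)  # passes silently
    try:
        verify_reviewed_plan(b"a fresh plan that quietly differs", reviewed)
    except PlanMismatch as exc:
        print("apply blocked:", exc)
```

Storing the digest in the pipeline run record also strengthens the audit trail, because the commit, the reviewer, and the exact plan that was applied all resolve from the same URL.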

What to write down

  • The repo and state map: which root owns which slice of reality.
  • The review gate definition per change class, including what auto-approves.
  • The drift SLA: detection cadence, triage owner, codification window.
  • State backend policy: encryption, locking, who can read which state.
  • The break-glass procedure: who may bypass the pipeline, how it is logged, and what debt it creates.

Adopting a tool takes a sprint. Building the operating model takes a couple of quarters of habit. The difference between the two is the difference between “we have some Terraform” and infrastructure your team can change on a Friday afternoon without anyone holding their breath — which is what infrastructure looks like when it is working: unremarkable, predictable, and very nearly invisible.

Frequently asked questions

What is infrastructure as code in practice?

Defining servers, networks, and services in declarative files (Terraform or OpenTofu, CloudFormation, Ansible, Kubernetes manifests) that live in version control, get reviewed like application code, and are applied by automation. The payoff is reproducibility and an audit trail: the repository, not an engineer's memory, becomes the source of truth for what production is supposed to look like.

How should Terraform repositories be structured?

Split by blast radius and rate of change, not by resource type. Shared reusable modules live in their own versioned repo or directory; environment roots (prod, staging, per-region) are small compositions that pin module versions, each with its own state. The test is simple and worth writing down: a bad merge to any single root must be incapable of touching more than one environment's state.

How do you handle infrastructure drift?

Detect it continuously, with a scheduled plan against every environment that reports any diff, then triage every finding: revert unauthorized changes, codify legitimate emergency fixes back into the repo, and investigate anything nobody claims. Drift you tolerate silently compounds until the plan output is so noisy that applies become frightening, and frightened teams stop applying. That is how IaC adoption quietly dies.

Why are Terraform state files a security risk?

State files record every attribute of managed resources, and despite improvements in recent versions they can still hold secrets and sensitive values in plaintext, alongside a complete, accurate map of your infrastructure. Treat state as secret material: encrypted remote backend, locking enabled, access scoped tightly per environment, versioned so you can recover, and never committed to git or passed around in chat.
