PAVEL GLUKHIKH

Infrastructure · Pillar Guide

Kubernetes in production: what it is and when it earns it

Kubernetes explained as a control loop over desired state: when it earns its complexity, the production-readiness territory, and what operating it costs.

7 min read

Executive summary

Kubernetes is a control loop over desired state: you declare what should exist, and controllers continuously reconcile reality toward that declaration. That single idea explains both its power (self-healing, portability, a uniform API for infrastructure) and its cost: a distributed system you must now operate underneath your applications. This page is the orientation point for the Kubernetes topic cluster on this site: what the system actually is, when it earns its complexity and when it does not, the territory map of production readiness, and the operational reality of day two. It ends with reading paths into the detailed architecture, troubleshooting, and operations material.

Kubernetes is a control loop over desired state. You declare what should exist (five replicas of this container image, this much CPU and memory, reachable at this name), and a set of controllers continuously compares that declaration against reality and acts on the difference. A container dies; the loop restarts it. A node fails; the loop reschedules its workloads elsewhere. You change the declaration; the loop rolls the change out.

That is the entire idea. Pods, services, deployments, operators, admission controllers: the whole vocabulary that makes Kubernetes feel enormous is elaboration on that one reconciliation loop, applied to more kinds of state.
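To make the loop concrete, here is a minimal sketch in Python. It is a toy, not how any real controller is implemented: `ToyCluster`, `reconcile`, and the `web-N` replica names are all hypothetical, and "reality" is just an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical stand-in for real cluster state: a list of replica names."""
    running: list[str] = field(default_factory=list)
    next_id: int = 0

    def start_replica(self) -> None:
        self.running.append(f"web-{self.next_id}")
        self.next_id += 1

    def stop_replica(self) -> None:
        self.running.pop()


def reconcile(desired_replicas: int, cluster: ToyCluster) -> int:
    """Compare desired state with observed state and act on the difference.

    Returns the number of corrective actions taken (0 means converged).
    """
    actions = 0
    while len(cluster.running) < desired_replicas:
        cluster.start_replica()
        actions += 1
    while len(cluster.running) > desired_replicas:
        cluster.stop_replica()
        actions += 1
    return actions


cluster = ToyCluster()
reconcile(5, cluster)            # initial rollout: five replicas started
cluster.running.remove("web-2")  # a container dies
reconcile(5, cluster)            # the loop notices the gap and fills it
print(cluster.running)           # ['web-0', 'web-1', 'web-3', 'web-4', 'web-5']
```

Note that nothing in `reconcile` knows why a replica disappeared. It only sees a difference between declared and observed state, which is why the same code path handles crashes, scale-ups, and scale-downs.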

I have spent years writing Kubernetes guides on Medium and running clusters in environments from enterprise platforms to the production-grade lab in my own datacenter, and the single most useful thing I can tell a newcomer is this: stop memorizing resource types and internalize the loop. Every behavior that seems mysterious (why a pod is Pending, why a rollout stalls, why deleting a resource brings it back) becomes legible the moment you ask “which controller owns this state, and what is it reconciling against?”
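The "which controller owns this state" question can be sketched as a walk up an ownership chain. In a real cluster this information lives in each object's `metadata.ownerReferences`; the sketch below replaces the API with a plain dictionary, and the object names and the `owner_chain` helper are hypothetical.

```python
def owner_chain(name: str, owners: dict[str, str]) -> list[str]:
    """Follow owner links from an object up to its top-level owner."""
    chain = [name]
    # Stop at an object with no owner, or if a malformed map would loop.
    while chain[-1] in owners and owners[chain[-1]] not in chain:
        chain.append(owners[chain[-1]])
    return chain


# Hypothetical objects: a Deployment owns a ReplicaSet, which owns a Pod.
owners = {
    "pod/web-7d4b9c-x2x1z": "replicaset/web-7d4b9c",
    "replicaset/web-7d4b9c": "deployment/web",
}

print(" -> ".join(owner_chain("pod/web-7d4b9c-x2x1z", owners)))
# pod/web-7d4b9c-x2x1z -> replicaset/web-7d4b9c -> deployment/web
```

Read this way, a deleted pod "coming back" stops being mysterious: the pod was never the declaration. The ReplicaSet still declares a replica count, so its controller creates a replacement.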

What the control loop buys you, and what it costs

The reconciliation model is genuinely valuable, and it is worth being precise about why, because the value is not “containers.”

Self-healing by default. Recovery is not a script someone wrote; it is the system’s steady-state behavior. Designing for recovery rather than perfection is a principle I push everywhere, and Kubernetes has it built into its bones.

A uniform API over infrastructure. Compute, networking, storage, and configuration are all declared through one API with one authentication model and one audit surface. That uniformity is what makes platform teams possible at scale: the same manifests, reviewed the same way, deployed to any conformant cluster.

Declarative operations. The cluster’s desired state can live in version control, which makes changes reviewable, revertible, and reconstructable. This is the property that connects Kubernetes to the broader infrastructure-as-code operating model: the win is not the YAML, it is that the change process becomes an engineering process.
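One practical consequence of keeping desired state in version control is that drift becomes a computable difference rather than an argument. A minimal sketch, assuming both the declared and the live state have already been loaded into flat dictionaries (the keys, the image names, and the `diff_state` helper are hypothetical):

```python
MISSING = "<absent>"  # marker for a key present on only one side


def diff_state(desired: dict, live: dict) -> dict:
    """Return {key: (desired_value, live_value)} for every key that differs."""
    drift = {}
    for key in sorted(desired.keys() | live.keys()):
        want = desired.get(key, MISSING)
        have = live.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: someone patched the image by hand during an incident.
desired = {"replicas": 5, "image": "registry.example/web:1.4.2", "cpu_request": "250m"}
live = {"replicas": 5, "image": "registry.example/web:1.4.3-hotfix", "cpu_request": "250m"}

print(diff_state(desired, live))
# {'image': ('registry.example/web:1.4.2', 'registry.example/web:1.4.3-hotfix')}
```

An empty result means the cluster matches what was reviewed. A non-empty one is a change that bypassed the change process.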

Now the cost, stated just as plainly. Kubernetes is itself a distributed system, with a consensus store, a scheduler, controller managers, a software-defined network, and its own failure modes, and it sits underneath your applications. Every layer you add is a layer that can fail, and this one fails in ways that require real understanding to diagnose. You are not adopting a tool. You are adopting an operating model and the obligation to staff it.

When it earns its complexity

Complexity is rarely accidental. It is usually purchased, and the question is whether you needed what it bought.

Kubernetes earns its keep when several of these are true at once: many services with independent lifecycles, multiple teams shipping to shared infrastructure, real availability requirements that demand automated recovery, workload density worth bin-packing, or a genuine multi-environment story where the same API on-premises and in cloud matters. Notice these are organizational properties as much as technical ones. Kubernetes is a platform for platform teams; its leverage compounds with the number of teams standing on it.

It does not earn its keep for a monolith with modest traffic, a three-person team, a workload that a VM and a systemd unit have served faithfully for years, or a database that was happy where it was. The fashionable pressure to run everything on Kubernetes is product thinking, not systems thinking. A boring VM that nobody has touched in two years is not technical debt; it is often the best-engineered thing in the building.

The same discipline applies to where the clusters run. Managed control planes remove real operational burden and are the right default for most organizations, but the cloud-versus-on-premises question is a workload-level economic decision, not an identity. I keep the full argument in the cloud vs on-premises decision framework.

Deciding honestly requires knowing what production operation actually involves. Which brings us to the map.

The production-readiness territory

“Production-ready Kubernetes” is a territory, not a checkbox. These are its regions, and any credible platform has a deliberate answer in each.

| Territory | The question it answers | Where it goes wrong |
|---|---|---|
| Control plane & etcd | Does the cluster survive node and quorum failures? | Untested backups; single-member etcd discovered during an outage |
| Networking & ingress | How does traffic enter, and what may talk to what? | Flat pod networks with no policy; ingress as an afterthought |
| Storage | What happens to data when a pod moves? | Stateful workloads on storage nobody benchmarked or backs up |
| Resource management | What may a workload consume? | No requests/limits; the first noisy neighbor evicts the wrong pod |
| Security | Who and what is trusted, and how is it enforced? | Cluster-admin everywhere; images from anywhere; no admission policy |
| Observability | Can you see the platform and the workloads? | Dashboards for apps, blindness for the control plane |
| Delivery | How do changes reach the cluster? | kubectl apply from laptops; drift nobody can explain |
| Upgrades | How do you stay current without downtime? | Versions frozen for two years; the upgrade becomes a migration |

The detailed walk through this table (what actually changes between a tutorial cluster and one you can defend in an architecture review) is in Production Kubernetes Architecture: What Actually Changes. The assembled reference design, with components chosen and justified, is in the Production Kubernetes Platform entry of the architecture library.

Two regions deserve emphasis because they are where I see the most expensive surprises.

etcd is the cluster. Everything else is reconstructable from it; nothing reconstructs it. A team that has never restored etcd from backup does not have backups; it has hope. The procedure is short enough that there is no excuse; I keep a working runbook in etcd backup and restore on kubeadm clusters.
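The quorum question in the table's first row comes down to simple arithmetic. etcd needs a majority of its members to agree before it accepts a write, so the number of member failures it tolerates follows directly from the member count. A short sketch (the function names are mine):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
# members=1 quorum=1 tolerated_failures=0
# members=2 quorum=2 tolerated_failures=0
# members=3 quorum=2 tolerated_failures=1
# members=4 quorum=3 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
```

Two things stand out. A single-member etcd tolerates nothing, and an even member count adds a node without adding any failure tolerance, which is why odd sizes are the usual choice.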

Observability must cover the platform, not just the apps. Kubernetes failures frequently present as application symptoms (latency, restarts, mysterious 502s) with platform causes: pressure on a node, a misbehaving webhook, DNS. If your telemetry stops at the application boundary, every platform incident begins with a guessing phase. The design for a stack that covers both layers without unbounded cost is in Observability Stack Design.

The operations reality

Everything becomes operations. Kubernetes stopped being exciting years ago, which is precisely why it is now safe to depend on, and the teams that succeed with it are the ones that treat day two as the actual job.

The operational rhythms recur. Minor releases arrive roughly three times a year, and upgrades cannot be deferred indefinitely, because the version skew policy and API deprecations turn neglect into forced migration. Certificates expire. Node pools need patching and replacement. Capacity needs re-measuring as workloads drift from their original requests. None of this is difficult individually; collectively it is a standing workload that someone must own by name.
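The cost of deferral is easy to put a number on. The sketch below assumes the one-minor-version-at-a-time rule that kubeadm documents (skipping minor versions during an upgrade is unsupported); managed offerings may have their own rules. The `upgrade_path` helper and the version numbers are illustrative only.

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Split a 'major.minor' string such as '1.24' into integers."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def upgrade_path(current: str, target: str) -> list[str]:
    """List every minor version that must be stepped through, in order."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("target must be the same major and not older than current")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


# A cluster frozen for about two years, at roughly three minor releases a year.
hops = upgrade_path("1.24", "1.30")
print(len(hops), hops)
# 6 ['1.25', '1.26', '1.27', '1.28', '1.29', '1.30']
```

Each hop is its own upgrade, with its own deprecations to check and its own validation to run. Six of them back to back is the "upgrade becomes a migration" failure mode from the table.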

And things break. When they do, the difference between a twenty-minute incident and a four-hour one is rarely knowledge of some obscure flag; it is method. Kubernetes failures span layers (application, pod, node, network, control plane), and unsystematic debugging bounces between them until luck intervenes. The systematic version (work the layers in order, let the reconciliation loop tell you what it is stuck on) is the subject of Kubernetes Troubleshooting: A Method That Always Finds It.
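The shape of "work the layers in order" can be sketched in a few lines. This is only an illustration of the discipline, not the method from the troubleshooting article: the layer order is taken from the list above, and the checks are hypothetical placeholders for whatever probes you actually run at each layer.

```python
from typing import Callable, Optional

# Order taken from the layers named above; treat it as an assumption.
LAYERS = ["application", "pod", "node", "network", "control plane"]


def first_failing_layer(checks: dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run one health check per layer, in a fixed order.

    Returns the first layer whose check fails, or None if all pass.
    Layers without a registered check are skipped.
    """
    for layer in LAYERS:
        check = checks.get(layer)
        if check is not None and not check():
            return layer
    return None


# Hypothetical incident: the node check fails, so investigation starts there
# before the failing network check is even considered.
checks = {
    "application": lambda: True,
    "pod": lambda: True,
    "node": lambda: False,
    "network": lambda: False,
    "control plane": lambda: True,
}

print(first_failing_layer(checks))  # node
```

The value is in the fixed order. A failing network check may simply be a consequence of the failing node, and visiting layers in sequence keeps you from chasing the symptom first.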

One more honest observation: the skills are learnable but not free, and the best place to pay the tuition is not production. A lab cluster you deliberately break and repair (restore etcd, kill nodes, fill disks) teaches the failure modes at zero blast radius. I have run mine that way for years; the approach is in Building a Home Lab With Production Discipline.

Where to go deeper

Reading paths through the articles this page anchors.

If you are architecting a production platform:

  1. Production Kubernetes Architecture: What Actually Changes — the territory table above, expanded into design decisions.
  2. Production Kubernetes Platform — the reference architecture: concrete component choices with rationale.
  3. Infrastructure as Code Is an Operating Model, Not a Tool — the delivery discipline the platform should sit inside.

If you operate clusters today:

  1. Kubernetes Troubleshooting: A Method That Always Finds It — the layered diagnostic method, worth internalizing before the next incident rather than during it.
  2. etcd backup and restore on kubeadm clusters — the runbook for the one component you cannot lose.
  3. Observability Stack Design — platform-and-workload telemetry with the cost model attached.

If you are still deciding:

The longer view

Orchestrators came before Kubernetes and something will eventually come after it. The ideas it standardized will outlive it: desired state in version control, reconciliation as the recovery mechanism, a uniform API over heterogeneous infrastructure, health as something the system maintains rather than something operators restore.

Learn those ideas rather than the tool and the next platform is a vocabulary change. Adopt the tool without the ideas and you get what the industry has too much of already: complex clusters running workloads that never needed them, operated by teams who were promised the platform would run itself.

Technologies keep changing. The engineering principles behind reliable systems rarely do.

Frequently asked questions

What is Kubernetes, in plain terms?
Kubernetes is a system that continuously reconciles running infrastructure toward a declared desired state. You submit declarations (run five replicas of this container, expose them here), and controllers watch reality, compare it to the declaration, and act on the difference. Containers restart when they die; workloads reschedule when nodes fail. Everything else (pods, services, operators) is elaboration on that one reconciliation loop.
When should you use Kubernetes, and when is it overkill?
Use it when you have many services, multiple teams shipping independently, real availability requirements, and the staffing to operate a platform (or a managed offering plus at least a couple of engineers who genuinely understand it). Skip it for a monolith with modest traffic, a small team, or workloads a VM and a systemd unit serve fine. The honest test: if the operational investment does not obviously pay for itself, it will not.
What does production-ready Kubernetes actually mean?
It means the failure paths have been engineered, not just the happy path. Concretely: a highly available control plane with tested etcd backup and restore, ingress and network policy that reflect a real security model, persistent storage with a recovery story, resource requests and limits set from measurement, observability into both the platform and workloads, and a delivery path where cluster and app changes are declarative and reviewed.
Is managed Kubernetes (EKS, GKE, AKS) enough for production?
It removes real work (control-plane hosting, etcd operations, control-plane upgrades), and for most organizations it is the right default. It does not remove the larger share: workload architecture, resource management, networking and ingress, storage, observability, security policy, and upgrades of everything you run on the cluster. Managed Kubernetes is a better foundation, not an operations department. Day-two ownership stays with you.
How hard is Kubernetes to operate, really?
The learning curve is front-loaded and real: a new vocabulary, a distributed control plane, and failure modes that span layers. Once a team internalizes the reconciliation model and builds a systematic troubleshooting habit, operations become routine rather than heroic. The teams that struggle long-term are the ones that adopted it without anyone owning the platform, treating the cluster as a deployment target instead of a system that itself needs engineering.
