Research
Autonomous infrastructure: how far can self-healing go?
An investigation into closed-loop infrastructure automation — from auto-remediation to AI-driven operations — and where human accountability must survive.
Executive summary
Autonomous infrastructure is the progression from monitoring that tells humans what broke, to systems that remediate themselves within engineered safety boundaries. The building blocks exist today — health-based rollbacks, auto-scaling, operator patterns in Kubernetes, runbook automation — but the gap between 'automated remediation' and 'autonomous operations' is where the interesting engineering and the real risk both live. This research program tests how far the loop can close in practice, and argues that the limiting factor is not AI capability but accountability design: automation may act, but a human must always own the outcome.
The question
Every operations organization runs the same ladder: monitoring tells you something broke; runbooks tell a human what to do about it; automation does the runbook’s mechanical steps; and at the top — mostly aspirational — the system notices, decides, and fixes without anyone waking up.
The question this program investigates is not whether the top rung is possible. Narrow versions of it run in production everywhere already: Kubernetes has been restarting failed containers and rescheduling workloads off dead nodes for a decade, and a health-gated deploy that rolls itself back is closed-loop remediation by any honest definition. The question is how far the loop can close before the failure modes of the automation exceed the failures it remediates — and what has to be true of the architecture when it does.
Method
Three tracks, deliberately small and instrumented:
- Lab implementation. In my own environment I’m building out the remediation ladder explicitly: alert → runbook → supervised automation → unsupervised automation, one failure class at a time (disk pressure, certificate expiry, unhealthy workloads, node loss). Each promotion up the ladder requires a written safety case: reversibility, blast radius, and the metric that would reveal the automation itself misbehaving.
The safety case, as I’m using the term, fits on one page: the trigger condition and its known false-positive modes; the action and its worst-case blast radius; the reversal procedure, with a measured time to reverse; the action budget; and the telemetry that would show the remediator itself going wrong. If the page cannot be written, the automation is not ready. The exercise has killed more of my own proposals than any design review would have, which I take as evidence it is doing its job.
- Failure catalog. A growing collection of public post-mortems and practitioner accounts where automation caused or amplified the outage: remediation storms, feedback loops between auto-scalers, health checks that lied, cascading restarts. The patterns repeat with striking regularity, which suggests they are architectural, not incidental.
- Accountability mapping. For each automated action class, answering the question organizations skip: when this fires wrong at 3 a.m., who owned the decision? This track connects directly to my AI integrity work; it is the same principle, applied to operations: capability may be delegated to machines; accountability may not.
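The one-page safety case from the lab track can be sketched as a record with a completeness check. This is a minimal illustration, not my lab tooling: the class name `SafetyCase`, its field names, and the rule that a non-positive reversal time counts as "not measured" are all assumptions made for the example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SafetyCase:
    """One-page safety case for a single automation (hypothetical schema)."""
    trigger_condition: str            # what fires the automation
    false_positive_modes: list[str]   # known ways the trigger can lie
    action: str                       # what the automation does
    worst_case_blast_radius: str      # what breaks if the action is wrong
    reversal_procedure: str           # how the action is undone
    measured_reversal_seconds: float  # measured, not estimated
    action_budget: str                # e.g. "5 restarts per 10 minutes"
    remediator_telemetry: list[str]   # signals showing the remediator going wrong


def missing_sections(case: SafetyCase) -> list[str]:
    """Return the names of sections that are blank or not yet measured."""
    missing = []
    for f in fields(case):
        value = getattr(case, f.name)
        if isinstance(value, (int, float)):
            empty = value <= 0          # assumption: 0 means "never measured"
        elif isinstance(value, str):
            empty = not value.strip()
        else:
            empty = not value           # empty list
        if empty:
            missing.append(f.name)
    return missing


def ready_for_promotion(case: SafetyCase) -> bool:
    """If the page cannot be written in full, the automation is not ready."""
    return not missing_sections(case)
```

A check like this only proves the page is filled in, not that it is true; the measured reversal time and the false-positive modes still have to come from running the thing.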
Early findings
Reversibility is the real boundary, not intelligence. The useful question about any proposed automation is not “is the model smart enough” but “is the action reversible at machine speed?” Every incident in my failure catalog that turned automation into catastrophe crossed a one-way door without a human. The working classification so far:
| Action class | Reversible at machine speed? | Posture |
|---|---|---|
| Restart workload, reschedule off a node | Yes — seconds | Automate freely, within budget |
| Roll back a deploy | Yes — if artifacts and schema allow | Automate behind health gates |
| Shift traffic between sites/pools | Yes — minutes, rate-limited | Automate with gradual ramps |
| Scale down, reclaim resources | Partially — state decides | Supervised automation only |
| Delete data, rotate credentials, revoke identity | No | Human decision, logged owner |
| Fail over a database on async replication | No — divergence is permanent | Human decision, rehearsed |
| Isolate a network segment or host | No — destroys evidence, halts business | Human decision, pre-negotiated authority |
The table is deliberately boring. That is the point: the boundary between “automate” and “ask a human” should be an engineering classification, not a per-incident judgment call made at 3 a.m. by whoever is awake.
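One way to make the classification an engineering artifact rather than a judgment call is to encode the table as data that the remediator consults before acting. The sketch below is illustrative only: the action identifiers are hypothetical names for the table's rows, and the fail-closed default for unclassified actions is my assumption, not something the table states.

```python
from enum import Enum


class Posture(Enum):
    AUTOMATE_WITHIN_BUDGET = "automate freely, within budget"
    AUTOMATE_BEHIND_HEALTH_GATES = "automate behind health gates"
    AUTOMATE_GRADUAL_RAMP = "automate with gradual ramps"
    SUPERVISED_ONLY = "supervised automation only"
    HUMAN_DECISION = "human decision"


# Hypothetical identifiers, one or more per row of the table above.
POSTURE_BY_ACTION = {
    "restart_workload": Posture.AUTOMATE_WITHIN_BUDGET,
    "reschedule_off_node": Posture.AUTOMATE_WITHIN_BUDGET,
    "rollback_deploy": Posture.AUTOMATE_BEHIND_HEALTH_GATES,
    "shift_traffic": Posture.AUTOMATE_GRADUAL_RAMP,
    "scale_down": Posture.SUPERVISED_ONLY,
    "reclaim_resources": Posture.SUPERVISED_ONLY,
    "delete_data": Posture.HUMAN_DECISION,
    "rotate_credentials": Posture.HUMAN_DECISION,
    "revoke_identity": Posture.HUMAN_DECISION,
    "failover_async_database": Posture.HUMAN_DECISION,
    "isolate_network_segment": Posture.HUMAN_DECISION,
    "isolate_host": Posture.HUMAN_DECISION,
}

# Postures under which an action may run with nobody watching.
UNATTENDED = {
    Posture.AUTOMATE_WITHIN_BUDGET,
    Posture.AUTOMATE_BEHIND_HEALTH_GATES,
    Posture.AUTOMATE_GRADUAL_RAMP,
}


def posture_for(action: str) -> Posture:
    """Look up an action; anything unclassified falls back to a human decision."""
    return POSTURE_BY_ACTION.get(action, Posture.HUMAN_DECISION)


def may_run_unattended(action: str) -> bool:
    """True only for actions classified as reversible at machine speed."""
    return posture_for(action) in UNATTENDED
```

The design choice that matters is the default: an action nobody has classified is treated as a one-way door until someone writes down otherwise.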
Remediation needs budgets, like error budgets. The nastiest automation failures are storms: the remediator fixing the same symptom in a loop, faster than humans can notice, each “fix” feeding the trigger. The countermeasure is embarrassingly simple and almost never implemented — an action budget. If the automation restarts more than N workloads in M minutes, it halts itself and pages a human, on the reasoning that a remediation running that hot has become the incident. My lab implementation treats budget exhaustion as a first-class alert, and it has already caught one honest bug in my own controller logic.
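A minimal sketch of such a budget, assuming a sliding window and a halt that only a human clears. The names `ActionBudget`, `remediate`, `restart`, and `page_human` are hypothetical, and this is not the controller from my lab; it only shows the shape of the mechanism.

```python
import time
from collections import deque
from typing import Callable


class ActionBudget:
    """Sliding-window budget: at most max_actions within window_seconds."""

    def __init__(self, max_actions: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._clock = clock            # injectable so the budget can be tested
        self._timestamps = deque()     # times of actions still inside the window
        self.halted = False

    def try_spend(self) -> bool:
        """Record one action if budget remains; otherwise halt and refuse."""
        if self.halted:
            return False
        now = self._clock()
        # Forget actions that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_actions:
            self.halted = True         # stays halted until reset() is called
            return False
        self._timestamps.append(now)
        return True

    def reset(self) -> None:
        """Clear the halt; intended to be called only after a human has looked."""
        self._timestamps.clear()
        self.halted = False


def remediate(budget: ActionBudget, restart: Callable[[str], None],
              page_human: Callable[[str], None], workload: str) -> bool:
    """Restart a workload if the budget allows; page once when it runs out."""
    was_halted = budget.halted
    if budget.try_spend():
        restart(workload)
        return True
    if not was_halted:                 # page on the transition, not on every refusal
        page_human(f"action budget exhausted; halted before restarting {workload}")
    return False
```

With `ActionBudget(max_actions=5, window_seconds=600)`, the sixth restart attempt inside ten minutes is refused, the remediator stops, and exactly one page goes out. The halt is deliberately sticky: the window sliding forward does not resume automation on its own.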
Diagnosis automates worse than remediation. Counterintuitively, structured diagnosis resists automation more than the fix does. Automation is excellent at response (symptom → known action) and poor at investigation (novel symptom → cause). Language models are changing this faster than any prior technology did: an LLM that reads the same telemetry an engineer would and proposes a ranked hypothesis list is genuinely useful today. But proposing hypotheses to a human and acting on them autonomously are different safety classes, and the temptation to collapse them is where I expect the industry’s next generation of self-inflicted outages to come from.
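One way to keep the two safety classes from collapsing is to give them different types, so that a proposal cannot reach the executor without passing through a named human. The sketch below assumes this type-separation approach; `Hypothesis`, `ApprovedAction`, `approve`, and `execute` are hypothetical names, and the confidence score stands in for whatever a model would report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A proposed cause with a suggested action; it carries no ability to act."""
    cause: str
    confidence: float        # model-reported score, not a calibrated probability
    suggested_action: str


@dataclass(frozen=True)
class ApprovedAction:
    """An action that a named human has accepted ownership of."""
    action: str
    approved_by: str
    based_on: Hypothesis


def rank(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Order proposals for human review, highest score first."""
    return sorted(hypotheses, key=lambda h: h.confidence, reverse=True)


def approve(hypothesis: Hypothesis, approver: str) -> ApprovedAction:
    """The only path from proposal to action, and it requires a named owner."""
    if not approver.strip():
        raise ValueError("an approved action needs a named human owner")
    return ApprovedAction(hypothesis.suggested_action, approver, hypothesis)


def execute(item: ApprovedAction) -> str:
    """Refuse anything that is not an ApprovedAction, including raw hypotheses."""
    if not isinstance(item, ApprovedAction):
        raise TypeError("hypotheses are proposals; only approved actions may run")
    # A real executor would dispatch to a runbook here; this one only reports.
    return f"running {item.action!r}, owned by {item.approved_by}"
```

Passing a `Hypothesis` straight to `execute` raises `TypeError`, while `execute(approve(h, "alice"))` runs and records who owned the decision.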
Observability is the ceiling. Nothing can be safely automated that cannot be precisely observed; the telemetry architecture sets the upper bound on trustworthy autonomy. Teams that want autonomous operations but haven’t instrumented decision-grade signals are asking the automation to act on data a human would refuse to act on.
Open questions
- Can the safety case for an automation be tested continuously — chaos engineering aimed at the remediators themselves — rather than argued once in a design review?
- Where does LLM-driven remediation planning fit the reversibility rule? A generated plan is novel by construction, which breaks the “known action” assumption that makes response automation safe.
- What does the on-call role become when the loop mostly closes? My hypothesis: it shifts from executing recovery to auditing the machine’s recoveries — which is a different skill, staffed and trained differently.
The end state I am willing to defend so far: infrastructure that heals itself inside engineered, budgeted, reversible boundaries — with humans owning every boundary crossing, and with the automation’s own behavior watched as skeptically as the systems it tends.
None of this is really about AI. It is the same principle that governs interlocks in a plant and circuit breakers in a distribution panel: autonomy is granted by boundary design, not by capability. The models will keep getting more capable; the boundaries are still ours to engineer, and whether we engineer them is the whole question. Updates will follow as the lab work and the failure catalog grow; disagreements are welcome via the contact page.
Frequently asked questions
- What is self-healing infrastructure?
  Self-healing infrastructure detects failures and restores service without a human in the loop: restarting failed workloads, replacing unhealthy nodes, rolling back bad deploys, shifting traffic away from degraded dependencies. Kubernetes controllers, health-gated deployments, and auto-remediation runbooks are the mainstream building blocks; the hard engineering is bounding what they are allowed to do.
- What should never be fully automated in infrastructure operations?
  Actions that are irreversible, safety-relevant, or destroy evidence: data deletion, credential and identity changes, security containment decisions with business impact, and anything touching physical processes. The engineering rule: automation acts freely inside reversible boundaries; crossing an irreversible boundary requires a human decision that someone accountably owns.
- Is AIOps the same as autonomous infrastructure?
  No. AIOps as commonly sold is anomaly detection and alert correlation; it informs humans faster. Autonomous infrastructure closes the loop: the system takes the remediation action itself. The second requires safety engineering (blast-radius limits, action budgets, kill switches) that detection products don’t need.