Restart Cannot Fix Overload — pete.lostsource.net

There is a particular kind of incident where the system spends its energy trying to fix itself, fails, and then spends more energy. The fix the system reaches for is real. It just operates on the wrong layer.

A few days ago I watched one of my services restart itself every couple of minutes. The container runtime kept declaring it unhealthy. Each restart added a chunk of cold-start work to a host that was already hot. The signal that triggered the restart was technically correct — something was slow. The action it triggered — kill the process, start a new one — addressed none of it.

The probe was the bug.

What the probe was actually measuring

The healthcheck endpoint did what a lot of healthcheck endpoints do: it answered a deep readiness question. Can I serve real traffic? To answer that honestly, it walked through some live counts against a large local database. Under normal conditions the walk completes in a few hundred milliseconds. Under thermal throttle, with the CPU pinned at its maximum junction temperature and concurrent workloads fighting for the same cores, the walk slowed to several seconds.

The container runtime’s healthcheck had a five-second timeout. Three consecutive failures meant unhealthy. An autoheal sidecar saw unhealthy and did what autoheal sidecars do — docker restart.

The new process came up. It started serving. The healthcheck queries started running again. The host was still hot. The queries still took several seconds. Three failures, restart, repeat.
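The loop is mechanical enough to sketch as a toy simulation. This is not the runtime's or the sidecar's actual code: the name `count_restarts` and the 7-second latency are assumptions for illustration (the incident only gives "several seconds"), while the five-second timeout and the three-failure threshold come from the setup described above.

```python
def count_restarts(probe_latency_s: float, timeout_s: float = 5.0,
                   retries: int = 3, checks: int = 12) -> int:
    """Count the restarts an autoheal-style watcher issues over `checks` probes."""
    restarts = 0
    consecutive_failures = 0
    for _ in range(checks):
        if probe_latency_s > timeout_s:
            consecutive_failures += 1
        else:
            consecutive_failures = 0
        if consecutive_failures >= retries:
            restarts += 1
            # The restart resets the container's health state, but the host
            # is still hot, so probe_latency_s stays exactly where it was.
            consecutive_failures = 0
    return restarts


print(count_restarts(0.3))  # healthy host: the probe never times out
print(count_restarts(7.0))  # throttled host: one restart per three probes
```

The restart only resets the failure counter. Probe latency is an input set by the host, so nothing inside the loop can ever lower it.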

Nothing the process did from inside its own boundary could change the temperature of the silicon it was running on. The restart loop was a perfectly executed answer to the wrong question.

Liveness and readiness are different questions

Kubernetes formalized this distinction years ago, and the docs are explicit about it¹:

A liveness probe answers: is this process stuck in a way that a restart would fix? Deadlock. Wedged event loop. Memory corruption you can’t recover from. The kill-and-restart action has to actually address the failure mode.
A readiness probe answers: should this instance receive traffic right now? Dependencies loading. Cache warming. Downstream service unavailable. The action here is to stop sending requests, not to restart.
A startup probe (added in 1.16 as alpha, stable in 1.20)² answers: has initialization finished? It was separated out because slow-starting apps were getting killed by liveness probes before they ever became live, producing an infinite restart loop³.

The point is not the names. The point is that each probe corresponds to a different recovery action, and wiring a question to the wrong probe, and therefore to the wrong action, is what generates cascades.

Tim Hockin, who designed the probe API, has been clear about this for years⁴, and so has the community guidance. Henning Jacobs at Zalando wrote the canonical “liveness probes are dangerous” piece back in 2019, and it still reads like a fresh warning: “A Liveness Probe in combination with an external DB health check dependency is the worst situation: a single DB hiccup will restart all your containers!”⁵

The Kubernetes docs themselves now carry explicit cascading-failure language: “Incorrect implementation of liveness probes can lead to cascading failures. This results in restarting of container under high load; failed client requests as your application became less scalable; and increased workload on remaining pods…”⁶

None of this is new.

The non-Kubernetes version of the problem

What bit me wasn’t running in Kubernetes. It was running in plain Docker Compose with a sidecar that watches healthcheck status and restarts unhealthy containers — the willfarrell/autoheal pattern that exists because Docker itself has never natively shipped restart-on-unhealthy behavior. The original moby issue requesting it has been open since 2016⁷. The autoheal container has filled the gap for nearly a decade, currently sitting at over 100M pulls, still actively maintained⁸.

The trouble with that pattern is that it collapses a useful distinction. Kubernetes makes you write three different probes for three different questions. Docker Compose gives you one HEALTHCHECK field, one status, one switch on the sidecar. Whatever you measure becomes liveness by default, because the only available reaction is restart.

So you write the most informative healthcheck you can. You include the deep checks. You count dependencies. You make the endpoint useful for your dashboards. And then the same endpoint, with the same expensive queries, becomes the trigger for kill-and-restart under exactly the conditions where the queries get expensive.

The vocabulary for this exists: “shallow” versus “deep” health checks. AWS, Spring, and most of the microservices literature have been using these terms for years⁹ ¹⁰. A shallow check verifies the process is responsive. A deep check verifies the process can do useful work, including reaching its dependencies. They are different artifacts answering different questions, and the action they should trigger is different.

If your runtime only has one knob and that knob is “restart on failure,” the only healthcheck you can safely wire into it is a shallow one.

What restart can and cannot do

The mental model I want to leave for the next time I see this: every restart is an answer to a cause. Match the answer to the cause and the restart fixes the problem. Mismatch them and the restart becomes part of the load.

Restart can fix:

A process whose event loop is deadlocked.
A worker that has wedged on a corrupted cache.
A handler that has leaked memory beyond what GC can recover.
A connection pool that has gotten into an unrecoverable state.

These are all things inside the process boundary. The process is the thing the restart kills and reinitializes, so the failure has to live inside that boundary for the cure to reach it.

Restart cannot fix:

A saturated host.
A thermally throttled CPU.
A slow downstream database that everyone in the cluster shares.
A network partition.
A storage volume under contention.

None of these change when the process dies. Some of them get worse when the process dies, because the restart itself consumes the resource that was already saturated. Cold-start work piles onto a host that was already hot. Reconnection storms hit a database that was already slow. The probe that triggered the restart is going to fire again as soon as the process comes back up, because the underlying condition is unchanged.
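A toy model makes the asymmetry concrete. Everything here is an assumption for illustration, not a measurement from the incident: latency is modeled as a base latency divided by spare host capacity, and the function names, the 95% load figure, and the 3% cold-start cost are all hypothetical.

```python
def probe_latency_s(base_latency_s: float, host_load: float) -> float:
    """Toy model: probe latency scales inversely with spare host capacity."""
    spare = max(1.0 - host_load, 0.01)  # floor avoids division by zero
    return base_latency_s / spare


def load_after_restart(host_load: float, cold_start_cost: float) -> float:
    """A restart leaves external load untouched and adds cold-start work on top."""
    return min(host_load + cold_start_cost, 1.0)


# A 0.3 s probe on a host that is already 95% busy for external reasons.
before = probe_latency_s(0.3, host_load=0.95)
after = probe_latency_s(0.3, load_after_restart(0.95, cold_start_cost=0.03))

print(f"before restart: {before:.1f}s, after restart: {after:.1f}s")
```

Under these assumptions the probe is already past a five-second timeout before the restart and further past it afterwards. The load term the restart can influence only moves in the wrong direction.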

This is the same shape as the cascading-failure pattern Google’s SRE book describes in its chapter on the subject¹¹ — a feedback loop where the recovery mechanism feeds the failure it was meant to recover from. It just happens to manifest, in this case, at the healthcheck-probe layer.

The fix is structural, not parametric

When I hit this, I had a tempting bad option: make the timeout looser. Go from five seconds to fifteen. Maybe twenty.

That fix preserves the architecture and merely raises the threshold where the cascade triggers. It’s a knob, not a redesign. The probe is still measuring the wrong thing, and the next time the host gets hotter or the database gets bigger or the queries get more expensive, the cascade returns.
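A back-of-the-envelope calculation shows why. It assumes a toy model in which probe latency equals a base latency divided by spare host capacity. The model, the 0.3-second base latency, and the name `breaking_load` are illustrative assumptions rather than anything measured.

```python
def breaking_load(base_latency_s: float, timeout_s: float) -> float:
    """Host load at which probe latency reaches the timeout (toy model)."""
    # Assumed model: latency = base_latency_s / (1 - load).
    # Setting latency equal to timeout_s and solving for load gives:
    return 1.0 - base_latency_s / timeout_s


thresholds = {t: breaking_load(0.3, t) for t in (5.0, 15.0, 20.0)}
for timeout_s, load in thresholds.items():
    print(f"timeout {timeout_s:.0f}s: probe starts failing at {load:.1%} host load")
```

In this model, quadrupling the timeout moves the breaking point from about 94% load to about 98.5%. The threshold shifts, but it is still there, and a larger database raises the base latency and pulls it back down.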

The real fix is to separate the questions:

One endpoint for liveness: shallow, fast, in-process. Does the HTTP handler respond? Is the event loop turning? Is the process not deadlocked? Microseconds, not milliseconds. No database. No I/O outside the process. The action wired to its failure is restart, so it must only measure things restart can fix.
One endpoint for deep status: slow, cached, observable. Walk the database. Count the records. Check the upstream services. Cache the result behind a short TTL so dashboards and Prometheus scrapes don’t all trigger fresh walks at once. Surface the depth as a query parameter or a separate path so it’s clearly not the liveness contract. The action wired to its failure is to page someone, not to kill the process.
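The two endpoints described above might look something like the following sketch, using only the Python standard library. The paths `/livez` and `/status`, the 30-second TTL, and the names `CachedStatus` and `make_handler` are assumptions for illustration. This is not the service's actual code.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler
from typing import Callable, Optional


class CachedStatus:
    """Runs an expensive check at most once per TTL and serves the cached result."""

    def __init__(self, check: Callable[[], dict], ttl_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._ttl_s = ttl_s
        self._clock = clock
        self._lock = threading.Lock()
        self._value: Optional[dict] = None
        self._expires_at = 0.0

    def get(self) -> dict:
        # Holding the lock during the check means concurrent callers wait
        # for one walk instead of each starting their own.
        with self._lock:
            now = self._clock()
            if self._value is None or now >= self._expires_at:
                self._value = self._check()
                self._expires_at = now + self._ttl_s
            return self._value


def make_handler(status: CachedStatus) -> type:
    """Build a request handler class bound to one CachedStatus instance."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path == "/livez":
                # Shallow: no database, no lock, no I/O beyond this response.
                self._send(200, {"alive": True})
            elif self.path == "/status":
                # Deep: may be slow, but it is never wired to a restart.
                self._send(200, status.get())
            else:
                self._send(404, {"error": "not found"})

        def _send(self, code: int, payload: dict) -> None:
            body = json.dumps(payload).encode()
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format: str, *args) -> None:
            pass  # keep probe traffic out of the logs

    return Handler
```

To serve it, pass the handler to `http.server.ThreadingHTTPServer`, for example `ThreadingHTTPServer(("0.0.0.0", 8080), make_handler(CachedStatus(walk_database)))`, where `walk_database` stands in for whatever deep check the service already has. The threaded server matters here: with a single-threaded one, a slow `/status` walk would block `/livez` and recreate the original problem.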

In Docker Compose, this means the HEALTHCHECK directive — the one autoheal watches — points at the shallow endpoint. The deep endpoint exists for human consumption and for monitoring systems that can do something useful with a slow-and-unhealthy signal, like alert. Kubernetes users get the same split for free by writing separate livenessProbe and readinessProbe configurations against separate paths.
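On the Compose side, the health command can be a small script that hits only the shallow endpoint. Docker treats a health command's exit status 0 as healthy and 1 as unhealthy. The sketch below assumes Python is available in the image. The name `probe`, the URL, and the two-second timeout are all hypothetical.

```python
import urllib.request
from typing import Callable


def probe(url: str, timeout_s: float = 2.0,
          fetch: Callable = urllib.request.urlopen) -> int:
    """Return a HEALTHCHECK-style exit code: 0 for healthy, 1 for unhealthy."""
    try:
        with fetch(url, timeout=timeout_s) as response:
            return 0 if response.status == 200 else 1
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError,
        # all of which are OSError subclasses.
        return 1
```

A wrapper script would end with something like `sys.exit(probe("http://127.0.0.1:8080/livez"))`. The `fetch` parameter exists only so the function can be exercised without a live server.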

The general principle, stripped of any particular runtime: the probe whose failure restarts something must only measure things a restart can fix.

What I’m taking from this

The bug was not in the database. The bug was not in the host being thermally throttled. The bug was not even in the probe being slow. The bug was that I had wired a deep readiness signal to a restart action, in a runtime that only offered one wire.

A lot of incidents look like this in retrospect. The thing that fired is doing exactly what it was configured to do. The configuration was reasonable when written. It just encoded a category error about what the recovery mechanism was actually capable of fixing.

Self-healing systems are good. Self-healing systems that act on the wrong layer are worse than no healing at all, because they consume capacity while making the problem they were meant to solve harder to diagnose. The cure has to reach the cause. If it doesn’t, the cure is part of the load.

— Pete

¹ Kubernetes, “Configure Liveness, Readiness and Startup Probes”, official documentation. Defines each probe as answering a distinct question with a distinct recovery action.
² Kubernetes Enhancement Proposal #950, “Add pod-startup liveness-probe holdoff for slow-starting pods”, 2019. Alpha in 1.16, beta in 1.18, stable (GA) in 1.20 (December 2020).
³ vCluster, “Kubernetes Startup Probes – Examples & Common Pitfalls”, February 2021. Motivation: slow-starting apps were being killed by liveness probes before initialization completed, producing infinite restart loops.
⁴ Tim Hockin, “Kubernetes Pod Probes”, Speaker Deck, January 2023. Hockin is the designer of the probe API and a long-time maintainer of the Kubernetes node subsystem. The deck walks through the state machine of each probe type.
⁵ Henning Jacobs (Zalando), “Kubernetes Liveness Probes Are Dangerous”, 2019. The widely-cited piece that articulated the cascade pattern. Also notes that Pod Disruption Budgets do not constrain liveness-probe-triggered restarts, an often-missed nuance.
⁶ Kubernetes, “Liveness, Readiness, and Startup Probes”, official documentation. Explicit cascading-failure warning added to the canonical guidance.
⁷ moby/moby issue #28400, “Restart container on unhealthy status”, opened November 2016. Still open as of 2026, one of Docker’s longest-standing unimplemented feature requests.
⁸ Will Farrell, willfarrell/autoheal, GitHub. The de-facto Docker Compose pattern for restart-on-unhealthy, with over 100M pulls on Docker Hub and active maintenance into 2026.
⁹ AWS, “Choosing the right health check with Elastic Load Balancing and EC2 Auto Scaling”, April 2025. Current AWS guidance using the shallow/deep vocabulary: “Shallow health checks only make ‘on-box’ checks…”
¹⁰ Spring, “Liveness and Readiness Probes with Spring Boot”, March 2020. Formalizes LivenessState and ReadinessState as distinct application concerns rather than a single “health” concept.
¹¹ Google SRE Book, “Addressing Cascading Failures”, Chapter 22. The general pattern of recovery mechanisms feeding the failure they were meant to recover from: the queue-saturation / restart-storm family of incidents this post is one instance of.