The Error Lives One Layer Up
Your monitoring dashboard is showing 245 errors in the last 24 hours. The errors come from the integration layer that talks to your backend services. The natural response is to investigate the integration layer: maybe it’s making too many requests, maybe it needs retry tuning, maybe there’s a rate limit somewhere that’s being exceeded.
That response is wrong.
Not because retry tuning never helps — it does — but because in this particular case, two backend components are completely dead. The integration layer isn’t misbehaving. It’s faithfully reporting that the things it depends on have stopped responding. Every “error” in the log is a correct report of a correct failure. The integration layer is doing exactly what it should do when a dependency dies.
Fixing the retry policy would do nothing. The errors would continue because the backends are still dead.
Where Errors Live vs. Where They Originate
In a multi-component system — any system where component A calls component B calls component C — errors tend to surface at the layer above the failure point.
When component C stops responding, component B logs an error on the call that failed. Component B then returns an error to component A. Component A logs an error on the call that failed. Both errors end up in your monitoring, but neither error is in component C’s logs — because component C has stopped logging entirely.
The operator who sees the most errors is the one farthest from the actual failure. The operator watching component C’s metrics would immediately see that it’s dead — but they don’t know to look, because the alert fired in component A.
This is the fundamental problem with alert-first debugging in layered systems: the alert that fires marks where the impact surfaced, not where the cause lives. The alert tells you which component noticed the failure. It doesn’t tell you which component caused it.
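The propagation described above can be sketched in a few lines. This is a toy model, not a real service: the function names, the `error_log` list, and the `DependencyError` class are all assumptions made for illustration. A dead component C is modeled as a call that fails without C writing anything to the log.

```python
from collections import Counter

error_log = []  # (component, message) pairs, standing in for a monitoring backend


class DependencyError(Exception):
    """Raised by a component when a call to its dependency fails."""


def component_c(alive):
    # A dead process cannot log. The caller just sees the connection fail.
    if not alive:
        raise ConnectionError("no response from C")
    return "ok"


def component_b(c_alive):
    try:
        return component_c(c_alive)
    except ConnectionError as exc:
        # B logs the failed call, then reports the failure to its own caller.
        error_log.append(("B", f"call to C failed: {exc}"))
        raise DependencyError("B could not complete its call to C") from exc


def component_a(c_alive):
    try:
        return component_b(c_alive)
    except DependencyError as exc:
        # A logs the failed call to B; it has no direct view of C.
        error_log.append(("A", f"call to B failed: {exc}"))
        return "error"


for _ in range(3):
    component_a(c_alive=False)

errors_by_component = Counter(component for component, _ in error_log)
print(errors_by_component)
```

Every entry in the log belongs to A or B. C, the only component that is actually broken, contributes none.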
The Two-Phase Reveal
What makes this pattern particularly tricky is that fixing the most visible cause doesn’t fix everything. It just peels back a layer.
In the 245-errors-per-day case, the two dead backends were responsible for about 130 of those errors, primarily through retry amplification. When a backend is dead, every request is retried some number of times before giving up, and each attempt logs its own error. At five logged attempts per failed request, 26 underlying failures become 130 logged errors. Removing the dead backends drops the error count to roughly 115, which is still high.
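The arithmetic in this case can be checked with a short sketch. It assumes, as the figures above imply, that each underlying failure produces five logged errors; the function and variable names are illustrative only.

```python
def logged_errors(underlying_failures, logged_attempts_per_failure):
    """Errors that reach the log when every attempt records its own failure."""
    return underlying_failures * logged_attempts_per_failure


total_errors = 245
from_dead_backends = logged_errors(underlying_failures=26, logged_attempts_per_failure=5)
remaining_errors = total_errors - from_dead_backends

print(from_dead_backends, remaining_errors)  # 130 115
```

The point of separating the two inputs is that the dashboard only ever shows their product. A count of 130 could equally be 130 distinct failures or 26 failures amplified fivefold, and the two call for different responses.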
The remaining 115 errors reveal something new: a routing misconfiguration that was always there but hidden by the noise from the dead backends. Requests that should route to working backends are hitting a misconfigured path and failing. Fixing the routing drops the count further.
You couldn’t see the routing problem clearly until the dead-backend noise was gone. The loud failure was masking the quieter structural one.
This is the two-phase reveal: fix the most obvious upstream cause, and you uncover the next cause that was previously hidden by it. Systems rarely have a single root cause; they have a hierarchy of causes that reveal themselves as you work upstream.
Where This Pattern Shows Up
Web tier and database: Your API endpoint is logging high latency. The obvious hypothesis is a slow query. The actual cause is connection pool exhaustion — the database is fine, but every new connection attempt is queuing behind hundreds of others that are waiting for a transaction lock to clear. The query isn’t slow; the queue is deep.
Container orchestration: A Kubernetes pod is restarting in a loop. The pod logs show it’s crashing on startup. The actual cause is the OOM killer terminating the container before it fully starts. The restart loop is correct behavior in response to the memory constraint, not the root problem.
Distributed service mesh: Service A is returning errors to its clients. Service A logs show upstream timeouts from service B. Service B is healthy; it’s timing out because service C — which service B calls — has a network partition from a recent firewall rule change. The timeout propagated two hops before it became visible.
In each of these, the operator sees the error at the visible surface. The cause is somewhere else in the chain.
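For the web tier example, one way to tell a slow query from a deep queue is to record when a connection was requested, when it was acquired, and when the query finished. The sketch below assumes those three timestamps are available; the `RequestTiming` class, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    requested_at: float  # handler asked the pool for a connection (seconds)
    acquired_at: float   # pool handed a connection over
    finished_at: float   # query returned

    @property
    def pool_wait(self):
        return self.acquired_at - self.requested_at

    @property
    def query_time(self):
        return self.finished_at - self.acquired_at

    def dominant_cost(self):
        # Name whichever phase accounts for more of the total latency.
        return "pool_wait" if self.pool_wait > self.query_time else "query_time"


# Illustrative numbers: 2.5 s spent queuing for a connection, 0.25 s in the query.
timing = RequestTiming(requested_at=0.0, acquired_at=2.5, finished_at=2.75)
print(timing.pool_wait, timing.query_time, timing.dominant_cost())
```

A single end-to-end latency figure would report 2.75 seconds and invite query tuning. Splitting it shows that the query itself accounts for a small share of the total.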
The Diagnostic Heuristic
Before optimizing the component that’s logging errors, ask: What was this component trying to do when it failed?
The answer to that question almost always points upstream. “The integration layer was trying to query backend X when it logged this error” → go check backend X. “The web server was trying to open a database connection when it returned this 500” → go check the connection pool, not the web server code.
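If error records carry the name of the dependency being called, the heuristic can be applied mechanically: group errors by what the component was trying to reach rather than by which component logged them. The record layout, field names, and backend names below are assumptions for illustration, not a real log schema.

```python
from collections import Counter

# Hypothetical structured log records. "component" logged the error;
# "target" is the dependency it was calling when the error occurred.
records = [
    {"component": "integration", "target": "backend-x", "error": "timeout"},
    {"component": "integration", "target": "backend-x", "error": "timeout"},
    {"component": "integration", "target": "backend-y", "error": "connection refused"},
    {"component": "integration", "target": "backend-x", "error": "timeout"},
    {"component": "integration", "target": "backend-y", "error": "connection refused"},
    {"component": "web", "target": "db-pool", "error": "acquire timeout"},
]


def errors_by_target(records):
    """Count errors by the dependency being called, not by the logger."""
    return Counter(record["target"] for record in records)


by_logger = Counter(record["component"] for record in records)
suspects = errors_by_target(records).most_common()

print(by_logger.most_common(1))  # where the errors were logged
print(suspects)                  # where to look first
```

Grouped by logger, the integration layer looks like the problem. Grouped by target, the same records point at two specific backends.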
This sounds obvious stated plainly. In practice, it’s easy to skip — especially when the failing component is owned by your team and the upstream component is someone else’s. The error is in your code; the ownership boundary creates pressure to investigate your code first.
Distributed tracing tools exist partly to make this easier. OpenTelemetry traces correlate spans across service boundaries, so you can follow a failed request from the component that logged the error back through every upstream call that contributed to it.[1] The trace shows you the full causal chain, not just where the chain terminated with an error. Without distributed tracing, you have to correlate log timestamps and request identifiers manually, which is possible but slow.
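The idea of following a trace back to its origin can be sketched with plain dictionaries. This is not the OpenTelemetry SDK or its data model; the span fields (`id`, `parent`, `service`, `error`) and the service names are simplified assumptions. The sketch also follows only the first failed child at each level, which is enough for a single failing call chain.

```python
# Hypothetical, simplified span records for one trace.
spans = [
    {"id": "1", "parent": None, "service": "service-a", "error": True},
    {"id": "2", "parent": "1", "service": "service-b", "error": True},
    {"id": "3", "parent": "2", "service": "service-c", "error": True},
    {"id": "4", "parent": "1", "service": "cache", "error": False},
]


def deepest_failed_span(spans):
    """Walk from the failed root span down to the last failed span in the chain."""
    children = {}
    for span in spans:
        children.setdefault(span["parent"], []).append(span)

    def descend(span):
        failed_children = [c for c in children.get(span["id"], []) if c["error"]]
        if not failed_children:
            return span  # no failed call beneath this span: the chain ends here
        return descend(failed_children[0])

    failed_roots = [s for s in children.get(None, []) if s["error"]]
    return descend(failed_roots[0]) if failed_roots else None


origin = deepest_failed_span(spans)
print(origin["service"])  # service-c
```

The alert would have fired on service-a, the root span. The walk ends at service-c, two hops away, which is where the investigation should begin.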
The Metric Is a Direction, Not a Destination
The error count in your monitoring is telling you where to start looking, not what to fix.
When the count is high, the immediate question isn’t “how do I reduce this number” — it’s “what is producing this number, and why?” Reducing the number by adding retry suppression or error filtering is almost always treating the symptom. The underlying failure continues; you’ve just made it less visible.
The right optimization target is the component that’s actually broken, not the component that noticed it was broken. Finding that component requires following the causal chain upstream, through however many layers separate the visible error from its origin.
Two dead backends and a routing misconfiguration look exactly like a rate limit problem from the dashboard. They look like completely different problems from the vantage point of the components that stopped responding. The insight is that both perspectives are describing the same reality; one of them is just much more useful for diagnosis.
Start at the error. But follow it back.
[1] OpenTelemetry, “What is Distributed Tracing?”, OpenTelemetry Documentation. Distributed tracing enables visualization of request flows across service boundaries, correlating spans from multiple components into a single trace. This makes upstream failures visible even when only downstream components generate alerts.