The Metric Dropped. The Cat Was Fine.
The water-intake metric crossed a threshold and started lighting up the briefings. The cat in question — name’s Turing, after the obvious — was reading near-zero on the fountain sensor for a second day. The morning briefing flagged it. The afternoon briefing escalated it. By the third briefing the system was effectively asking whether to wake the vet.
The cat was fine.
What happened in the gap between “the metric is screaming” and “the cat is fine” is one of the cleaner cascading-failure narratives I’ve watched play out in real time, and it generalizes well beyond pet infrastructure. The drop wasn’t one failure. It was a single root cause expressing itself through three different channels at once, superimposed in the data into one big cliff that looked like one big problem.
The signal
The sensor tracks fountain visits and dispensed volume. Nothing else. The number went from “normal” to “almost zero” over about 72 hours. That shape is bad. Cats that suddenly stop drinking are a real emergency — kidneys, blockage, fever, a dozen things that go from “we should watch this” to “we should be at the clinic” in not very long.
So the alert fires. The hypothesis is the obvious one. Reach for the carrier, prep the vet.
But somebody (in this case, the admin) decided to check the fountain first. The fountain was dead. The pump wasn’t moving water. Mineral scaling had built up in the impeller housing over months until the impeller physically couldn’t turn. There’s a YouTube video that teaches you to disassemble the housing and descale it with vinegar; nobody mentioned this at purchase time. After a soak, reassembly, and a top-off, the fountain was running fresh again.
That’s the satisfying ending: equipment failed, equipment fixed, false alarm. Except it isn’t actually the ending, because the metric still doesn’t fully add up.
The dual-source test
Here’s the test that revealed what was really going on: the morning after the fix, set out both the cleaned fountain and a separate bowl of water. Watch what the cat does.
What the cat did: drank from the fountain at a normal volume (about a hundred milliliters by mid-morning), drank a smaller amount from the bowl, ran around being a cat, ate normally, vocalized normally. Fully fine.
That dual-source test disambiguates several hypotheses at once — though not all of them, and being honest about which ones is part of the lesson.
The first hypothesis it kills cleanly: that the cat was sick. If the original problem had been medical (cat off water, dehydration in progress), Turing would still be drinking little from either source post-repair. He wasn’t. He was drinking normal amounts from a clean fountain and topping up from a bowl. Medical hypothesis: dead.
The second hypothesis it kills cleanly: that the cat was simply drinking elsewhere all along (a sensor blind spot with no other problem). If that had been the whole story, restoring the fountain shouldn’t have changed the cat’s preferences. He’d have stayed on the alternate source. He didn’t — he went back to the fountain at normal volumes, which means the fountain was his preferred source when it worked.
The third hypothesis is the interesting one, because the data doesn’t fully resolve it. The pre-cliff decline phase — the gradual drop in dispensed volume in the days before the pump fully seized — could be either gradual equipment degradation (the impeller producing less flow as scaling progressed) or gradual behavioral change (the cat drinking less as water flavor degraded), or some mix. The dispensed-volume sensor can’t tell those apart. Both are consistent with the trend, and the dual-source test was run post-repair, so it can only tell us how the cat behaves now, not how he was behaving before. What we know is that cats are documented to be sensitive to water palatability and that the same root cause (mineral buildup) was plausibly affecting both flow and flavor simultaneously. The most defensible read is: probably both, in some split we can’t extract from the data we have.
That’s its own kind of finding — “compatible with multiple causes, can’t disambiguate from available data” is a legitimate diagnostic verdict, and treating it as one is healthier than picking the more dramatic explanation just because the data tolerates it.
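The elimination logic above can be written down as a small table of predictions. This is a minimal sketch, not a real tool: the hypothesis names and the coarse "normal" / "small" / "low" labels are my own illustrative encoding of the observations described here. A hypothesis that makes no prediction about post-repair behavior is reported as unresolved rather than forced into a verdict.

```python
# Illustrative encoding of the dual-source test. Labels and hypothesis names
# are assumptions made for this sketch, not output from any real system.

# What was seen the morning after the repair.
observed = {"fountain_intake": "normal", "bowl_intake": "small"}

# Predicted post-repair behavior under each hypothesis. None means the
# hypothesis is about the pre-repair period only, so this test cannot
# confirm or refute it.
predictions = {
    "cat is sick": {"fountain_intake": "low", "bowl_intake": "low"},
    "cat was drinking elsewhere all along": {
        "fountain_intake": "low",
        "bowl_intake": "normal",
    },
    "pre-cliff decline was equipment-side": None,
    "pre-cliff decline was behavior-side": None,
}


def verdict(prediction, observation):
    """Classify one hypothesis against what was actually observed."""
    if prediction is None:
        return "unresolved"
    return "consistent" if prediction == observation else "ruled out"


results = {name: verdict(p, observed) for name, p in predictions.items()}

for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

The useful part is the third outcome. Two hypotheses get ruled out; the two about the pre-cliff decline stay unresolved, because a test run after the repair has nothing to say about them.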
Three channels, one root cause
The story that best fits the data:
Channel one: gradual degradation in the dispenser. Mineral scaling progressed in the impeller housing over months. As it did, two things plausibly happened in parallel through the same physical mechanism. The pump’s effective flow rate dropped (less water moved per session), and the water’s flavor likely degraded too, since the cat was tasting whatever was leaching from the scaled surfaces. Both effects would push the dispensed-volume metric in the same direction (downward), and the metric can’t tell you which one is dominant. This is the slow-decline phase of the curve.
Channel two: equipment failure. The same scaling that produced the gradual decline eventually jammed the impeller entirely. The fountain stopped dispensing water at all. The metric correctly went to zero, but now for a discontinuous reason on top of the continuous one. This is the cliff.
Channel three: sensor scope. When the fountain stopped, supplemental bowl water got set out. The sensor doesn’t track bowls. So even though hydration continued, the sensor saw zero. The cat was hydrating fine; the instrument couldn’t see it. This is why the metric stayed at zero after the cliff instead of recovering as the cat adapted.
All three share a common root cause: mineral buildup in the dispenser hardware. The same physical phenomenon produced a slow decline signal, a sudden equipment failure, and an instrument blind spot — all of which superimposed into “the metric is dropping to zero and won’t come back.”
If you try to explain this with any single failure mode, nothing fits cleanly. The decline-then-cliff shape is hard to explain as one continuous process. The continued zero after the bowl went out can’t be dehydration, because the cat was visibly fine once the fountain was repaired. Each individual hypothesis explains part of the data and gets the rest wrong.
Three channels, one root cause, three different shapes in the data, all rendered as one ugly downward curve.
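To see how the three channels stack into one curve, here is a toy simulation. Every number in it is made up for illustration (the baseline intake, the 5% per day decline, the day the pump seizes), and it assumes the cat had no other water source before the bowl went out, which the real data can't confirm.

```python
# Toy model of the three channels. All constants are illustrative
# assumptions, not measurements from the actual fountain.
DAYS = 14
SEIZE_DAY = 10        # assumed day the impeller jams completely
BASELINE_ML = 200.0   # assumed normal daily intake
DECLINE_PER_DAY = 0.05  # assumed fractional loss per day from scaling


def fountain_ml(day):
    """Volume the fountain dispenses, which is all the sensor records."""
    if day >= SEIZE_DAY:
        return 0.0  # channel two: pump seized, the cliff
    # Channel one: gradual loss, whether from reduced flow or reduced appetite.
    return BASELINE_ML * (1.0 - DECLINE_PER_DAY * day)


def bowl_ml(day):
    """Volume taken from the supplemental bowl, invisible to the sensor."""
    # Channel three: the bowl appears once the fountain is dead.
    return BASELINE_ML if day >= SEIZE_DAY else 0.0


sensor = [fountain_ml(d) for d in range(DAYS)]
actual = [fountain_ml(d) + bowl_ml(d) for d in range(DAYS)]

for day in range(DAYS):
    print(f"day {day:2d}  sensor={sensor[day]:6.1f}  actual={actual[day]:6.1f}")
```

The `sensor` series slides down, drops to zero, and stays there. The `actual` series dips and then recovers the moment the bowl appears. Only the first one was on the dashboard.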
The pattern, generalized
This shape shows up in production systems constantly, and it’s one of the harder things to diagnose under pressure.
You see a metric crash. The instinct is to find the explanation. But “the metric crashed” can be a superposition of distinct failures that share a root cause and present through different channels. Each channel has its own latency, its own shape, its own correlation with the others. When you stack them, the result looks like one big problem with one big explanation.
A few examples from systems I’ve actually watched fail this way.
A SaaS platform’s error rate climbs over a week, then spikes overnight, then plateaus. Root cause: a database connection pool sized for normal load. The week of climb was real — slow queries from a new feature consuming connections, causing intermittent failures that retries masked. The overnight spike was the connection pool fully exhausting under cron-job load. The plateau was the application’s circuit breaker kicking in and rejecting traffic at the edge so the alerting metric stopped getting fed bad data. Three different failure presentations, one resource exhaustion problem, all blending in the dashboard.
A storage system reports increasing read latency, then sudden write failures, then “everything is fine” after a restart that shouldn’t have helped. Root cause: a failing disk in a RAID array. Latency climbed as the controller worked harder to read past bad sectors. Writes started failing once the array’s health dropped far enough that the controller fell back to its degraded-mode write policy. The restart “fixed it” because the controller marked the disk failed during boot and the array switched to operating without it: same hardware, different mode, real underlying problem masked by the new equilibrium.
A queue’s depth grows, consumers slow down, throughput collapses. Root cause: noisy-neighbor CPU steal on the consumer hosts from an unrelated workload. The queue depth was a symptom of slow consumption. The consumer slowdown was the noisy neighbor. The throughput collapse was downstream backpressure from the slowdown. None of them is “the bug” — they’re three projections of the same underlying contention.
In every case, asking “what’s wrong with the queue?” or “what’s wrong with the database?” or “what’s wrong with the cat?” gets you a wrong answer that explains part of the data and ignores the rest.
The diagnostic discipline
The discipline that actually works is boring and the same every time:
Suspect the measurement before you suspect the world. Sensors fail. Dashboards lie by omission. When a metric does something dramatic, the first question isn’t “what changed in the system?” — it’s “do I trust the measurement of what changed?” Check the sensor. Check the wire. Check whether the thing you think you’re measuring is actually what the instrument captures.
Suspect common causes before you suspect coincidences. When two seemingly unrelated metrics move at the same time, you usually have one cause with two presentations, not two simultaneous failures. The shared root is usually upstream of both metrics in the dependency graph. Find that point. Look for things that touch both.
Run a dual-source test when you can. The single most useful diagnostic move in the fountain story was setting out a second water source. With two sources, the cat’s behavior could disambiguate hypotheses that a single source couldn’t. In production: dual-path traffic, blue/green deployments with both active, mirrored reads against two backends. Anything that lets you compare a known-good path to the suspect one without committing to a fix is gold.
Don’t stop at “fixed.” The fountain came back online and the metric went back to normal, but the data still had a story that wasn’t fully explained — specifically, the gradual pre-cliff decline. Following that residual confusion is what surfaced the question of whether the decline was equipment-side, behavior-side, or both, and forced an honest answer (“probably both, can’t fully separate them from this data”) rather than a tidy false certainty. The temptation after a restoration is to mark the incident closed. The lesson is in the part you don’t fully understand yet, including the parts where the lesson is “you can’t fully know.”
Find the common cause, not just the proximate one. “Pump seized” is the proximate cause of the cliff. “Mineral buildup” is the root cause. Restoring the pump fixes the cliff. Descaling the housing on a schedule fixes the next cliff before it happens. Most postmortems stop at the proximate cause because it’s where the visible damage was; the better ones keep digging until the explanation accounts for everything in the data, including the parts that don’t look like the main event.
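The "look upstream of both metrics" step is mechanical enough to sketch. The graph below is entirely hypothetical (the node names are invented for illustration), and in practice the hard part is having an accurate dependency map at all. Given one, the candidates for a common cause are whatever both misbehaving metrics transitively depend on.

```python
# Hypothetical dependency graph: each node maps to what it depends on.
# Node names are invented for this sketch.
DEPENDS_ON = {
    "error_rate": ["api"],
    "queue_depth": ["consumers"],
    "api": ["db_pool"],
    "consumers": ["db_pool", "host_cpu"],
    "db_pool": ["database"],
    "host_cpu": [],
    "database": [],
}


def upstream(node, graph):
    """Return every transitive dependency of node, excluding node itself."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def shared_upstream(a, b, graph):
    """Dependencies that sit upstream of both a and b."""
    return upstream(a, graph) & upstream(b, graph)


suspects = shared_upstream("error_rate", "queue_depth", DEPENDS_ON)
print(sorted(suspects))
```

With this made-up graph, the two metrics share `db_pool` and `database`, so those get checked before anyone theorizes about the API or the consumers individually. It narrows the search; it doesn't prove which shared node is at fault.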
What the metric was actually telling me
The metric wasn’t lying. The metric was screaming “the dispenser system is failing” — and it was right, in three different ways at once.
What was wrong wasn’t the cat. What was wrong was the infrastructure between the cat and the cat’s hydration: equipment degrading in a way that affected taste, then degrading in a way that affected delivery, observed through an instrument that couldn’t see around the equipment. Once you fix the dispenser, the metric goes back to normal because the underlying truth (the cat is healthy and drinks normally when the water is good) was never the problem.
This is the part that translates straight into production systems and is easy to forget under alert pressure: when a metric crashes, the metric is almost always honest about something. The skill is figuring out what it’s honest about. The failure mode is jumping to the most alarming interpretation — the cat is sick, the database is corrupted, the customers are leaving — when the data is actually telling you something quieter and more upstream.
The cat was fine. The infrastructure wasn’t. Both statements are true. The metric saw the second one and we tried to read it as the first.
Worth descaling the fountain monthly, it turns out. Worth descaling your incident-response intuitions about as often.