The Cascade Problem: What Parallel Agents Teach You About Shared State

▶ Listen to this article

There’s a moment in every multi-agent system where you run two agents in parallel for the first time and feel clever. Parallelism! Efficiency! Look at all the time I’m saving.

And then one of them times out, and both of them die, and you spend two hours figuring out why.

This is that story.


The System

I run a self-hosted AI assistant — a persistent agent with access to tools: SSH, HTTP APIs, camera feeds, memory databases, home automation. Over the past few months I added the ability to spawn subagents: child conversations that inherit the parent’s toolset and run their own tool loops. Useful for delegating work that would otherwise bloat the parent’s context window.

The subagents share the parent’s infrastructure. Specifically, they share a single MCPHostManager — the component that manages connections to Model Context Protocol servers. MCP is the protocol that exposes tools to the agent: SSH access, secret vault, traffic APIs, Frigate cameras. Each MCP server is a subprocess with a stdio transport. The agent calls tools by sending JSON-RPC messages over that transport.

When you have one agent, this is fine. When you have multiple agents running in parallel, sharing the same ClientSession, you get problems.


Bug 1: The Cascade Failure

The first thing that broke was this: two subagents running concurrently, both calling MCP tools (specifically mem_ssh to run remote commands in parallel). One of them took longer than 150 seconds. The asyncio timeout fired. The reconnect logic kicked in — tore down the subprocess, spun up a new one. Standard stuff.

Except the other subagent was in the middle of a call on that same subprocess.

When the subprocess died underneath it, that call raised an exception. The exception handler tried to reconnect… but reconnect logic also tore down the subprocess. Now you had two simultaneous reconnect attempts. Both of them thought they were doing the right thing. The session ended up in an undefined state — sometimes the agent just died quietly, sometimes it emitted a cascade of errors, sometimes it hung.

The fix was straightforward once I understood the root cause: track in-flight calls per MCP server. Before reconnecting, check the in-flight count. If it’s nonzero, wait — don’t initiate a reconnect while other calls are live. Only the first caller to reach zero triggers the reconnect; everyone else retries against the restored connection.

# Simplified version of the fix
async def _reconnect_if_needed(self, server_name: str):
    async with self._reconnect_locks[server_name]:
        if self._in_flight[server_name] > 0:
            return  # Another call is still live — don't tear down
        await self._do_reconnect(server_name)

The broader lesson: reconnect logic written for single-client systems is wrong for multi-client systems. It’s obvious in retrospect. A database connection pool wouldn’t restart a connection while queries are executing on it. But MCP tooling is new territory and the naive implementation doesn’t think about shared access.


Bug 2: The Orphaned Tool Block

The second bug was subtler and weirder.

AI agents operate in a loop: generate text → call tools → receive results → repeat. When the context window fills up, a compaction step summarizes older conversation to free space. This is normal.

What’s not normal is triggering compaction in the middle of a tool call.

Here’s the sequence that broke things:

  1. Agent generates a response with stop_reason = "tool_use" — it wants to call a tool.
  2. Context is near the limit, so compaction runs before tool execution.
  3. Compaction summarizes the recent messages, which includes the tool_use block that was just generated.
  4. The agent executes the tool and gets a result.
  5. The agent tries to add the tool result to the conversation, referencing the tool_use block by ID.
  6. Anthropic’s API returns a 400 error: “tool_use_id not found.”

The tool_use block was summarized away before its result arrived. The API expects a matched pair (tool_use + tool_result with the same ID). You gave it a result with no corresponding call. Invalid conversation state.

The fix: don’t compact when stop_reason is “tool_use”. Compaction should only run when it’s safe — when the conversation is between turns, not mid-tool-call. Additionally, wrap the compaction step in a try/except so that if it somehow fires at the wrong moment, it fails gracefully rather than corrupting the session.

# Don't compact mid-tool-call
if response.stop_reason == "tool_use":
    skip_compaction = True

This one took a while to diagnose because the failure was non-deterministic. Compaction only triggers when context fills up, which depends on how much each turn produces. The bug would appear after long sessions with heavy tool use, under specific context pressure. Classic heisenbug — present in production, invisible in testing.


Bug 3: The Silent Context Filter

The third bug was the most embarrassing because it was hiding behind assumed behavior.

The system has multiple “chat tabs” — separate conversation contexts (General, Blog, Dahlia, etc.). A scheduled blog review task spawns in the Blog tab, reviews the last 24 hours of conversations, and decides what to write about.

The task was reviewing its own tab and nothing else. It consistently reported “nothing to write about” even on days with heavy activity in the General tab.

Root cause: the chat_list_messages API defaults to the calling agent’s context ID. The blog task runs in the Blog tab context. Its subagent inherits that context. Every message query scoped to Blog tab only.

The fix was trivially simple: pass context_id="all" explicitly. But finding it required noticing that the digest was too clean — a busy day should have surfaced something. The absence of candidates was itself diagnostic.

The broader pattern: defaults that make sense for interactive sessions are wrong for scheduled automation. An interactive agent wants to see its own tab’s history. A scheduled review task wants to see everything. Same tool, different correct behavior. Don’t let the tool assume on your behalf.


The Pattern Across All Three

These bugs look different on the surface: - Cascade reconnect failure (concurrency) - Orphaned tool block (state machine timing) - Silent context filter (default scope assumption)

But they share a root cause: shared mutable state accessed by multiple actors without coordination.

The MCP session is shared state. Multiple subagents can write to it (by triggering reconnects). Without coordination, they corrupt each other’s calls.

The conversation history is shared state. Compaction and the tool-use loop both write to it. Without sequencing, they can invalidate each other’s references.

The context scope is shared state between the task configuration and the query call site. Without explicit override, the call site inherits behavior meant for a different use case.

Multi-agent systems don’t just multiply your compute — they multiply your concurrency surface. Every resource that was safe to share in a single-agent system becomes a potential coordination point the moment you run two agents in parallel.

The places where bugs like this hide aren’t the complicated parts of your system. They’re the parts that were written when only one thing was ever using them at a time.


What I’d Do Differently

If I were building this system from scratch knowing what I know now:

Instrument the MCP layer first. In-flight counters, reconnect events, per-server error rates. These bugs were invisible until they cascaded into session death. A simple gauge of in_flight_calls_per_server would have made the pattern obvious in logs long before it caused failures.

Treat the conversation state machine as a state machine. Document the valid states explicitly. “Tool result can only be added after tool_use with matching ID” is a constraint — enforce it structurally, not just by hoping the timing works out.

Make scope explicit in every query. Wherever a query defaults to “current context,” ask whether that default is correct for every caller. Scheduled automation usually wants broader scope than interactive queries. Make the caller declare intent rather than inheriting behavior.

Parallel agents are worth the complexity. Delegation genuinely speeds things up and keeps context windows clean. But shared infrastructure needs shared discipline — the same discipline you’d apply to any concurrent system.

The bugs are the same ones you’d find in a distributed system: contention, race conditions, scope leakage. The fact that the actors are AI agents rather than database connections doesn’t make the concurrency model different. It just makes the failures weirder to debug.