# The Handoff Is Where Multi-Agent Systems Break

*Essay — 2026-01-30 — by Mahmoud Zalt*


**The thing I keep noticing.** When a multi-agent system goes wrong, the first instinct is to look at the agents. Read the prompt, check the tools, tweak the model. After watching hundreds of these chains play out, I have come to a different conclusion. The agents are almost never the problem. The problem is the space between them: the moment one finishes and the next begins. That tiny gap is where the work falls on the floor.

## A small story that keeps repeating

Here is the scene I have watched dozens of times. A user asks the team lead for a post about onboarding. The team lead routes it to the marketer. The marketer decides she needs a sharper hook and pings the writer. The writer comes back with something fluent, well-formed, and completely off-key. Wrong tone, wrong audience, wrong half of the brief.

The first time it happened, I assumed the writer was weak, so I swapped the model. Same result. Then I assumed the marketer was a bad router, so I rewrote her prompt. Same result. Eventually I went back and read the actual message the writer received. The writer had been handed three sentences. The team lead had been working with three paragraphs. Somewhere between them, two-thirds of the context had quietly evaporated. The agents were not failing. The handshake was.

## Why context erodes at the boundary

The reason this is so hard to see is that handoffs feel free. The team lead has already done the work of understanding the request, so passing it on looks like a one-line action. But the team lead is carrying a small cloud of things that never appear in the message. The user is a solo founder, not a marketing team. The previous attempts flopped. The tone has to be quieter than usual. All of that lives in working memory and none of it survives the handoff unless it is written down.

Models are not telepaths. When an agent receives a short brief, it fills the gap with priors. Generic priors. The kind that produce confident, fluent, generic work. The output looks fine in isolation. It only feels wrong once the team lead reads it and thinks, that is not what I meant. With five agents in a chain, the drift is small at each step and enormous by the end.
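To make the compounding concrete, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every handoff preserves the same fixed share of the sender's context; real drift is not this tidy, and `surviving_context` is a made-up helper, not a measurement.

```python
def surviving_context(retained_per_handoff: float, agents: int) -> float:
    """Share of the original context left at the end of a chain.

    Assumes (hypothetically) that each handoff keeps the same fraction.
    A chain of N agents has N - 1 handoffs.
    """
    handoffs = max(agents - 1, 0)
    return retained_per_handoff ** handoffs


if __name__ == "__main__":
    for kept in (0.9, 0.8, 0.7):
        left = surviving_context(kept, agents=5)
        print(f"{kept:.0%} kept per handoff, {left:.0%} left after 5 agents")
```

Under that assumption, a chain that keeps 80% at each step delivers roughly 41% of the original context to the fifth agent. No single handoff looks alarming, which is why the loss goes unnoticed.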

## Briefs, not messages

The first change that actually helped was the most boring one. Stop letting agents talk to each other in free text. Make every handoff a structured brief with named fields. Goal. Audience. Tone. Inputs the receiver should treat as truth. What good looks like. What the sender already tried and ruled out. What matters is that the receiver opens the message and finds a brief instead of a sentence.
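As a rough illustration of what such a brief could look like in code, here is a minimal sketch. The class name, the field names, and the `missing_fields` and `render` helpers are my own labels for the fields listed above, not a fixed schema.

```python
from dataclasses import dataclass, field, fields


@dataclass(frozen=True)
class Brief:
    """A structured handoff. Field names are illustrative, not a standard."""
    goal: str
    audience: str
    tone: str
    inputs: list[str] = field(default_factory=list)     # facts the receiver treats as truth
    done_when: str = ""                                  # what good looks like
    ruled_out: list[str] = field(default_factory=list)  # what the sender already tried

    def missing_fields(self) -> list[str]:
        """Names of fields left empty, so a thin brief is caught before it is sent."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def render(self) -> str:
        """Render the brief as labelled lines for the receiving agent."""
        lines = []
        for f in fields(self):
            value = getattr(self, f.name)
            text = "; ".join(value) if isinstance(value, list) else value
            lines.append(f"{f.name.upper()}: {text or '(missing)'}")
        return "\n".join(lines)


if __name__ == "__main__":
    brief = Brief(
        goal="Write a hook for the onboarding post",
        audience="solo founders",
        tone="quiet",
    )
    print(brief.missing_fields())  # ['inputs', 'done_when', 'ruled_out']
    print(brief.render())
```

The useful part is less the data structure than the refusal it makes possible: a sender that has to fill in `ruled_out` cannot pretend the handoff is a one-line action.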

Briefs take more tokens and feel like ceremony. But the alternative is a writer who keeps producing fluent work nobody asked for, which is far more expensive in the end. Once briefs went in, the per-handoff failure rate dropped sharply, and the drop was largest where I expected it to be smallest: between agents that already knew each other well. Familiarity had been hiding the loss, not preventing it.

> Multi-agent reliability is not a property of the agents. It is a property of the protocol between them. If the protocol is sloppy, no model upgrade will save you.
> — From the team's internal notes

## Receipt of work, not just delivery

The second change was an obvious idea I had been resisting because it felt like bureaucracy. Add a receipt step. When the writer returns work to the marketer, the marketer is not allowed to immediately pass it up. She has to first acknowledge what she received, in her own words, against the original brief. If the brief asked for a quiet hook and the writer returned a punchy listicle, that mismatch surfaces at the receipt step, not three layers later in front of the user.
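A receipt step might be sketched like this. Everything here is hypothetical: the `Receipt` type, the function names, and especially the comparison, which uses plain string equality as a stand-in for the judgment a receiving agent would actually make.

```python
from dataclasses import dataclass


@dataclass
class Receipt:
    """The receiver's acknowledgement of returned work, checked against the brief."""
    restated: dict[str, str]  # what the receiver says it got, per brief field
    mismatches: list[str]     # brief fields where the work diverges from the ask

    @property
    def accepted(self) -> bool:
        return not self.mismatches


def write_receipt(brief: dict[str, str], observed: dict[str, str]) -> Receipt:
    """Compare what the brief asked for with what the receiver observed in the work.

    `observed` is the receiver's own description of the work. In a real system
    it would come from a model call; here it is passed in directly.
    """
    mismatches = [
        key for key, asked in brief.items()
        if observed.get(key, "").strip().lower() != asked.strip().lower()
    ]
    return Receipt(restated=dict(observed), mismatches=mismatches)


def pass_up(work: str, receipt: Receipt) -> str:
    """Forward work only if its receipt shows no mismatch."""
    if not receipt.accepted:
        raise ValueError(f"receipt rejected, mismatched fields: {receipt.mismatches}")
    return work


if __name__ == "__main__":
    brief = {"tone": "quiet", "format": "hook"}
    receipt = write_receipt(brief, {"tone": "punchy", "format": "listicle"})
    print(receipt.mismatches)  # ['tone', 'format']
```

The point of the sketch is the ordering: `pass_up` cannot be called without a receipt, so forwarding unread work stops being the default path.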

Receipts also force the receiver to actually read the work, which sounds laughable until you realize how often an intermediate agent simply forwards a result without checking it. The agents in the middle hallucinate competence on each other's behalf, assuming the previous step did its job because the output looks polished. A receipt step interrupts that assumption.

## Shared journals beat private memories

The third change is still evolving. Multi-agent systems work much better when the agents share a journal any of them can read, instead of relying on private memory plus messages. When the marketer can look back at what the team lead wrote a week ago, the message between them gets to be short again. The brief is no longer carrying the entire context, because the context already lives somewhere the receiver can fetch it from.

There is a tradeoff I keep wrestling with. A shared journal is also a place where wrong information spreads quickly. One agent writes a sloppy summary, and every other agent now reasons from it as if it were truth. So we have rules around what counts as journal-worthy and how to flag uncertainty. Even an imperfect shared journal beats each agent guessing at the others' state from the message it just received.
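One way to picture a journal that flags uncertainty is an append-only log where every entry carries a `verified` flag. This is a minimal sketch under my own assumptions; the names and the single boolean flag are illustrative, and real rules about what counts as journal-worthy would be richer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Entry:
    author: str
    topic: str
    text: str
    verified: bool        # False marks an unconfirmed summary or guess
    written_at: datetime


@dataclass
class Journal:
    """An append-only log that every agent can read."""
    entries: list[Entry] = field(default_factory=list)

    def write(self, author: str, topic: str, text: str, verified: bool = False) -> Entry:
        """Append an entry. Entries are unverified unless the author says otherwise."""
        entry = Entry(author, topic, text, verified, datetime.now(timezone.utc))
        self.entries.append(entry)
        return entry

    def read(self, topic: str, verified_only: bool = False) -> list[Entry]:
        """Entries on a topic, oldest first; optionally skip unverified ones."""
        return [
            e for e in self.entries
            if e.topic == topic and (e.verified or not verified_only)
        ]


if __name__ == "__main__":
    journal = Journal()
    journal.write("team_lead", "onboarding-post", "User is a solo founder", verified=True)
    journal.write("marketer", "onboarding-post", "Earlier drafts were probably too loud")
    for entry in journal.read("onboarding-post", verified_only=True):
        print(entry.author, "-", entry.text)
```

Defaulting `verified` to `False` is a deliberate choice in this sketch: a sloppy summary can still be written, but a reader who wants only confirmed facts can filter it out.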

## The team lead as a router, not a doer

The last shift goes against the instinct of building smarter agents. The team lead should not be the smartest worker. She should be the strictest router. Her job is to write briefs, interrupt handoffs that look thin, demand receipts, keep the journal honest. The moment she starts doing the work herself, the team is effectively one agent again, just slower.

What I look for now is not how clever each individual agent is. I look at how thick the connective tissue is. How structured the briefs are. How honest the receipts are. How fresh the shared journal stays. If those are healthy, even average models behave like a competent team. If they are sloppy, even the best models will produce a beautiful, fluent, completely wrong output once a day.

## What this changed in how I build

I used to think of an AI workforce as a collection of capable individuals. I now think of it as a small organization, where the org chart matters more than any single hire. The interesting question is how the agents talk to each other when nobody is watching, and whether the next one in the chain can pick up the work cold. Most of the gains I have seen in the last year of running Sistava came from improving that connective tissue, not from upgrading the agents themselves.

If you are building a multi-agent system right now and something feels off, my honest advice is to stop tuning prompts for a week and instead read every message that crosses a boundary. Ask whether the receiver could possibly do good work with what they were just handed. Most of the time, the answer is no, and you can see it the moment you look. The agents are usually fine. It is the handshakes that need the work.
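If you want a starting point for that audit, here is a small sketch that logs each boundary crossing and flags the thin ones. It assumes you can capture both what the sender was working from and what the receiver was handed. The `Crossing` type and the word-count ratio are my own crude heuristic; they are a prompt for reading the messages, not a substitute for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crossing:
    sender: str
    receiver: str
    sender_context: str  # what the sender was working from
    message: str         # what the receiver was actually handed

    @property
    def retained(self) -> float:
        """Rough share of the sender's context that reached the receiver, by word count."""
        context_words = len(self.sender_context.split())
        if context_words == 0:
            return 1.0
        return min(1.0, len(self.message.split()) / context_words)


def thin_crossings(log: list[Crossing], threshold: float = 0.5) -> list[Crossing]:
    """Crossings where the message is much shorter than the context behind it."""
    return [c for c in log if c.retained < threshold]


if __name__ == "__main__":
    log = [
        Crossing(
            sender="marketer",
            receiver="writer",
            sender_context="Solo founder audience. Earlier attempts flopped. Keep the tone quiet.",
            message="Write a hook.",
        ),
    ]
    for c in thin_crossings(log):
        print(f"{c.sender} -> {c.receiver}: about {c.retained:.0%} of the context survived")
```

A low ratio does not prove the handoff was bad, and a high one does not prove it was good. It only tells you which messages to read first.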