Debugging AI Agents Means Reading the Trace, Not the Model Output

Essay — 2026-03-20 — by Mahmoud Zalt

The first agent I debugged the wrong way

It was an early sales agent on Sistava. A user asked for a list of recent leads and the agent confidently invented two names that did not exist in the database. My first reaction was that the model was hallucinating, so I rewrote the prompt to tell it sternly never to make up data. The next day it happened again. I was annoyed at the model when I should have been annoyed at myself.

When I finally opened the trace, the picture was different. The lookup tool had timed out. The retry policy had swallowed the failure and returned an empty result. The agent had been handed a blank list and an instruction to answer, and it filled the blank the way models do. It was a tool problem dressed up as a model problem. I had spent two days editing prompts when the real fix lived in thirty lines of timeout handling. The output is a footprint. It tells you something happened, not what.
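The failure mode above can be sketched in a few lines. This is an illustration under assumptions, not Sistava's actual code: `ToolResult`, `call_with_retry`, and `flaky_lookup` are hypothetical names. The idea is that a retry wrapper should return something that keeps "the tool failed" distinct from "the tool found nothing".

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolResult:
    """Outcome of a tool call; keeps 'failed' distinct from 'empty'."""
    ok: bool
    value: Any = None
    error: Optional[str] = None
    attempts: int = 0


def call_with_retry(tool: Callable[[], Any], max_attempts: int = 3) -> ToolResult:
    """Retry a tool, but surface the final failure instead of hiding it."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return ToolResult(ok=True, value=tool(), attempts=attempt)
        except TimeoutError as exc:
            last_error = f"timeout: {exc}"
    # A swallowing policy would `return []` here, which is
    # indistinguishable from a successful lookup that found no leads.
    return ToolResult(ok=False, error=last_error, attempts=max_attempts)


def flaky_lookup():
    # Hypothetical stand-in for a lead lookup that always times out.
    raise TimeoutError("lead lookup exceeded deadline")


result = call_with_retry(flaky_lookup)
```

With a result like this, the agent can be told "the lookup failed" rather than being handed an empty list and an instruction to answer.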

Why the output lies to you

A model is a function. You give it a context window, it returns tokens. Given the exact context it received, the answer is usually reasonable. The cases where the model itself is the failure point are rarer than they look. Far more often the context that reached it was missing something obvious, stale, or assembled in a different order than I assumed.

If you only read the output, every bug looks like a model bug. The model is the last actor in the chain, the most visible, the easiest to blame. You will keep editing prompts forever and never find the actual issue. The prompt feels like a steering wheel and is closer to a windshield: it shows where you are pointed, but the car is being driven by everything upstream.

What I actually look at now

When an agent on Sistava does something I did not expect, I open the trace before the output. I want the order things happened in. Which tools were called. What each returned, including the failures the retry policy may have hidden. Which memory blocks were loaded. Which skill the dispatcher picked. The full prompt that was assembled, not the template I wrote a month ago.
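To make that list concrete, here is a rough sketch of what a trace could look like as data. The `TraceEvent` shape and the sample events are assumptions made up for illustration; a real tracing system will have its own schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceEvent:
    step: int
    kind: str                     # e.g. "tool_call", "memory_load", "dispatch", "prompt"
    name: str
    detail: str
    error: Optional[str] = None   # failures a retry policy hid are recorded here


def render_timeline(events):
    """Return one line per event, in the order the events happened."""
    lines = []
    for ev in sorted(events, key=lambda e: e.step):
        flag = f"  !! {ev.error}" if ev.error else ""
        lines.append(f"{ev.step:02d} {ev.kind:<12} {ev.name:<14} {ev.detail}{flag}")
    return lines


# Hypothetical events, deliberately out of order to show the sort.
events = [
    TraceEvent(2, "tool_call", "lead_lookup", "returned 0 rows",
               error="timeout after 3 retries"),
    TraceEvent(1, "dispatch", "sales_skill", "picked for 'recent leads'"),
    TraceEvent(3, "prompt", "final", "assembled 1 of 2 blocks"),
]
timeline = render_timeline(events)
```

Printing `timeline` gives the ordered view I want before I read the output: what ran, what it returned, and which steps carried a hidden error.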

Most of the time the answer is visible within thirty seconds. A tool returned a slow empty payload. A memory block was truncated because the context window filled earlier than expected. The dispatcher routed to the wrong skill because a recent change made one duty's description more generic, so its embeddings now overlap with everything. None of that is visible in the chat output. All of it is in the trace. Only after reading it do I look at what the model said, and by then the output reads less like a mystery and more like a logical consequence.
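The dispatcher case is one you can check for ahead of time. The sketch below assumes each skill description has already been embedded into a vector by whatever model the dispatcher uses; the three-dimensional vectors, the skill names, and the 0.9 threshold are toy values chosen for illustration.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def overlapping_skills(embeddings, threshold=0.9):
    """Return skill pairs whose description vectors are suspiciously close."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            pairs.append((name_a, name_b, round(score, 3)))
    return pairs


# Toy vectors standing in for real description embeddings.
skill_vectors = {
    "billing": [0.9, 0.1, 0.0],
    "general_help": [0.8, 0.2, 0.1],   # description that became too generic
    "lead_lookup": [0.0, 0.1, 0.9],
}
flagged = overlapping_skills(skill_vectors)
```

Here `flagged` contains only the `billing` / `general_help` pair. Run against real embeddings after every description change, a check like this can catch the overlap before a user's phrasing does.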

The kinds of bugs that hide upstream

After enough of these, you recognize the shapes. A timeout that returns an empty array and gets formatted as a valid answer. A handoff where the second agent received only the summary and quietly answered the wrong question. Long-term memory correct three weeks ago, now lying about a customer's plan because nobody updated it. A skill the dispatcher loaded because the user's phrasing happened to embed near its description.

There is also a more disturbing category: silent context loss. The prompt assembler hits a token cap and drops a block on the floor without telling anyone. The agent gets a context that looks fine and is missing the one fact it needed. From the outside this reads as confabulation. Inside the trace it is a faithful answer to a question the agent never fully saw. The model is doing the most honest part of the work. Everything before it is where the lies sneak in.
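Silent context loss stops being silent once the assembler reports what it dropped. The following is a minimal sketch, assuming blocks arrive in priority order and using a crude whitespace word count in place of a real tokenizer; `assemble_prompt` and the block names are hypothetical.

```python
def assemble_prompt(blocks, token_cap, count_tokens=lambda text: len(text.split())):
    """Pack blocks in priority order and report anything that did not fit.

    `blocks` is a list of (name, text) pairs, most important first.
    The default token counter is a whitespace split, used only to keep
    the sketch dependency-free; swap in a real tokenizer in practice.
    """
    included, dropped, used = [], [], 0
    for name, text in blocks:
        cost = count_tokens(text)
        if used + cost <= token_cap:
            included.append(text)
            used += cost
        else:
            # Record the drop so it shows up in the trace instead of vanishing.
            dropped.append({"block": name, "tokens": cost})
    report = {"used": used, "cap": token_cap, "dropped": dropped}
    return "\n\n".join(included), report


prompt, report = assemble_prompt(
    [
        ("instructions", "Answer using only the supplied records."),
        ("history", "user asked about plans " * 2),
        ("customer_plan", "Customer is on the Team plan."),
    ],
    token_cap=15,
)
```

In this run the `customer_plan` block does not fit, and `report["dropped"]` says so. Written to the trace, that one entry turns an apparent confabulation into an obvious assembly bug.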

"When an agent gives you a strange answer, the answer is almost always correct given the context it received. The bug is in what got injected, not in what came out." (Mahmoud Zalt)

What this changed in how we build

Reading the trace first changed the kind of code I write. I am much less interested in clever prompts and much more interested in observable pipelines. Every tool call has a structured log line with input, output, latency, and retry count. Every memory load records what was asked and what came back. Every dispatcher decision has a one-line justification. Every assembled prompt is dumped to a trace, in full. It is the same hygiene you would apply to a payment system.
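As a rough sketch of that structured log line, the wrapper below emits one JSON record per tool call with input, output, latency, and retry count. The names `traced_call` and `lookup_leads`, the field names, and the `sink` parameter are assumptions for illustration, not a description of any particular logging library.

```python
import json
import time


def traced_call(name, fn, args, max_attempts=2, sink=print):
    """Run a tool and emit one JSON log line describing the call."""
    record = {"tool": name, "input": args, "output": None, "error": None, "retries": 0}
    start = time.perf_counter()
    for attempt in range(max_attempts):
        record["retries"] = attempt
        try:
            record["output"] = fn(**args)
            record["error"] = None
            break
        except Exception as exc:  # broad on purpose: the log must capture any failure
            record["error"] = f"{type(exc).__name__}: {exc}"
    record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
    # default=str keeps the log line writable even for non-JSON outputs.
    sink(json.dumps(record, default=str))
    return record


def lookup_leads(limit):
    # Hypothetical tool; a real one would query a database.
    return [{"id": i} for i in range(limit)]


lines = []
record = traced_call("lookup_leads", lookup_leads, {"limit": 2}, sink=lines.append)
```

The `error` field stays populated when every attempt fails, so a swallowed failure still leaves a line behind for whoever opens the trace later.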

It also changed what I consider a finished feature. A new skill is not done because it responds correctly to a few example prompts. It is done when the trace is legible to a human walking in cold, each step explains itself, and a future me at midnight six months from now can read it and know what happened. The agent is allowed to be wrong sometimes. The trace is not allowed to be confusing. The side effect is that my prompts have gotten shorter. When the pipeline is clean, the prompt does not have to defend against every weird input. It just describes the task.

A quieter way to think about it

I have come to think of an AI agent the way I think of any distributed system. The model is one node. The tools, the memory store, the dispatcher, the retry policy, the context assembler: all nodes. When something goes wrong, the answer is in how the nodes interacted, not in the last one to speak. You would not debug a microservice architecture by re-reading the response body. You would open the trace and look at the spans. Agents are no different. They just feel different because the last node happens to speak English.

What surprised me is how often the fix is small. A timeout raised by two seconds. A retry policy that propagates errors instead of swallowing them. A memory update job on a sensible cadence. A dispatcher description tightened so it stops overlapping with another. None of these are model changes. They are unglamorous infrastructure work that makes the model look smart in the trace, which is the only place that matters.

If I had to compress everything into one sentence, it would be this: the model is rarely the part worth blaming. The interesting failures live in the seams between the parts. You find them by reading what flowed between the nodes, not by re-reading what the last node said. That shift, from output-first to trace-first, is the thing I would tell any team starting to operate agents in production.