Observability for Autonomous Systems Is a Presence Problem
Essay — — by Mahmoud Zalt
The trace looked fine. The agent was not there.
A few months in, I sat down to debug a single agent. The user said it had stopped helping. I pulled the trace. Every span was green. Tool calls returned. Token counts looked normal. The workflow completed. From every dashboard I owned, this agent was healthy.
Then I read the actual messages. The agent was answering, but no longer answering the user in front of it. It was producing a generic version of the request, the kind of reply you write when you have stopped paying attention but still want the meeting to end. It remembered the last topic. It had forgotten the person. None of that shows up in a trace. There is no span called context drift. No metric for emotional absence. That was the moment I stopped trusting the green check.
Stack traces show code. Agent traces have to show attention.
For normal software, a stack trace is a clean object. You can read which line ran, what came in, what came out. The system either did the thing or threw. Observability for a regular backend is the art of compressing that into counters and percentiles, then alerting when one of them moves. Numbers reduce to truth, and the truth is mostly about did the code execute correctly.
An agent is doing something different. It is not just executing. It is deciding. It is choosing which tool to call, which memory to lean on, which sub-agent to hand off to, whether to ask a question or just answer. Two agents can run the same code path and produce completely different outcomes, because the interesting work happened in the prompt, the context, and the silent ranking of options inside a model. The log shows the choice it made, never the choices it considered and rejected.
What I need to see is closer to a colleague's working notes. What did the agent think the goal was. What did it remember. Which tool did it reach for, and why. When it handed off, did it pass the right context, or a polite summary that lost the thing that mattered. That kind of log is not about errors. It is about whether the agent was actually present.
A null reading is not a healthy reading
The most expensive mistakes I have made running this workforce share a pattern. A signal was missing. The dashboard treated missing as fine. Days later, a user came back to find their agent had been quietly useless, or paused, or hallucinating against an empty memory, and every monitor said all systems green.
There is a deep difference between a healthy reading and a null reading, and most observability stacks blur the two. If a tool did not log a span, that does not mean it ran cleanly: it probably means it did not run at all, or the instrumentation broke, or the agent silently decided not to call it. If a memory layer returned no rows, the lookup may have failed and the agent improvised from nothing. The screen still shows zero errors.
I now treat the absence of a signal as its own signal. Every span the agent should have left behind, but did not, is a question the system has to ask out loud. These checks exist to catch attention going missing, before the user does.
The record I actually want is a record of attention
The traces I find genuinely useful do not look like spans. They look like a diary. Here is what the agent believed the user wanted. Here is the memory it pulled, and the memory it ignored. Here is the moment it decided to ask a clarifying question instead of guessing. Here is where it handed off, and the exact context it carried with it. The interesting events are not errors. They are decisions, and the small reasons behind them.
That changes how I instrument. I no longer try to wrap every function with a timer. I try to make sure every meaningful decision leaves a trace of why. Why this tool over that one. Why this skill loaded. Why this sub-agent got the task. Why the model heard question A instead of B. It is the only material that lets me, weeks later, tell whether the agent was present or just running through motions.
A finished trace tells you the work happened. A record of attention tells you whether anyone was home while it did.
Hand-offs are where presence quietly dies
The hardest case is multi-agent work. Every hand-off is a moment where the system has to decide what to pass forward, and that decision is almost always lossy. Names get dropped. Specific wording the user cared about gets summarized away. A clarification the first agent earned with effort gets reduced to one sentence the next agent has to guess at.
From a metrics standpoint, every hand-off looks the same: a message went from one workflow to another. From a presence standpoint, hand-offs are where the agent system stops paying attention to the user and starts paying attention to itself. The receiving agent picks up the abstracted summary, treats it as the whole story, and moves forward in a direction the user never asked for. No error. No retry. Just a small loss of context the user feels as the agent not getting it. What I instrument now is the diff: what the originating agent knew that the receiving agent did not.
What I tell myself before reading a dashboard
Green is not evidence. Done is not evidence. Counts are not evidence. The only thing that counts as evidence is a trace of attention: the decisions, and the context the agent had when it made them. For an autonomous system, observability is part of the product surface. The user can feel the difference between an agent paying attention and one going through motions, even when every internal monitor says fine. The job of the trace is to let me feel that same difference, before they do.
Running this workforce has slowly turned me into a worse engineer and a slightly better manager. The instinct to add a counter and call it done does not survive contact with a system that can be technically perfect and emotionally absent at the same time. The work that matters now is quieter: traces that record attention, silence treated as a signal, hand-offs watched the way you would watch a junior teammate cover for someone on holiday. None of it fits cleanly in a dashboard tile, which is probably why it matters.