Sistava

A Null Reading Is Not a Bad Reading

Essay — by Mahmoud Zalt

Why missing signals in monitoring should never default to the worst case. A small story about a heartbeat job that paused active workers overnight, and the rule we now live by: null means we do not have data, not that we have bad data.

The night I learned what null actually meant

We run a background job that pauses idle AI employees to save compute. The signal is simple: a last-seen timestamp on each employee. If the gap between now and last_seen exceeds a threshold, the job pauses that employee. The code reads like every other reaper job written in every other system on earth.

Then someone came back from lunch, opened the app, and their employee was gone. Not deleted. Paused. The conversation it had been holding open was suddenly cold. They had been using it twenty minutes earlier. The employee had a null last_seen because the field had been added in a recent migration and that particular row had not been backfilled yet.

Somewhere upstream, when the reaper computed the gap, the missing timestamp resolved to negative infinity. The current time minus negative infinity is, of course, infinity, which is larger than any threshold you can write. So the reaper decided the employee had been idle since before the heat death of the universe, and paused it. The code was technically correct. It was also, in every way that mattered, wrong.
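A minimal sketch of that failure mode, not the actual reaper code: the function names, the 30-minute threshold, and the use of epoch seconds are all assumptions made for illustration.

```python
import math
from typing import Optional

# Hypothetical threshold; the essay does not state the real value.
IDLE_THRESHOLD_SECONDS = 30 * 60


def naive_idle_gap(now: float, last_seen: Optional[float]) -> float:
    """Buggy on purpose: a missing timestamp is coerced to negative infinity."""
    effective = last_seen if last_seen is not None else -math.inf
    return now - effective  # now - (-inf) == +inf


def naive_should_pause(now: float, last_seen: Optional[float]) -> bool:
    """Infinity exceeds every finite threshold, so a null always pauses."""
    return naive_idle_gap(now, last_seen) > IDLE_THRESHOLD_SECONDS
```

With these definitions, an employee seen 100 seconds ago is kept, while an employee with no last_seen at all is paused, regardless of how the threshold is tuned.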

Null is not bad. Null is unknown.

The mistake was older than the bug. Somewhere in my head, missing data lived in the same drawer as suspicious data: if a field is empty, treat it as the worst case so nothing slips through. That instinct comes from form validation, where a missing required field really is a failure. It does not survive contact with monitoring. The absence of a signal is not a signal of badness. It is a signal that you do not yet know what is happening.

Once you frame it that way, the right default becomes obvious. A null reading is not zero. It is not infinity. It is a question mark. The only safe operation on a question mark is to fall back to a known lower bound you can actually defend. In our case, the defensible lower bound was the employee's hire date. If we have no last-seen, the most we can say is that the employee was last active no earlier than the day it was hired, so it cannot have been idle for longer than it has existed. That is a real, finite number, and it almost never crosses the idle threshold.
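The fallback described above might look roughly like this. It is a sketch under assumptions: the names idle_gap, should_pause, and hired_at, and the 30-minute threshold, are hypothetical and not taken from the real system.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical threshold; the essay does not state the real value.
IDLE_THRESHOLD = timedelta(minutes=30)


def idle_gap(now: datetime, last_seen: Optional[datetime], hired_at: datetime) -> timedelta:
    """Return the longest idle time we can actually defend.

    When last_seen is missing, fall back to the hire date: the employee
    cannot have been idle for longer than it has existed.
    """
    reference = last_seen if last_seen is not None else hired_at
    return now - reference


def should_pause(now: datetime, last_seen: Optional[datetime], hired_at: datetime) -> bool:
    return idle_gap(now, last_seen, hired_at) > IDLE_THRESHOLD
```

The result is always a finite timedelta, so a null can only lead to a pause when the time since hire itself exceeds the threshold.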

A missing signal is not the worst version of a signal. It is the absence of one. The first job of a destructive routine is to know the difference.

Mahmoud Zalt

Reads can be optimistic. Writes must be pessimistic.

There is a separation I keep coming back to in every reliability conversation I have with engineers building agent systems. Reads can be optimistic. When you are showing a dashboard or answering a query, it is fine to take the missing value and present the best plausible interpretation. The user understands a dashboard can be slightly stale, and the cost of guessing wrong on a read is a refresh.

Writes are different. A write that pauses, deletes, disables, or terminates something has to clear a much higher bar. It cannot proceed on inference, and it cannot proceed on a default. It needs positive evidence that the destructive action is justified. If the evidence is missing, the write must abstain. Every time we have caused damage at scale, the root cause was a write that proceeded on the strength of a null.
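One way to make the asymmetry explicit in code is to give the write path a third outcome besides "pause" and "keep". The sketch below is illustrative only: Decision, dashboard_label, decide_pause, and the threshold are hypothetical names and values.

```python
from datetime import timedelta
from enum import Enum
from typing import Optional

# Hypothetical threshold; the essay does not state the real value.
IDLE_THRESHOLD = timedelta(minutes=30)


class Decision(Enum):
    PAUSE = "pause"
    KEEP = "keep"
    ABSTAIN = "abstain"  # no evidence either way, so take no action


def dashboard_label(idle_for: Optional[timedelta]) -> str:
    """Read path: optimistic. A missing value gets the best plausible reading."""
    if idle_for is None:
        return "probably active"
    return "idle" if idle_for > IDLE_THRESHOLD else "active"


def decide_pause(idle_for: Optional[timedelta]) -> Decision:
    """Write path: pessimistic. Without positive evidence, abstain."""
    if idle_for is None:
        return Decision.ABSTAIN
    return Decision.PAUSE if idle_for > IDLE_THRESHOLD else Decision.KEEP
```

The point of the separate ABSTAIN value is that a caller cannot accidentally treat "no data" as either of the two real answers; it has to handle the unknown case by name.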

Every destructive sweep needs a paired recovery sweep

The deeper fix, the one that took the most time to internalize, is that careful code is not enough. Every destructive background job will eventually do the wrong thing to some row. The schema will change underneath it. A new feature will introduce a field it does not know about. The honest design choice is to assume the destructive sweep will sometimes be wrong, and pair it with a recovery sweep that can undo the damage.

In our case, the reaper that pauses idle employees now has a sibling that revives wrongly paused ones. The revive job runs on the same cadence, reads the same signals, and looks specifically for employees that were paused but show evidence of recent activity (a message from the user, a recent task update, anything that suggests the pause was a mistake). It is the symmetric inverse of the destructive job. The destructive job no longer needs to be perfect. It only needs to be roughly right, because the recovery job catches what it gets wrong.
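A rough sketch of such a revive job follows. The Employee fields, the 30-minute activity window, and the function names are assumptions for illustration; the real job reads whatever activity signals the system actually records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional

# Hypothetical window for what counts as "recent" activity.
RECENT_ACTIVITY_WINDOW = timedelta(minutes=30)


@dataclass
class Employee:
    id: str
    paused: bool
    last_user_message_at: Optional[datetime] = None
    last_task_update_at: Optional[datetime] = None


def has_recent_activity(employee: Employee, now: datetime) -> bool:
    """True only on positive evidence; a missing signal proves nothing."""
    signals = [employee.last_user_message_at, employee.last_task_update_at]
    return any(
        ts is not None and now - ts <= RECENT_ACTIVITY_WINDOW for ts in signals
    )


def revive_sweep(employees: Iterable[Employee], now: datetime) -> List[str]:
    """Unpause employees that were paused but show recent activity."""
    revived = []
    for employee in employees:
        if employee.paused and has_recent_activity(employee, now):
            employee.paused = False
            revived.append(employee.id)
    return revived
```

Note that the revive job applies the same rule as the reaper: it acts on evidence that is present, and a paused employee with no activity signals at all is left alone.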

Property-test the null axis, not just the happy path

The last thing that changed was how I write tests for any policy with a destructive branch. The old habit was to test the happy path (signal present, threshold crossed, action taken) and a handful of error cases. That is not enough. The interesting axis is not whether the threshold is crossed. The interesting axis is whether the signal is even there to begin with.

We now write property tests that enumerate the combinations: signal present and recent, signal present and stale, signal missing entirely, signal present but newly added, signal present but malformed. For each combination, the test asserts that no null path can reach the destructive branch unless the lower-bound fallback also justifies it. The test is unglamorous. It catches the kind of bug that took down our agents on a Wednesday night.
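As an illustration of the shape of such a test, here is a small exhaustive check written with the standard library only. It is a sketch, not our test suite: the policy function, the case names, the threshold, and the fixed clock are hypothetical, and the "newly added" state is not modeled.

```python
from datetime import datetime, timedelta, timezone
from itertools import product
from typing import Any, Optional

IDLE_THRESHOLD = timedelta(minutes=30)  # hypothetical threshold
NOW = datetime(2024, 1, 3, 23, 0, tzinfo=timezone.utc)  # fixed clock for the test


def parse_signal(raw: Any) -> Optional[datetime]:
    """Anything that is not a datetime counts as 'no data'."""
    return raw if isinstance(raw, datetime) else None


def should_pause(now: datetime, raw_last_seen: Any, hired_at: datetime) -> bool:
    """Policy under test: fall back to the hire date when the signal is missing."""
    last_seen = parse_signal(raw_last_seen)
    reference = last_seen if last_seen is not None else hired_at
    return now - reference > IDLE_THRESHOLD


SIGNAL_CASES = {
    "present_recent": NOW - timedelta(minutes=5),
    "present_stale": NOW - timedelta(hours=6),
    "missing": None,
    "malformed": "not-a-timestamp",
}
HIRE_AGES = [timedelta(minutes=10), timedelta(days=90)]


def check_null_axis() -> int:
    """Enumerate every signal state against every hire age; return cases checked."""
    checked = 0
    for (name, raw), hire_age in product(SIGNAL_CASES.items(), HIRE_AGES):
        paused = should_pause(NOW, raw, NOW - hire_age)
        if parse_signal(raw) is None:
            # Property: a null path reaches the destructive branch only
            # when the lower-bound fallback justifies it on its own.
            assert (not paused) or hire_age > IDLE_THRESHOLD, name
        checked += 1
    return checked
```

A property-testing library could generate these cases instead of listing them by hand; the important part is that the missing and malformed states are first-class inputs rather than afterthoughts.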

What I keep coming back to

The longer I run this system, the more convinced I become that almost none of the outages I cause are logic errors. The logic is usually fine. The outages are caused by a missing value being treated as a strong negative somewhere in a chain of inferences. A heartbeat that did not arrive is read as a heartbeat that arrived saying no. A tool that did not return a timestamp is treated as a tool that did not run. Each of these is the same shape of bug, dressed up in different clothes.

The rule I now hold across every background job we ship is short enough to fit on one line. Null is not bad, null is unknown. Reads can be optimistic, writes must be pessimistic. Every destructive sweep needs a paired recovery sweep. Property-test the null axis. None of those clauses are clever, and none will impress a reviewer. They are the unfashionable layer of reliability work, and the only part that has held up across two years of running an AI workforce at scale.

I do not think we are done finding this pattern. Every new field we add, every new background job we ship, opens another surface where missing can be confused with bad. The work is not to eliminate that confusion permanently. The work is to notice it earlier each time, default to the most cautious reading when in doubt, and keep the recovery sweep ready for the day the destructive sweep gets it wrong. Quiet, slow, and unglamorous. That is what the work actually looks like.