2026-03-16 · 3 min read

Finding Bugs That Pass Every Test

A .find() with a block body and no return. A retry path that's been dead code since the day it shipped. An await that was never added. Linters, type checkers, and test suites all miss these: the linter sees valid syntax, the type checker sees satisfied contracts, and the tests only cover the behavior you knew to test for.

Correctness bugs. The code runs, no errors thrown, but it silently does the wrong thing.
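A minimal sketch of the first bug class (hypothetical code, not from an actual run): an arrow function converted to a block body loses its implicit return, and .find() quietly stops matching.

```typescript
const users = [
  { id: 1, active: false },
  { id: 2, active: true },
];

// Bug: block body with no `return` — the callback evaluates
// `u.active` and discards it, so every call yields undefined
// and .find() never matches anything.
const buggy = users.find(u => { u.active; });

// Fixed: an expression body returns the predicate's value.
const fixed = users.find(u => u.active);

console.log(buggy); // undefined — no error, silently wrong
console.log(fixed); // { id: 2, active: true }
```

No exception, no type error: Array.prototype.find happily accepts a callback that always returns undefined, and its result type (`T | undefined`) already allows for a miss.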

I built ThreadTrace to find them.


Why everything else misses these

Linters check syntax patterns, type checkers verify type contracts, and tests verify the behavior you thought to test. None of them trace what actually happens when something fails, across file boundaries, into node_modules/ source, all the way through to where the error finally lands.

Most error-handling bugs aren't at the throw site. They're three handlers deep, where someone assumed the caller would deal with it and the caller assumed the callee already did. You'd need to read five files to spot the gap. Nobody does that in code review.
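A hypothetical three-layer chain (names invented for illustration) showing that gap: each layer assumes the other side deals with the failure, and the error lands nowhere.

```typescript
// Layer 3: the actual failure point.
async function fetchRecord(id: string): Promise<string> {
  throw new Error(`record ${id} not found`);
}

// Layer 2: catches and returns null, assuming callers treat
// null as the error signal.
async function loadRecord(id: string): Promise<string | null> {
  try {
    return await fetchRecord(id);
  } catch {
    return null;
  }
}

// Layer 1: assumes loadRecord throws on failure and never
// checks for null.
async function renderRecord(id: string): Promise<string> {
  const record = await loadRecord(id);
  return `<li>${record}</li>`;
}

// Nothing ever throws; the page silently renders "null".
renderRecord("42").then(html => console.log(html)); // <li>null</li>
```

Reviewed file by file, each layer looks defensible. Only a trace across all three shows the failure is unobservable.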


Four agents. No shared memory.

ThreadTrace uses four agents that never share context. It's a structural constraint, not a prompt engineering trick.

Icarus maps the codebase by damage potential. Payment path > logging utility. It ranks; it never investigates.

Theseus takes a seed ("what happens when a batch retry fails?") and traces it all the way through. Every branch, catch, callback, and return, across files, into library source rather than documentation. It documents behavior neutrally: "this function catches the error and returns an empty array." Not "this function incorrectly swallows the error." Premature classification corrupts everything downstream.

Daedalus asks: how do the siblings handle this? It finds 3–5 functions doing the same operation and compares error handling across all of them. Output is a comparison table, not a verdict. "4 out of 5 re-throw. This one swallows."
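A sketch of the kind of outlier that comparison surfaces (sibling functions invented for illustration):

```typescript
type Db = { query(sql: string): Promise<unknown[]> };

// Four of five siblings look like this: failures propagate.
async function syncOrders(db: Db): Promise<unknown[]> {
  return await db.query("SELECT * FROM orders");
}

// The outlier: a failed query is indistinguishable from an
// empty table, so upstream retry/alerting logic never fires.
async function syncInvoices(db: Db): Promise<unknown[]> {
  try {
    return await db.query("SELECT * FROM invoices");
  } catch {
    return [];
  }
}

// A database that is down: every query rejects.
const failing: Db = { query: async () => { throw new Error("db down"); } };

syncOrders(failing).catch(err => console.log(err.message)); // "db down"
syncInvoices(failing).then(rows => console.log(rows)); // [] — silent
```

Either convention can be correct in isolation; the signal is the inconsistency across siblings.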

Minos is the independent verifier. It receives the hypothesis and file paths, nothing else. It has no idea how the hypothesis was formed or what evidence led to it. Default stance: the code is correct. It actively tries to disprove the bug.

If Minos can't disprove it, it writes a concrete test case.


You can't prompt away confirmation bias

The obvious approach: one agent finds something suspicious, reasons about it, decides if it's real. The flaw is structural. The agent that formed the hypothesis is already primed with the evidence. It will confirm.

You can add "be critical" to the system prompt. Doesn't help. The bias comes from what's in the context window, not from the instructions.

Minos can't access the discovery context because it was never given it. Zero false positives across production runs. When Minos confirms a bug, the code is actually broken.


Bug chains

The most useful findings aren't individual bugs. They're chains where bugs interact.

First production run: ThreadTrace found a .find() with a missing return that made an entire retry path dead code. While tracing that path, it discovered three unawaited async calls creating a race condition, invisible because the dead retry prevents them from ever executing.

Fix the .find() without fixing the awaits and you activate the race condition. The bugs are linked, and neither is visible in isolation. ThreadTrace keeps investigating after the first finding because the interesting failures are usually downstream.
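A hypothetical reconstruction of that chain (simplified, with an in-memory persist standing in for the real writes):

```typescript
const writes: string[] = [];

// Stand-in for an async write (DB call, queue publish, etc.).
async function persist(entry: string): Promise<void> {
  writes.push(entry);
}

async function retryFailedBatch(batch: { id: number; failed: boolean }[]) {
  // Bug 1: block body with no `return` — the callback yields
  // undefined for every item, so `target` is always undefined
  // and the branch below is dead code.
  const target = batch.find(item => { item.failed; });

  if (target) {
    // Bug 2: missing `await`s — fire-and-forget writes with no
    // ordering or completion guarantee. Invisible while Bug 1
    // keeps this branch from ever running.
    persist(`retry-${target.id}`);
    persist(`audit-${target.id}`);
  }
}

retryFailedBatch([{ id: 1, failed: true }]).then(() => {
  console.log(writes.length); // 0 — the retry path never executes
});
```

Delete Bug 1's braces and the branch comes alive, together with the race Bug 2 was hiding.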


Limitations

TypeScript/Node.js only. Requires Claude Code for multi-agent spawning. 50–100K tokens per full investigation. Static analysis, no runtime reproduction. Not a replacement for tests. It finds the bugs you didn't know to test for.

ThreadTrace on GitHub