
2026-03-16 · 4 min read

Finding Bugs That Pass Every Test

There's a class of bugs that your linter won't catch, your type checker won't flag, and your tests won't fail on — because you didn't know to test for them. A .find() with a block body and no return. A retry path that's been dead code since the day it was written. An await that was never added.
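Two of these bug classes, sketched in TypeScript (all names and data here are illustrative, not from any real codebase):

```typescript
// Bug class 1 (hypothetical code): .find() with a block body and no
// return. The callback implicitly returns undefined for every element,
// so `match` is always undefined, and nothing errors.
const jobs = [
  { id: 1, status: "failed" },
  { id: 2, status: "done" },
];
const match = jobs.find((job) => {
  job.status === "failed"; // evaluated, then discarded: no `return`
});
console.log(match); // undefined, even though a failed job exists

// Bug class 2 (hypothetical code): a missing await. The promise is
// created but never waited on, so rejections vanish and completion
// ordering is no longer guaranteed.
async function save(job: { id: number }): Promise<void> {
  /* ... write to storage ... */
}
async function processAll(): Promise<void> {
  for (const job of jobs) {
    save(job); // should be `await save(job)`: rejections go unhandled
  }
}
```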

These are correctness bugs. The code runs fine. No errors. It just silently does the wrong thing.

I built ThreadTrace to find them.


Why existing tools miss these

Linters check syntax patterns. Type checkers verify type contracts. Test suites verify the behavior you thought to test. None of them trace what actually happens when something fails — across file boundaries, into library source code, through the full error path until it reaches a terminal state.

Most error-handling bugs aren't at the throw site. They're three handlers deep, where the callee assumed the caller would deal with it and the caller assumed the callee already had.
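A minimal sketch of that mismatch, with hypothetical names:

```typescript
import { readFile } from "node:fs/promises";

// Callee: lets the error propagate, assuming the caller will deal with it.
async function readConfig(path: string): Promise<string> {
  return readFile(path, "utf8"); // no try/catch here, by design
}

// Caller: assumes readConfig already handled I/O failure, so a missing
// file surfaces as an unhandled rejection, frames away from the actual
// throw site.
async function startup(): Promise<void> {
  const raw = await readConfig("./config.json");
  console.log("loaded", raw.length, "bytes");
}
```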


The architecture: four agents, independent context

ThreadTrace uses four agents that never share memory. This isn't a prompt trick — it's structural.

Icarus scouts the codebase and ranks everything by damage potential. If this fails silently, how bad is it? A bug in the payment path matters more than a bug in a logging utility. Icarus maps. It never investigates.

Theseus receives a seed question — something like "what happens when a batch retry fails?" — and traces the error path end-to-end. It follows every branch, catch, callback, and return. It crosses file boundaries. It reads library source in node_modules/, not documentation. Theseus documents behavior neutrally: "this function catches the error and returns an empty array." Not "this function incorrectly swallows the error." Premature classification corrupts everything downstream.
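Here's the kind of function Theseus might trace (everything below is illustrative, including the stand-in data layer):

```typescript
interface Order { id: string }

// Stand-in data layer that fails, to make the error path observable.
const db = {
  orders: {
    async findByUser(_userId: string): Promise<Order[]> {
      throw new Error("connection refused");
    },
  },
};

// What Theseus records here, neutrally: "this function catches the
// error and returns an empty array." Whether that is correct depends
// on context it doesn't have yet, so it does not decide.
async function fetchRecentOrders(userId: string): Promise<Order[]> {
  try {
    return await db.orders.findByUser(userId);
  } catch {
    return []; // terminal state: the caller sees "no orders", not a failure
  }
}
```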

Daedalus takes the trace and asks: how do the siblings handle this? It finds 3-5 functions that do the same kind of operation and compares error handling across all of them. The output is a comparison table, not a verdict. "4 out of 5 re-throw. This one swallows."
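A sketch of the kind of sibling divergence Daedalus surfaces, with hypothetical functions:

```typescript
type FetchJson = (url: string) => Promise<unknown>;

// Most siblings look like this: the error propagates.
async function loadUsers(fetchJson: FetchJson) {
  try {
    return await fetchJson("/users");
  } catch (err) {
    throw err; // re-throws, like 4 of the 5 siblings
  }
}

// The outlier: same operation class, but the error terminates here.
async function loadInvoices(fetchJson: FetchJson) {
  try {
    return await fetchJson("/invoices");
  } catch {
    return null; // swallows; the caller can't tell failure from "no data"
  }
}
```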

Minos is the key. Minos is the independent verifier. It receives the hypothesis and file paths — nothing else. It never sees how the hypothesis was formed, the confidence score, or the evidence from Theseus or Daedalus. Its default stance is that the code is correct. It actively searches for reasons the hypothesis is wrong.

If Minos can't disprove it, it writes a concrete test case demonstrating the bug.


Why context isolation matters

The standard approach: one agent finds something suspicious, reasons about it, and decides if it's a real bug. This has a structural flaw. The agent that formed the hypothesis is motivated to confirm it. It's already primed with the evidence, the narrative.

You can't prompt away confirmation bias. You can only architect around it. Minos literally cannot access the discovery context because it was never provided.

The result across production runs: zero false positives. When Minos confirms a bug, the code is actually broken.


Bug chains

The most valuable findings aren't individual bugs — they're chains where bugs interact.

In the first production run, ThreadTrace found a .find() with a missing return statement that made an entire retry path dead code. While tracing that path, it discovered three unawaited async calls that create a race condition — currently invisible because the dead retry prevents them from executing.

Fixing the .find() without adding the missing awaits would activate the race condition. The bugs are linked. Neither is visible in isolation. This is why ThreadTrace continues investigating after finding a bug instead of stopping.
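The shape of that chain, sketched with hypothetical names (this is not the actual code from the run):

```typescript
interface Job { id: number; status: string }

let requeued = 0;
async function requeue(_job: Job): Promise<void> {
  requeued++; // stand-in for writing to a retry queue
}

async function retryFailed(jobs: Job[]): Promise<void> {
  const failed = jobs.find((job) => {
    job.status === "failed"; // bug 1: block body, no return: always undefined
  });
  if (!failed) return; // always taken, so everything below is dead code
  // bug 2: three unawaited calls. Invisible today; the moment bug 1 is
  // fixed, they run concurrently and their rejections are lost.
  requeue(failed);
  requeue(failed);
  requeue(failed);
}
```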


Limitations

It's optimized for TypeScript/Node.js. It requires Claude Code for multi-agent spawning. Each full investigation uses 50-100K tokens. It's static analysis — no runtime reproduction. And it's not a replacement for tests. It finds the bugs you didn't know to test for.

ThreadTrace on GitHub