Eval discipline is the missing infrastructure

2026-05-02

A thread appeared on HN last week: someone explaining, in detail, why Codex worked better than Claude Code for their production monolith. Specific failure modes, concrete examples, honest comparison. The kind of thing that should be straightforward to evaluate — run your codebase through both, measure what actually breaks.

What the thread revealed, indirectly, was that nobody had done that. The person had been going on vibes for months before the mismatch became obvious enough to post about. That's not a failure of the individual. It's a structural problem: evaluation infrastructure for AI-assisted development is nonexistent at most teams.

The tooling gap is real

Build a web app and you have test suites, CI/CD pipelines, and error rates in production. You know when something broke. Swap in an LLM code assistant and none of that tells you how the assistant itself is doing. Your "eval" is whether the feature works, which involves the entire stack, most of which you didn't touch. Attributing a failure to the model rather than to your own code is nearly impossible without deliberate instrumentation.

This is why benchmark culture matters. Not because benchmarks are ground truth — they're not — but because they're a forcing function. When you have to measure something, you have to decide what matters. That decision reveals your actual priorities, not the ones in your README.

The benchmark situation in AI is bad. Every foundation model provider publishes numbers that are hard to compare across tasks, often drawn from runs nobody else can reproduce, and selected to look good against competitors. But bad benchmarks are better than no benchmarks, because they at least give you a thread to pull on.

What eval discipline actually looks like

It starts with task-level measurement. Before you commit to a tool, you define what "better" means for your work. Not just "feels faster" — actual output quality on your actual data. If you're using a code model, that means your eval set is a sample of your codebase, with known failure modes, run through both tools and compared on correctness, not vibes.
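To make the side-by-side comparison concrete, here is a minimal sketch of a paired eval harness. Everything in it is an assumption for illustration: the `EvalCase` shape, the `compare_tools` function, and the two stub tools stand in for whatever cases and assistant calls you actually have. The substring checks are deliberately crude; in practice the check would more likely be a real test run against the generated change.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical interface: a tool takes a prompt and returns generated code.
Tool = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    check: Callable[[str], bool]  # True when the output is correct for this case


def compare_tools(cases: Iterable[EvalCase], tool_a: Tool, tool_b: Tool):
    """Run both tools on every case and tally the paired outcomes."""
    tally = {"both_pass": 0, "only_a": 0, "only_b": 0, "both_fail": 0}
    disagreements = []  # ids of cases where exactly one tool passed
    for case in cases:
        a_ok = case.check(tool_a(case.prompt))
        b_ok = case.check(tool_b(case.prompt))
        if a_ok and b_ok:
            tally["both_pass"] += 1
        elif a_ok:
            tally["only_a"] += 1
            disagreements.append(case.case_id)
        elif b_ok:
            tally["only_b"] += 1
            disagreements.append(case.case_id)
        else:
            tally["both_fail"] += 1
    return tally, disagreements


# Stand-in tools so the sketch runs offline; replace with real assistant calls.
def stub_tool_a(prompt: str) -> str:
    return {"add": "return a + b", "sub": "return a + b"}.get(prompt, "")


def stub_tool_b(prompt: str) -> str:
    return {"add": "return a + b", "sub": "return a - b"}.get(prompt, "")


# Toy cases; a real eval set would be sampled from your own codebase.
CASES = [
    EvalCase("add", "add", lambda out: "a + b" in out),
    EvalCase("sub", "sub", lambda out: "a - b" in out),
    EvalCase("mul", "mul", lambda out: "a * b" in out),
]

if __name__ == "__main__":
    tally, disagreements = compare_tools(CASES, stub_tool_a, stub_tool_b)
    print(tally)
    print(disagreements)
```

The disagreement list is the useful part. Cases where both tools pass or both fail tell you little about which one to pick; the cases where exactly one passes are the ones worth reading by hand.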

For retrieval systems, which is where I've been spending time, eval discipline means measuring downstream task accuracy, not only retrieval metrics such as top-k hit rate or recall. Say the index gets 92% recall on a benchmark. Does it help the agent answer the user's question correctly? That's a different measurement, and the gap between them can be large.
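A small sketch of the two measurements side by side, on made-up data. The `QueryResult` record, the function names, and the normalized exact-match scoring are all assumptions chosen for illustration; real answer grading is usually messier than string equality.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    retrieved: list[str]   # ranked document ids returned by the index
    relevant: set[str]     # document ids judged relevant for the query
    agent_answer: str      # what the agent said after reading the retrieved docs
    gold_answer: str       # the reference answer


def recall_at_k(result: QueryResult, k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not result.relevant:
        return 0.0
    hits = len(set(result.retrieved[:k]) & result.relevant)
    return hits / len(result.relevant)


def mean_recall_at_k(results: list[QueryResult], k: int) -> float:
    return sum(recall_at_k(r, k) for r in results) / len(results)


def answer_accuracy(results: list[QueryResult]) -> float:
    """Share of queries where the answer matches gold after normalization."""
    def norm(text: str) -> str:
        return text.strip().lower()
    correct = sum(norm(r.agent_answer) == norm(r.gold_answer) for r in results)
    return correct / len(results)


# Fabricated toy data, only to show the two numbers diverging.
RESULTS = [
    QueryResult(["d1", "d2", "d3"], {"d1"}, "Paris", "paris"),
    # Perfect retrieval, wrong answer: the case recall alone hides.
    QueryResult(["d4", "d5", "d6"], {"d4", "d5"}, "1999", "2001"),
    # Partial retrieval, right answer.
    QueryResult(["d7", "d8", "d9"], {"d9", "d10"}, "blue", "blue"),
]

if __name__ == "__main__":
    print(f"recall@3:        {mean_recall_at_k(RESULTS, 3):.3f}")
    print(f"answer accuracy: {answer_accuracy(RESULTS):.3f}")
```

The second record is the interesting one: every relevant document was retrieved and the answer was still wrong. A retrieval-only metric scores that query as a success.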

For IndexZero specifically, we built the eval infrastructure first, before anything else. Students implement a reference IR system, run it against a standardized test collection, and every implementation is scored on the same metrics. The point isn't that we have great benchmarks — we're using standard academic collections. The point is that there's a measurement where there used to be none, and that changes what the learning process looks like.

The discipline tax is real, and most teams skip it

It's not free to build eval infrastructure. It takes time. It requires you to define what good looks like before you have a working system. It produces numbers that sometimes disappoint you. For early-stage teams moving fast, all of that feels like friction against the thing they're trying to build.

The bet is that the friction is worth it later. A team that knows, in numbers, how their system performs can make better decisions about where to invest. They catch regressions before they reach production. They have data instead of opinions when a vendor sends them a benchmark comparison.
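Catching regressions can start very small. The sketch below assumes you already store a baseline of eval scores somewhere and that higher is better for every metric; the metric names, the numbers, and the `find_regressions` helper are hypothetical, and the tolerance is something you would tune against run-to-run noise.

```python
from typing import Optional


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, tuple[float, Optional[float]]]:
    """Return metrics that dropped by more than `tolerance` versus baseline.

    Assumes higher is better. A metric missing from `current` is reported
    too, since a silently dropped measurement is its own kind of regression.
    """
    regressions: dict[str, tuple[float, Optional[float]]] = {}
    for name, base_score in baseline.items():
        cur_score = current.get(name)
        if cur_score is None:
            regressions[name] = (base_score, None)
        elif base_score - cur_score > tolerance:
            regressions[name] = (base_score, cur_score)
    return regressions


if __name__ == "__main__":
    # Illustrative numbers only.
    baseline = {"answer_accuracy": 0.81, "recall_at_10": 0.92}
    current = {"answer_accuracy": 0.78, "recall_at_10": 0.925}

    regressions = find_regressions(baseline, current)
    for name, (before, after) in regressions.items():
        print(f"REGRESSION {name}: {before} -> {after}")
    # In CI you would exit non-zero here so the change is blocked.
    raise SystemExit(1 if regressions else 0)
```

Run as a CI step, a check like this turns the eval set from a report people occasionally read into a gate that changes have to pass.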

Most teams don't make that bet. Then they spend months on the wrong stack and post about it on HN.

References

Why Codex works better than Claude Code for my production monolith — news.ycombinator.com/item?id=47945185
SimEval-IR: A Unified Toolkit and Benchmark Suite for Evaluating User Simulators and Search Sessions — arxiv.org/abs/2604.27878