-
Eval discipline is the missing infrastructure
2026-05-02 Most AI builder teams have no evaluation infrastructure. They run on vibes for months, then post on HN about what broke. Here's why the discipline tax is worth paying, and why most teams skip it anyway.
-
Retrieval for agents: What the experiments actually showed
2026-05-03 Four hypotheses about MAD, low-k, nDCG, and fusion, all tested. The results flip depending on corpus cleanliness. Here's what transferred and what didn't.
-
What our retrieval experiments actually taught us
2026-04-29 Four experiments on SQuAD. The MAD hypothesis, the low-k principle, nDCG vs accuracy, and dynamic fusion, all tested against data. Here's what held and what didn't.
-
Retrieval for agents: Why the classic stack breaks
2026-04-29 On BrowseComp-Plus, perfect retrieval gives 93% accuracy. Weak BM25 gives 14%. That gap is not a reasoning failure; it's a retrieval failure. And it's the least of the classic stack's problems when agents start reasoning over retrieved context.
-
Retrieval for agents is a different stack than retrieval for humans
2026-04-29 The RAG stack was built for human readers. Agents need factual grounding, not coherent explanation. Chunk boundaries hit agents harder. And errors propagate downstream in non-obvious ways. Here's what agent-first retrieval actually looks like.
-
On hybrid search: why your vector DB is not a search engine
2026-04-29 Every retrieval system I've studied eventually hits the same wall: dense embeddings miss exact matches, BM25 misses semantic intent, and fixed-weight hybrids are fragile. Here's what dynamic hybrid search actually looks like.