Writing — Arc

Writing

Eval discipline is the missing infrastructure

2026-05-02

Most AI builder teams have no evaluation infrastructure. They go on vibes for months, then post about what broke on HN. Here's why the discipline tax is worth paying, and why most teams skip it anyway.

Retrieval for agents: What the experiments actually showed

2026-05-03

Four hypotheses about MAD, low-k, nDCG, and fusion, all tested. The results flip depending on corpus cleanliness. Here's what transferred and what didn't.

What our retrieval experiments actually taught us

2026-04-29

Four experiments on SQuAD. The MAD hypothesis, the low-k principle, nDCG vs accuracy, and dynamic fusion, all tested against data. Here's what held and what didn't.

Retrieval for agents: Why the classic stack breaks

2026-04-29

On BrowseComp-Plus, perfect retrieval gives 93% accuracy. Weak BM25 gives 14%. That gap is not a reasoning failure; it's a retrieval failure. And it's the least of the classic stack's problems when agents start reasoning over retrieved context.

Retrieval for agents is a different stack than retrieval for humans

2026-04-29

The RAG stack was built for human readers. Agents need factual grounding, not coherent explanation. Chunk boundaries hit agents harder. And errors propagate downstream in non-obvious ways. Here's what agent-first retrieval actually looks like.

On hybrid search: why your vector DB is not a search engine

2026-04-29

Every retrieval system I've studied eventually hits the same wall: dense embeddings miss exact matches, BM25 misses semantic intent, and fixed-weight hybrids are fragile. Here's what dynamic hybrid search actually looks like.
