
What our retrieval experiments actually taught us

2026-04-29

A few weeks ago I wrote about why the classic retrieval stack breaks for agents. MAD, verifiable APIs, precision over recall. The claims felt right based on the papers. But papers are not evidence — they're arguments. So we ran the experiments.

Four experiments. SQuAD passages, a local Qdrant instance, and a sentence transformer model. Here's what the data actually showed.
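For context, here is a minimal sketch of the shared setup: embedding SQuAD passages with all-MiniLM-L6-v2 and indexing them in a local Qdrant collection. The collection name and the deduplication step are my own illustrative choices, not details taken from the experiments.

```python
# Shared setup sketch: embed SQuAD passages with all-MiniLM-L6-v2 and index
# them in a local Qdrant collection (384-dim vectors, cosine similarity).
# Collection name and deduplication are illustrative assumptions.
from datasets import load_dataset
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient("localhost", port=6333)  # local instance

# SQuAD repeats the same context across questions, so deduplicate first.
rows = load_dataset("squad", split="train[:1000]")
passages = sorted(set(rows["context"]))

client.recreate_collection(
    collection_name="squad_passages",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
vectors = model.encode(passages, normalize_embeddings=True)
client.upsert(
    collection_name="squad_passages",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": text})
        for i, (text, vec) in enumerate(zip(passages, vectors))
    ],
)
```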

Experiment 1 — Distractor Cost vs Missing Relevant Docs

Hypothesis: A plausible-but-wrong document is worse than no document, i.e. a distractor degrades accuracy more than removal.
Setup: Found questions where the correct answer appeared in the top 5, then created two variants: (A) removed the correct doc, (B) replaced it with the most similar non-answer doc. Measured the accuracy drop in both (sketched below).
Corpus: SQuAD train[:1000], all-MiniLM-L6-v2 embeddings, 384-dim, cosine similarity.
Result: Inconclusive. 32 evaluable examples, and 0 cases where the distractor outranked the correct doc. SQuAD's short passages and strong query-answer overlap mean distractors never win.
Finding: MAD is corpus-dependent. On clean academic corpora this failure mode doesn't manifest; it requires noisy, large-scale environments (BrowseComp-Plus scale) where semantic adjacency is genuinely ambiguous.
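A minimal sketch of the variant construction, assuming the index and model from the setup sketch above. `answer_in` and `retrieve` are hypothetical helpers; the real evaluation feeds each variant's passages to a model and scores its answer.

```python
# Experiment 1 sketch: build variant A (gold passage removed) and variant B
# (gold passage swapped for the most similar passage that lacks the answer).
# `answer_in` and `retrieve` are hypothetical helpers around the index above.
def answer_in(text: str, answers: list[str]) -> bool:
    return any(a.lower() in text.lower() for a in answers)

def retrieve(question: str, k: int) -> list[str]:
    vec = model.encode(question, normalize_embeddings=True).tolist()
    hits = client.search(collection_name="squad_passages", query_vector=vec, limit=k)
    return [h.payload["text"] for h in hits]

def build_variants(question: str, answers: list[str], k: int = 5):
    ranked = retrieve(question, k=20)   # deeper list to draw a distractor from
    top_k = ranked[:k]
    if not any(answer_in(p, answers) for p in top_k):
        return None                     # not evaluable: gold doc not in top-k
    variant_a = [p for p in top_k if not answer_in(p, answers)]
    distractor = next(
        (p for p in ranked if not answer_in(p, answers) and p not in variant_a), None
    )
    if distractor is None:
        return None
    variant_b = variant_a + [distractor]
    return variant_a, variant_b
```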

Experiment 2 — Low-k vs High-k Precision

Hypothesis: Retrieving 3 high-precision docs outperforms retrieving 10 loosely relevant ones. Classic MAD logic: more docs, more distraction risk.
Setup: Ran retrieval at k=3, 5, 10, 15, and 20, and measured the answer hit rate at each k on 500 validation questions (sketched below).
Corpus: SQuAD validation[:500], same indexed collection as Exp 1.
Result: Contrary to hypothesis. k=3: 8.4% → k=5: 8.4% → k=10: 12.8% → k=15: 15.6% → k=20: 17.4%. More docs consistently helps; no distractor penalty observed.
Finding: Same root cause as Exp 1: on a clean corpus, higher k means more coverage with no precision penalty. The low-k principle applies in adversarial retrieval environments, not clean ones.
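The hit-rate sweep is the simplest measurement of the four. A sketch, assuming the `retrieve` and `answer_in` helpers from the Experiment 1 sketch and a `questions` list of (question, gold answers) pairs from validation[:500]:

```python
# Experiment 2 sketch: answer hit rate at increasing k. A question counts
# as a hit if any top-k passage contains one of its gold answer strings.
def hit_rate_at_k(questions, k: int) -> float:
    hits = sum(
        any(answer_in(p, answers) for p in retrieve(question, k))
        for question, answers in questions
    )
    return hits / len(questions)

for k in (3, 5, 10, 15, 20):
    print(f"k={k}: {100 * hit_rate_at_k(questions, k):.1f}%")
```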

Experiment 3 — IR Metrics vs Downstream Accuracy

Hypothesis: Classic IR metrics (nDCG, MRR) don't predict whether the model actually answers correctly; UDCG's 36% better correlation holds.
Setup: Computed nDCG@10 and hit@10 for 200 validation questions and measured the Spearman correlation between the two (sketched below). Also compared BM25 vs dense embedding accuracy directly.
Corpus: SQuAD validation[:200] against the indexed train passages.
Result: Partially supported. Spearman correlation of 0.996, but that number is circular: both metrics are built on binary answer presence. The real finding is BM25 at 10% accuracy vs dense at 16.3%. Different retrievers disagree on what "relevant" means even when they rank similarly.
Finding: The nDCG vs accuracy divergence shows up when retrievers disagree. On SQuAD they agree; on noisy corpora with hard negatives they likely don't. The UDCG advantage is probably real in adversarial environments.
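A sketch of the metric comparison, reusing the same helpers and again treating relevance as binary answer presence, which is exactly the circularity flagged in the result. `scipy.stats.spearmanr` handles the correlation.

```python
import numpy as np
from scipy.stats import spearmanr

# Experiment 3 sketch: per-question nDCG@10 and hit@10 over binary
# answer-presence relevance, then the Spearman correlation between them.
def ndcg_at_k(relevance: list[int], k: int = 10) -> float:
    rel = relevance[:k]
    dcg = sum(r / np.log2(i + 2) for i, r in enumerate(rel))
    idcg = sum(r / np.log2(i + 2) for i, r in enumerate(sorted(rel, reverse=True)))
    return dcg / idcg if idcg > 0 else 0.0

ndcg_scores, hit_scores = [], []
for question, answers in questions:
    relevance = [int(answer_in(p, answers)) for p in retrieve(question, k=10)]
    ndcg_scores.append(ndcg_at_k(relevance))
    hit_scores.append(int(any(relevance)))

rho, _ = spearmanr(ndcg_scores, hit_scores)
print(f"Spearman(nDCG@10, hit@10) = {rho:.3f}")
```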

Experiment 4 — Dynamic Fusion vs Fixed-Weight Hybrid

Hypothesis: Per-query-type fusion weights (factual vs conceptual) outperform a fixed 50/50 BM25+dense hybrid.
Setup: Classified queries as factual (who/when/where) or conceptual (why/how), built BM25 and dense retrievers, and compared the fixed 50/50 hybrid against each retriever's accuracy per query type (sketched below).
Corpus: SQuAD validation[:300], rank-bm25 + all-MiniLM-L6-v2.
Result: Supported. Dense: 9.4% factual, 23.2% conceptual. BM25: 8.1% factual, 11.9% conceptual. Fixed hybrid: 10.1% factual, 18.5% conceptual. Fixed fusion hurts conceptual queries, dragging dense from 23.2% down to 18.5%.
Finding: Query-type routing is the right architecture. Dense is much better for conceptual queries; for factual queries, fusing in BM25 gives a small edge (10.1% vs 9.4% dense-only), but the fixed 50/50 weight is the wrong default because it optimizes for the wrong query type.
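A sketch of the fixed 50/50 hybrid and the per-type routing it loses to, reusing `model` and `passages` from the setup sketch. The min-max normalization, the keyword classifier, and the routed weights are illustrative assumptions, not the experiment's exact configuration.

```python
import numpy as np
from rank_bm25 import BM25Okapi

# Experiment 4 sketch: fixed 50/50 BM25+dense fusion vs per-type routing.
# Normalization, classifier, and routed weights are illustrative choices.
tokenized = [p.lower().split() for p in passages]
bm25 = BM25Okapi(tokenized)
dense_vectors = model.encode(passages, normalize_embeddings=True)

def classify(query: str) -> str:
    first = query.lower().split()[0]
    return "factual" if first in {"who", "when", "where"} else "conceptual"

def minmax(scores: np.ndarray) -> np.ndarray:
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def hybrid_rank(query: str, w_dense: float = 0.5, k: int = 10) -> list[str]:
    bm25_scores = np.array(bm25.get_scores(query.lower().split()))
    q_vec = model.encode(query, normalize_embeddings=True)
    dense_scores = dense_vectors @ q_vec        # cosine on normalized vectors
    fused = w_dense * minmax(dense_scores) + (1 - w_dense) * minmax(bm25_scores)
    return [passages[i] for i in np.argsort(-fused)[:k]]

def routed_rank(query: str, k: int = 10) -> list[str]:
    # Per-type weights instead of a global 50/50: keep fusion for factual
    # queries, go dense-only for conceptual ones.
    w = 0.5 if classify(query) == "factual" else 1.0
    return hybrid_rank(query, w_dense=w, k=k)
```

The classifier here is deliberately crude; the point is that the fusion weight, not the classifier, is what needs to change per query type.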

What this means for the original claims

The claims from the earlier post don't collapse — they get qualified by corpus type:

"Precision over recall" — holds in noisy environments, not clean ones. On SQuAD, recall consistently helps. The distractor risk that MAD describes is real but corpus-dependent.

"Low-k as primary stabilization" — only in adversarial retrieval. On clean corpora, higher k improves accuracy monotonically.

"UDCG outperforms nDCG" — probably true, but only in environments where retrievers disagree on what "relevant" means. On SQuAD they agree and nDCG works fine.

"Query-type routing beats fixed fusion" — supported. The data shows clear query-type specialization that fixed fusion can't exploit.

The information theory framing from the earlier post — retrieval as context injection, entropy of distractors, mutual information — is still the right way to think about this. But the experimental support for the specific claims is limited to the conditions we tested. Clean corpora and noisy production environments behave differently, and the design principles need to know which regime you're in.


References