Retrieval for agents: What the experiments actually showed
The last post laid out four hypotheses about agentic retrieval, grounded in the mutually assured distraction (MAD) framework. Four experiments later, here's what held, what didn't, and why the answer depends entirely on where you're searching.
The setup
Test corpus: SQuAD train (1,000 passages, 1,000 questions). Retrievers: BM25 (classical, high recall) and all-MiniLM-L6-v2 (dense embeddings). Metric: downstream answer accuracy on a held-out set.
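For readers who haven't looked inside the lexical side of this comparison, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the implementation used in the experiments: the whitespace tokenizer, the `k1` and `b` defaults, and the non-negative IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Non-negative IDF variant (an assumption, not from the post).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents get less credit per match.
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores
```

Because scoring depends only on exact term matches, a passage that paraphrases the question without sharing its words scores zero, which is the gap the dense retriever is meant to cover.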
H1: Plausible distractors degrade accuracy more than missing docs
Only 32 examples were evaluable, and in none of them did a distractor outrank the correct document.
SQuAD is built for factoid QA — passages are short, query-answer overlap is strong, and the correct answer is usually obvious. In this environment, MAD dynamics simply don't emerge. There's no semantic adjacency trap because the signal is unambiguous.
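The H1 check can be sketched as a small classifier over ranked results. The post doesn't define "evaluable", so this sketch assumes it means the ranking contains both the gold passage and at least one labeled distractor; the function names and ID format are hypothetical.

```python
from collections import Counter


def classify_example(ranked_ids, gold_id, distractor_ids):
    """Label one ranking as 'distractor_first', 'gold_first', or 'not_evaluable'.

    Assumption: an example is evaluable only if the ranking contains the
    gold passage and at least one labeled distractor.
    """
    if gold_id not in ranked_ids:
        return "not_evaluable"
    gold_rank = ranked_ids.index(gold_id)
    distractor_ranks = [
        i for i, doc_id in enumerate(ranked_ids) if doc_id in distractor_ids
    ]
    if not distractor_ranks:
        return "not_evaluable"
    # A lower index means a higher position in the ranking.
    return "distractor_first" if min(distractor_ranks) < gold_rank else "gold_first"


def tally(examples):
    """Count outcomes over (ranked_ids, gold_id, distractor_ids) triples."""
    return Counter(classify_example(*example) for example in examples)
```

Under this definition, the H1 result corresponds to a tally with 32 `gold_first` examples and no `distractor_first` ones.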
What this means: The "precision over recall" principle is right for production environments where retrievers operate on noisy, high-diversity corpora (BrowseComp-scale). On clean academic corpora the tradeoff inverts: you want more recall, because there's little to fear from additional documents.
H2: Low-k precision beats high-k recall
| k | Accuracy |
|---|---|
| 3 | 8.4% |
| 10 | 12.8% |
| 20 | 17.4% |
Retrieving more documents consistently helps on clean corpora: accuracy roughly doubles from k=3 to k=20 (8.4% to 17.4%). The distractor risk from high k doesn't apply when the extra candidates are rarely misleading.
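A k sweep like the one in the table can be sketched as a loop over pluggable retrieve and answer functions. The post doesn't say how accuracy was scored, so this sketch assumes SQuAD-style normalized exact match; `retrieve` and `answer` are hypothetical callables standing in for the retriever and the reader model.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def accuracy_at_k(examples, retrieve, answer, ks=(3, 10, 20)):
    """Return {k: accuracy} for each retrieval depth k.

    examples: list of (question, gold_answer) pairs.
    retrieve: hypothetical callable (question, k) -> list of passages.
    answer:   hypothetical callable (question, passages) -> predicted answer.
    """
    results = {}
    for k in ks:
        correct = 0
        for question, gold in examples:
            passages = retrieve(question, k)
            prediction = answer(question, passages)
            # Assumption: accuracy is normalized exact match against the gold answer.
            correct += int(normalize(prediction) == normalize(gold))
        results[k] = correct / len(examples)
    return results
```

Holding the reader fixed and varying only `k` keeps the comparison about retrieval depth rather than about the model answering the question.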
What this means: Low-k is a production stabilization strategy, not a universal best practice. On SQuAD, a low k just leaves accuracy on the table. The MAD dynamic and the low-k/high-k tradeoff are two sides of the same coin: both flip based on corpus noise.
H3: Classic IR metrics don't predict downstream accuracy
On SQuAD, nDCG and hit rate are nearly perfectly correlated (Spearman 0.996) — but this is circular on a binary answer-presence corpus. Both metrics reward the same thing: whether the document contains the answer span.
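To see why the two metrics tend to move together here, consider the case of exactly one relevant passage per question (an assumption that matches answer-presence labeling, not something the post states outright). Both metrics then become functions of a single number, the rank of the gold passage. The sketch below also includes a small Spearman implementation with average ranks for ties; all names are illustrative.

```python
import math


def hit_at_k(gold_rank, k):
    """1.0 if the gold passage is in the top k. Ranks are 1-indexed; None = not retrieved."""
    return 1.0 if gold_rank is not None and gold_rank <= k else 0.0


def ndcg_at_k(gold_rank, k):
    """nDCG@k with a single relevant passage, so the ideal DCG is 1."""
    if gold_rank is None or gold_rank > k:
        return 0.0
    return 1.0 / math.log2(gold_rank + 1)


def average_ranks(values):
    """Rank values ascending from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    # Undefined (division by zero) if either input is constant.
    return cov / math.sqrt(var_x * var_y)
```

Both metrics are zero when the gold passage falls outside the top k and positive otherwise, so averaged over many questions they rarely disagree about which configuration is better. They can still diverge in principle, since nDCG also rewards placing the gold passage higher within the top k.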
The real finding is harder: BM25 gets 10% accuracy, dense embeddings get 16.3%. Same corpus, same questions, dramatically different downstream outcomes despite both retrievers performing well by IR metrics.
This suggests that nDCG and accuracy diverge when retrievers disagree on what "relevant" means. On SQuAD they largely agree on ranking. On noisy corpora they don't.
H4: Dynamic fusion outperforms fixed weights
| Retriever | Conceptual | Factual |
|---|---|---|
| Dense only | 23.2% | 9.4% |
| BM25 only | 11.9% | 8.1% |
| Fixed 50/50 hybrid | 18.5% | 10.1% |
Fixed fusion is a compromise: it underperforms dense on conceptual questions (18.5% vs 23.2%), and its factual gain (10.1% vs 9.4% for dense) is too small to justify the conceptual cost. Dense dominates on conceptual questions, most likely because BM25 introduces noise that hurts more than it helps.
The architecture this implies: Per-query-type routing. Dense for conceptual queries (where semantic similarity dominates), BM25-weighted for factual (where exact term overlap matters). Fixed fusion averages away the strength of each.
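A minimal sketch of what per-query-type routing could look like, assuming each retriever returns a dict of document ID to score. The query classifier is a deliberately crude, hypothetical heuristic, and the routing weights are illustrative placeholders, not values tuned or measured in these experiments.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(dense_scores, bm25_scores, dense_weight):
    """Weighted sum of normalized scores; dense_weight=0.5 is the fixed 50/50 hybrid."""
    dense_norm, bm25_norm = min_max(dense_scores), min_max(bm25_scores)
    doc_ids = set(dense_norm) | set(bm25_norm)
    return {
        doc_id: dense_weight * dense_norm.get(doc_id, 0.0)
        + (1 - dense_weight) * bm25_norm.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


# Assumption: illustrative weights only. Dense-only for conceptual queries,
# a BM25-leaning blend for factual ones.
ROUTE_WEIGHTS = {"conceptual": 1.0, "factual": 0.3}

FACTUAL_OPENERS = {"who", "when", "where", "which", "what"}


def classify_query(query):
    """Hypothetical heuristic: the first word decides the query type."""
    words = query.lower().split()
    return "factual" if words and words[0] in FACTUAL_OPENERS else "conceptual"


def route(query, dense_scores, bm25_scores):
    """Return doc IDs ranked by the fusion weight chosen for this query type."""
    weight = ROUTE_WEIGHTS[classify_query(query)]
    fused = fuse(dense_scores, bm25_scores, weight)
    # Sort by score descending, breaking ties by doc ID for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))
```

The only difference from fixed fusion is that `dense_weight` becomes a function of the query instead of a constant. In practice the classifier would need to be far more robust than a first-word check, and the weights would need tuning per corpus.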
What this changes
The pre-experiment post made four design-principle claims:
- Measure what matters: holds. nDCG and accuracy can diverge when retrievers disagree on relevance semantics.
- Prefer precision over recall: corpus-dependent. Right for noisy production corpora, wrong for clean ones.
- Treat context as a security boundary: unchanged. The MAD dynamic makes this more important, not less.
- Low-k as primary stabilization: corpus-dependent. On SQuAD, higher k meant higher accuracy.
The core finding: The five-layer design from the architecture post holds. But the operational priorities shift based on corpus characteristics. On clean, dense corpora, maximize recall (high-k). On noisy, diverse corpora, maximize precision (low-k, utility-based metrics). The retriever choice is secondary to knowing your corpus.
What this doesn't change
The fundamental problem the MAD framework identifies is real — it's just corpus-dependent. On SQuAD, the distractor problem doesn't surface because the corpus is engineered to avoid it. In production systems querying diverse, high-entropy data sources, the dynamics flip entirely.
The honest conclusion: the experiment results validate the pre-experiment post's design principles as corpus-specific optimizations, not universal laws. "Precision over recall" is right for noisy environments. "More k helps" is right for clean ones. The utility-based metric argument holds in both — because it measures what actually affects downstream accuracy, not just rank ordering.
The routing architecture — different retrieval strategies for different query types — is the finding that transfers across both corpus types. That's the real signal.
References
- BrowseComp-Plus — Multi-hop QA benchmark for agents
- Mutually Assured Distraction — L. Solbakken, Hornet
- The Power of Noise — Random noise can increase accuracy by increasing attention entropy
- Lost in the Noise — Hard negative distractors dropped accuracy by up to 80%
- UDCG — Utility and Distraction-aware Cumulative Gain metric
- Verifiable APIs for agents — Skip Everling, Hornet