
Retrieval for agents: Why the classic stack breaks

2026-04-29

On BrowseComp-Plus, a multi-hop question answering benchmark for agents, two numbers sit next to each other: 93% accuracy when the model is handed the evidence it needs, and 14% when the same model has to find that evidence through search.

That gap is not a reasoning failure. It's a retrieval failure. The reasoning model is the same; the evidence access is the difference. Give the model what it needs and it answers correctly. Make it find that evidence through search and accuracy collapses.

This is the core empirical observation that reframes everything about agentic retrieval.

The human search metaphor is wrong

Classic search is built on the human scanning model. A user sees a ranked list, skims titles and snippets, clicks what looks right, skips what doesn't. Irrelevant results are discarded cheaply — you read the next line and move on.

LLM-based agents can't do that. When a language model processes input, every token competes for probability mass through self-attention. There is no "skip" and no "ignore." The retrieved documents don't appear as options to consider — they are ingested and actively shape the model's probabilistic reasoning. A plausible but wrong document doesn't just sit there looking suspicious; it becomes a premise the model reasons from.
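
A toy softmax calculation makes this concrete (illustrative numbers, not from a real model): every retrieved token keeps a share of attention, and a plausible distractor can claim almost as much as the right evidence.

```python
import numpy as np

# Minimal sketch: softmax attention assigns every key a nonzero weight, so
# "irrelevant" retrieved text still pulls on the output instead of being skipped.
scores = np.array([4.0, 3.6, 1.0])   # query·key scores: gold doc, plausible distractor, filler
weights = np.exp(scores) / np.exp(scores).sum()
print(weights)  # ~[0.58, 0.39, 0.03]; the distractor captures ~40% of the attention mass
```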

This means the failure mode for agentic retrieval is categorically different from human search. A bad result for a human is a wasted click. A bad result for an agent is corrupted reasoning.

Mutually assured distraction

Hornet's Lester Solbakken named the dynamic precisely: mutually assured distraction (MAD).

The insight: as retrievers improve, their mistakes get better too. A strong retriever surfaces semantically adjacent, high-confidence, plausible-sounding documents. These are the most dangerous outputs — they survive ranking because they score well on similarity signals. They look like evidence.

In human search this is fine; humans learn from experience to discount results that merely look plausible. In agentic pipelines this is catastrophic:

  1. The distractor enters the context window as a validated premise
  2. The model's reasoning chain incorporates it
  3. If the next retrieval step builds on prior output, the error compounds

The MAD dynamic: better retrievers produce more convincing distractors; better reasoners trust those distractors more deeply. Both sides improve; both sides lose.

There's an uncomfortable implication from "The Power of Noise": random noise can actually help accuracy by increasing attention entropy, preventing the model from latching onto any single source. Distractors, text that is plausible but irrelevant, do the opposite. They look credible, so attention sharpens around the wrong evidence.

And from "Lost in the Noise": hard negative distractors dropped accuracy by up to 80% in multi-hop and tool-use scenarios. Not because the model is weak, but because the context is wrong.

This is why you cannot reason your way out of bad context. You cannot compute your way out of distraction.

The metrics are measuring the wrong thing

Classic Information Retrieval metrics were built for human search: nDCG, MAP, MRR. They share two assumptions that break for agents:

Monotonic decay — Traditional metrics assume document value decreases smoothly with rank position. A document at rank 1 is much more valuable than at rank 5. This models a human scanning top to bottom.

Harmless zeroes — Irrelevant documents get zero utility: a "miss" but not a penalty. This models a human ignoring a result.

For agents, both assumptions fail. An LLM ingests the entire retrieved bundle at once, not a list it scans. And "irrelevant" is not one thing — some passages are actively harmful. A retriever can score high on nDCG while consistently injecting distractors that tank downstream reasoning. The metric improves while the system gets worse.
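
A small sketch shows the blind spot. With standard binary relevance labels, a harmless off-topic passage and a convincing distractor both count as zero, so two retrieval runs with very different downstream effects get identical nDCG (toy example, not drawn from any of the cited papers):

```python
import math

def dcg(gains):
    # Discounted cumulative gain: each document's gain is discounted by log2(rank + 1).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Binary relevance labels can only say "relevant" or "not relevant".
run_a = [1, 0, 0, 0, 0]  # one gold passage plus harmless off-topic misses
run_b = [1, 0, 0, 0, 0]  # one gold passage plus convincing distractors: same labels
print(ndcg(run_a), ndcg(run_b))  # identical scores; the harm is invisible to the metric
```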

This is the incentive misalignment at the heart of MAD. If your metric can't penalize harm, the locally rational strategy is to push recall as high as possible. Higher recall almost always means lower precision. You accept more borderline candidates to avoid missing anything. Those borderline candidates are the perfect distractors.

The response from "Redefining Retrieval Evaluation in the Era of LLMs": UDCG (Utility and Distraction-aware Cumulative Gain). Assign negative utility to distractors — passages that make the model answer when it should abstain. Utility is derived from model behavior: positive if it helps the model answer correctly, negative if it causes a confident wrong answer.

The result: UDCG correlates up to 36% better with end-to-end answer accuracy than classic IR metrics. And it exposes the key shift for agentic retrieval: precision is more valuable than recall. A miss costs less than a distractor.
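
A minimal sketch of the idea (in the spirit of UDCG, not the paper's exact formulation): give each passage a behavior-derived utility instead of a relevance label, so distractors subtract from the score instead of silently costing nothing.

```python
import math

def utility_gain(utilities):
    # Rank-discounted sum of per-passage utilities: +1 if a passage helps the model
    # answer correctly, -1 if it causes a confident wrong answer, 0 if it is inert.
    return sum(u / math.log2(i + 2) for i, u in enumerate(utilities))

run_a = [1, 0, 0, 0, 0]    # gold passage plus harmless misses
run_b = [1, 0, -1, -1, 0]  # same gold passage plus two convincing distractors
print(round(utility_gain(run_a), 2))  # 1.0
print(round(utility_gain(run_b), 2))  # 0.07: the distractors now show up in the score
```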

What verifiable APIs change

Hornet's Skip Everling frames the solution as a feedback loop problem. The insight: the same property that makes code learnable through RL — verifiability — is what retrieval needs for agents.

A coding agent succeeds because it can observe failure. Give it a compiler, a test suite, and a way to verify output, and it can self-correct through the error signal. The agent tries something, sees that it failed, reads the error, and corrects. The feedback loop is the learning mechanism.

Retrieval becomes learnable when the same is true: when an agent can observe that a configuration is wrong, see why it's wrong, and correct it. Hornet builds this through three verification levels:

Syntactic validation — APIs defined by an OpenAPI specification. Agents create schemas that compile correctly; frontier LLMs are already excellent at producing syntactically correct configuration.

Semantic validation — Deeper validation across combinations. Some settings can't be used together; syntax checks can't catch this. A configuration model captures which combinations are allowed and returns concrete, detailed error messages so the agent can self-correct.

Behavioral validation — Does the engine behave as expected? Do the right documents appear, ranked correctly? This is the hardest level because "correct" is often subjective. But by making quality metrics observable and comparable, agents can not only query Hornet but also tune relevance, adjust recall/latency tradeoffs, and safely roll out production changes.

With behavioral validation, an agent that notices its retrieval keeps missing recent policy updates can adjust its query configuration, test against known-good results, and deploy the fix — without a human in the loop.
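
A hedged sketch of what that loop can look like in code. The names (IndexConfig, validate, behaves) and the rules they check are illustrative, not Hornet's actual API; the point is that each level returns something concrete the agent can act on.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class IndexConfig:
    mode: str                              # "bm25", "vector", or "hybrid"
    embedding_model: Optional[str] = None

def validate(cfg: IndexConfig) -> list[str]:
    """Syntactic and semantic checks that return concrete, fixable errors."""
    errors = []
    if cfg.mode not in {"bm25", "vector", "hybrid"}:                    # syntactic
        errors.append(f"mode must be one of bm25|vector|hybrid, got {cfg.mode!r}")
    if cfg.mode in {"vector", "hybrid"} and not cfg.embedding_model:    # semantic
        errors.append("vector/hybrid mode requires embedding_model to be set")
    return errors

def behaves(cfg: IndexConfig,
            search: Callable[[IndexConfig, str], list[str]],
            golden: dict[str, set[str]]) -> bool:
    """Behavioral check: known-good queries must return the expected documents."""
    return all(expected <= set(search(cfg, q)) for q, expected in golden.items())

# The agent's loop: propose a configuration, read the errors, correct, verify.
cfg = IndexConfig(mode="hybrid")               # missing its embedding model
for err in validate(cfg):
    print("fixable:", err)
cfg.embedding_model = "example-embedding-model"
assert validate(cfg) == []                     # now syntactically and semantically valid
```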

Design principles for agentic retrieval

Measure what matters — Replace nDCG with utility-based metrics that penalize distractors. If you measure relevance but ignore harm, you'll optimize for recall and produce fragility.

Prefer precision over recall — A miss costs less than a distractor. In agentic pipelines, the cost of a plausible wrong document is higher than the cost of failing to retrieve a relevant one. Dynamic-k retrieval operationalizes this: pull until a sufficiency threshold is met, and stop when the next candidate doesn't increase confidence (sketched in code after this list).

Treat context as a security boundary — The same architectural vulnerability that enables accidental distraction enables prompt injection. Untrusted content that enters the context window can shape behavior in ways the model can't reliably distinguish from legitimate instruction. Defensive retrieval is also defensive security.

Make abstention a first-class outcome — "Insufficient evidence" should be a control signal, not a fallback to weak reasoning. Trigger a retry with a different query, different retrieval strategy, or different tool. Forcing the model to answer from inadequate context is how agentic collapse starts.

Low-k as primary stabilization — With fewer documents, each one carries more weight. This limits how much distracting context can enter, keeps the attention budget concentrated, and reduces the surface area for a near-miss to seed a long failure chain. The tradeoff: with low-k, your retriever's precision has to be extremely high.
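
A minimal sketch of dynamic-k retrieval with abstention as a first-class outcome. The thresholds and the answer/abstain convention are illustrative, not taken from any specific system.

```python
def retrieve_dynamic_k(ranked, sufficiency=2.0, min_gain=0.15, max_k=5):
    """ranked: (doc, score) pairs from the retriever, best first.

    Pull documents until the evidence looks sufficient, stop when the next
    candidate adds almost nothing, and abstain rather than pad the context.
    """
    context, total = [], 0.0
    for doc, score in ranked[:max_k]:
        if context and score < min_gain:   # next candidate adds ~nothing: stop
            break
        context.append(doc)
        total += score
        if total >= sufficiency:           # enough evidence: keep k low and answer
            return context, "answer"
    return context, "abstain"              # control signal: retry with a different
                                           # query, strategy, or tool
```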

The architecture this implies

The five-layer design still holds, but the priorities shift given what MAD exposes:

Layer 1: Query understanding and decomposition
  → Agents produce longer, programmatic queries; keyword strategies underperform
  → Query intent matters more than query length

Layer 2: Multi-index retrieval with fusion
  → Multiple retrieval strategies in parallel; combine with learned weights
  → Score functions must distinguish "useful" from "harmful" not just "relevant"

Layer 3: Grounding and verification
  → Syntactic, semantic, and behavioral validation at retrieval time
  → Re-ranking against utility metrics, not just relevance

Layer 4: Context window management
  → Low-k, sufficiency-based retrieval; dynamic document count
  → Abstention triggers retry, not fallback reasoning

Layer 5: Feedback and learning
  → Agent observes retrieval outcomes, adjusts configuration autonomously
  → Behavioral metrics as the learning signal, not IR metrics
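
Read end to end, the layers compose into one loop. The stub below is only a shape sketch; every function name, stub body, and threshold is invented for illustration.

```python
def decompose(question):                     # Layer 1: query understanding
    return [question]                        # e.g. split a multi-hop question into sub-queries

def fused_retrieve(sub_queries):             # Layer 2: multi-index retrieval with fusion
    return [(f"doc for {q}", 0.9) for q in sub_queries]

def verify(candidates):                      # Layer 3: keep by utility, not just similarity
    return [(d, s) for d, s in candidates if s > 0.5]

def assemble(verified, sufficiency=0.8):     # Layer 4: low-k, sufficiency-based
    context = [d for d, _ in verified[:3]]
    enough = sum(s for _, s in verified[:3]) >= sufficiency
    return context, ("answer" if enough else "abstain")

def answer_with_retrieval(question, max_attempts=3):
    for _ in range(max_attempts):
        context, decision = assemble(verify(fused_retrieve(decompose(question))))
        if decision == "answer":
            return context                   # hand this context to the model; Layer 5 logs the outcome
        # abstention triggers a retry with a different strategy, not fallback reasoning
    return "insufficient evidence"
```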

The key shift: retrieval for agents is not a search problem. It's a context quality problem. The question is not "did we find relevant documents?" It's "does this context enable the model to answer correctly — and if not, what's the cost of the wrong answer?"

What this means to build

If the 93% → 14% gap is real, then building for agents means treating retrieval quality, not reasoning power, as the constraint that matters.

The 93% number is the ceiling for reasoning. The 14% is the floor with weak retrieval. Everything interesting happens in between — and it's mostly a retrieval design problem.


References