
Retrieval for agents is a different stack than retrieval for humans

2026-04-29

I've been running a lot of retrieval experiments lately, both for myself and for the IndexZero course, where students build IR systems from scratch. The thing that's becoming obvious: the retrieval patterns that work for humans don't always work for agents, and the difference matters more as you move from demos into production.

Here's what I mean.

The human retrieval assumption

When you build RAG for a human reading the results, you're optimizing for comprehension. The retrieved chunk needs to make sense standalone, have enough context to be understood, and be cohesive enough that the human can act on it. You chunk around paragraphs, you add overlap to capture cross-references, you trade precision for recall because a human can scan and discard.
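To make the baseline concrete, here's a minimal sketch of that kind of chunking. The sizes and overlap ratio are illustrative, not recommendations.

```python
# The "human" chunking baseline: split on paragraphs, then carry a tail of the
# previous chunk forward as overlap so cross-references near the boundary survive.

def chunk_for_humans(text: str, max_chars: int = 1200, overlap: float = 0.2) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            tail = current[-int(max_chars * overlap):]  # overlap from the previous chunk
            current = tail + "\n\n" + para
        else:
            current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks
```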

That model breaks down for agents, in a few specific ways.

Where it falls apart

Agents need factual grounding, not coherent explanation. A human searching "project-x status" is satisfied with a paragraph that contextualizes the answer. An agent searching the same thing needs the actual status value, the date it was recorded, and the field name it lives in — because it will use that as a tool argument, not display it as text. Vector similarity on narrative text misses this because the most semantically similar paragraph is usually the most explanatory, not the most factual.
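To make that gap concrete: the shapes below are hypothetical (the record fields and the update_tracker tool are made up for this sketch), but they show the difference between what the two consumers need.

```python
from dataclasses import dataclass
from datetime import date

# What narrative RAG returns: a chunk a human would be perfectly happy with.
chunk = (
    "Project X has been through a bumpy quarter. After the vendor delay in "
    "March the team re-planned, and as of mid-April the effort is back on track."
)

# What the agent needs: the value, the field it lives in, and when it was
# recorded, because it will pass these to a tool rather than read them.
@dataclass
class FactRecord:
    entity: str
    field: str
    value: str
    recorded_at: date

fact = FactRecord(entity="project-x", field="status",
                  value="on_track", recorded_at=date(2026, 4, 14))

# e.g. update_tracker(project=fact.entity, status=fact.value, as_of=fact.recorded_at)
```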

Chunk boundaries hit agents harder. RAG chunking is a human-readable hack — split at paragraphs, add 20% overlap, ship it. When an agent processes a chunk, it loses whatever context was on the other side of the boundary. A human reader recovers that from their own world model. An agent that only sees the chunk has no idea what preceded it. This is why agent memory architectures are increasingly moving toward whole-document retrieval with selective extraction, rather than pre-chunked passages.
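A minimal sketch of that shape, assuming a document store keyed by id. The sentence scoring here is a crude stand-in for whatever extraction step you'd actually use, such as an LLM extraction call or a cross-encoder.

```python
# Whole-document retrieval with selective extraction: fetch the full document,
# then pull out only the spans relevant to the query, so nothing is lost to a
# pre-chosen chunk boundary.

def extract_relevant(document: str, query: str, top_n: int = 3) -> list[str]:
    sentences = [s.strip() for s in document.replace("\n", " ").split(".") if s.strip()]
    query_terms = set(query.lower().split())

    def score(sentence: str) -> int:
        return len(query_terms & set(sentence.lower().split()))

    return sorted(sentences, key=score, reverse=True)[:top_n]

def retrieve_for_agent(doc_store: dict[str, str], doc_id: str, query: str) -> list[str]:
    full_doc = doc_store[doc_id]  # the whole document, not a pre-cut chunk
    return extract_relevant(full_doc, query)
```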

Agents propagate errors downstream in non-obvious ways. A human who gets a slightly wrong answer pauses and re-reads. An agent that gets a slightly wrong retrieval result often produces a confident, coherent, and wrong answer — because the error propagated through a reasoning chain that obscured the source. This is why retrieval quality metrics for agents need to measure downstream task accuracy, not just top-k hit rate.

The practical pattern I'm seeing

The best agent retrieval systems I'm studying share a structure: a parametric memory layer that decides what to retrieve, a functional retrieval layer that does the fetching, and a grounding layer that validates retrieved content against known facts. This is different from classic RAG, which is mostly "embed query, return chunks."
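Sketched as interfaces, that structure looks roughly like this. The layer names and the split are my framing of the pattern, not a reference implementation.

```python
from typing import Protocol

class MemoryLayer(Protocol):
    def should_retrieve(self, generation_state: str) -> bool: ...
    def build_query(self, generation_state: str) -> str: ...

class RetrievalLayer(Protocol):
    def fetch(self, query: str) -> list[dict]: ...

class GroundingLayer(Protocol):
    def validate(self, candidates: list[dict]) -> list[dict]: ...

def agent_retrieve(memory: MemoryLayer, retriever: RetrievalLayer,
                   grounder: GroundingLayer, state: str) -> list[dict]:
    if not memory.should_retrieve(state):                     # decide whether to retrieve at all
        return []
    candidates = retriever.fetch(memory.build_query(state))   # do the fetching
    return grounder.validate(candidates)                      # keep only facts that check out
```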

The parametric memory head — there's a recent paper on this, A Parametric Memory Head for Continual Generative Retrieval — makes retrieval decisions throughout the generation process, triggered by uncertainty signals, by mentions of entities that haven't been grounded yet, by task boundaries. More compute at runtime, but it avoids the "retrieve once, propagate error" failure mode.
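To be clear about what's mine versus the paper's: the sketch below is just the triggering idea spelled out in plain code. The signals and the threshold are made up for illustration, not the paper's mechanism.

```python
# Retrieval as a decision made repeatedly during generation, not once up front.

def should_trigger_retrieval(token_uncertainty: float,
                             mentioned_entities: set[str],
                             grounded_entities: set[str],
                             at_task_boundary: bool,
                             uncertainty_threshold: float = 0.35) -> bool:
    if token_uncertainty > uncertainty_threshold:  # the model is unsure of what comes next
        return True
    if mentioned_entities - grounded_entities:     # an entity appeared that was never grounded
        return True
    if at_task_boundary:                           # a new sub-task is starting
        return True
    return False
```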

What this means for systems being built now

If you're building a retrieval system that will serve agents — not just answer questions from humans — the design points are different:

Store structured facts, not narrative chunks. A product spec database with field-level retrieval will serve an agent better than a vector database of product description paragraphs.

Measure task accuracy, not chunk recall. The evaluation dataset for your retrieval system should include downstream agent tasks, not just human satisfaction scores. There's a sketch of what that comparison looks like below.

Plan for multi-stage retrieval. Agents benefit from a recall stage — high recall, loose threshold — followed by a grounding stage that checks retrieved facts against a known-good source. Single-stage retrieval fails on the error propagation problem.
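Picking up the measurement point from above: a minimal sketch of scoring retrieval by whether the agent completes the task, next to the usual top-k hit rate. run_agent_task is a stand-in for whatever agent loop you're actually evaluating.

```python
# Top-k hit rate: did any relevant chunk make it into the results?
def top_k_hit_rate(results: list[list[str]], relevant_ids: list[set[str]]) -> float:
    hits = sum(1 for res, rel in zip(results, relevant_ids) if rel & set(res))
    return hits / len(results)

# Downstream task accuracy: did the agent, using this retrieval system, actually
# get the task right? Each row pairs a task with its expected outcome.
def downstream_task_accuracy(tasks: list[dict], run_agent_task) -> float:
    correct = sum(1 for t in tasks if run_agent_task(t["task"]) == t["expected"])
    return correct / len(tasks)
```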

The RAG stack was built for human readers. Agent-first retrieval is a different problem, and I think the teams that figure out the agent-native retrieval patterns are going to have a real advantage — because the naive approach is already hitting its limits in production.

