
On hybrid search: why your vector DB is not a search engine

2026-04-29

Every retrieval system I've studied eventually runs into the same wall. Someone drops in a vector database — Pinecone, Weaviate, whatever — fires up semantic search, and declares the RAG problem solved. It isn't. Not because the model is bad, but because the retrieval layer has a fundamental identity crisis.

A vector database stores embeddings. It does not know how to count term frequencies, handle phrase queries, or respect field weights. That's not a bug — it's a different tool. And when you build a search system on a storage format instead of a search engine, you end up with something that feels smart but breaks in predictable, embarrassing ways.

The three failure modes

Dense embeddings miss exact matches. "Show me all documents from project-x" — a semantic search might return documents that are about project-x but contain neither the name nor the identifier. Meanwhile, a traditional inverted index would have matched project-x exactly.
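To make the contrast concrete, here's a toy inverted index over a hypothetical two-document corpus. The lookup either contains the term or it doesn't; there's no embedding geometry involved:

```python
from collections import defaultdict

# Hypothetical corpus, purely illustrative.
docs = {
    "d1": "project-x deployment notes",
    "d2": "notes on the flagship initiative",  # about the project, never names it
}

# Build the inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(index["project-x"])  # {'d1'} -- the identifier matches exactly or not at all
```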

BM25 misses intent. Someone searches "how do I scale the ingestion pipeline" — lexical matching on "scale ingestion pipeline" returns nothing relevant if the document says "handling throughput in data pipelines" instead. The meaning is the same. The words aren't.
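You can see the gap in two lines of Python (hypothetical strings, no stemming or stopword removal):

```python
# Zero term overlap despite shared meaning. Note even
# "pipeline" vs. "pipelines" fails without stemming.
query = set("scale ingestion pipeline".split())
doc = set("handling throughput in data pipelines".split())
print(query & doc)  # set() -- BM25 has no terms to score
```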

Fixed-weight hybrids are fragile. 60/40 dense-to-sparse sounds reasonable until you hit a domain where exact matches dominate. Or a query where semantic similarity is all that matters. The weight that works for your training set is wrong for your production queries.
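For reference, fixed fusion is about five lines. The sketch below assumes scores from both arms have already been normalized to a comparable range; the point is that alpha is a global constant:

```python
def fixed_fusion(dense: dict, sparse: dict, alpha: float = 0.6) -> dict:
    """Merge per-document scores with one weight for every query."""
    docs = set(dense) | set(sparse)
    return {
        d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
        for d in docs
    }

# The same alpha applies to "show me project-x" and to
# "how do I scale the ingestion pipeline" -- that's the fragility.
```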

What dynamic hybrid actually means

Dynamic hybrid is not "run both and merge by score." It's: given this specific query, this specific document corpus, and this specific user intent — what is the right weighting?

Concretely: at query time, you evaluate the retrieval candidates from both the dense and sparse arms, then use a learned model or a calibrated scoring function to merge results. Some implementations use a cross-encoder re-ranker. Some use a lightweight logistic regression. The point is that the fusion decision is made per-query, not fixed at index time.
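Here's a sketch of what that per-query decision can look like. The features and coefficients are hypothetical stand-ins; a real system would fit them on labeled query and relevance data, or skip the logistic model entirely and use a cross-encoder:

```python
import math

def query_alpha(has_identifier: bool, n_terms: int) -> float:
    """Per-query dense weight from a tiny logistic model.
    Features and coefficients here are hypothetical placeholders,
    fit in practice on labeled query/relevance data."""
    z = 0.4 - 2.1 * float(has_identifier) + 0.15 * n_terms
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid, so alpha is in (0, 1)

def dynamic_fusion(dense: dict, sparse: dict, query: str) -> dict:
    # The fusion weight is decided here, per query -- not at index time.
    alpha = query_alpha("-" in query, len(query.split()))
    docs = set(dense) | set(sparse)
    return {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
            for d in docs}
```

With these toy coefficients, a query containing an identifier-like token gets pushed toward the sparse arm, which is the behavior the failure modes above call for.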

The signal that makes this work: you need enough query traffic and relevance feedback to know when dense outperforms sparse and vice versa. For a small corpus, a fixed 50/50 split often beats a learned fusion — there's not enough variation to learn from. Scale changes the calculus.

What I keep coming back to

The interesting insight from the research I'm tracking: the retrieval signal that matters most is often not raw semantic similarity or raw BM25 score. It's the difference between the top-ranked result and the second-ranked result in each arm. A large gap means the arm is confident. A small gap means the query is ambiguous — which is exactly when you want the other arm to compensate.
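A sketch of that signal, assuming both arms return ranked (doc_id, score) lists sorted best-first, with scores normalized to a shared range — an assumption, not a given:

```python
def margin(ranked: list[tuple[str, float]]) -> float:
    """Top-1 minus top-2 score: a crude per-arm confidence."""
    if len(ranked) < 2:
        return 0.0
    return ranked[0][1] - ranked[1][1]

def margin_weighted_fusion(dense_ranked, sparse_ranked):
    d_gap, s_gap = margin(dense_ranked), margin(sparse_ranked)
    # Lean on whichever arm is confident; fall back to 50/50 when
    # both gaps are small and neither arm has earned trust.
    alpha = d_gap / (d_gap + s_gap) if (d_gap + s_gap) > 0 else 0.5
    dense, sparse = dict(dense_ranked), dict(sparse_ranked)
    docs = set(dense) | set(sparse)
    return {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
            for d in docs}
```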

This is the kind of thing that doesn't show up in a "we added hybrid search" announcement. It shows up when you're debugging why your recall dropped by 8% after the last index update, and you realize the new documents changed the score distribution in a way your fixed weights couldn't handle.

The practical takeaway: if you're running a vector DB in production and not monitoring dense vs. sparse retrieval quality separately, you're flying half blind.
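A minimal version of that monitoring, assuming you log each arm's retrieved IDs separately and have some source of relevance labels (judgments, click data):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of relevant docs found in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

# Track this per arm as separate time series. A regression in one arm
# hides inside a blended metric until it's big enough to hurt overall recall.
```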