The Essential RAG Book
Multi-Stage Retrieval
Key Takeaways
- Multi-Stage Retrieval decomposes retrieval into a fast candidate generation phase followed by a high-precision reranking phase.
- A single retriever must trade off recall against precision; a multi-stage design separates the two concerns.
Multi-Stage Retrieval decomposes retrieval into a fast candidate generation phase followed by a high-precision reranking phase. The first stage maximizes recall using inexpensive models and large fan-out; the second stage maximizes precision using compute-heavy cross-encoders or late-interaction scoring. This design is standard in web search and adapts well to RAG.
- Stage 1: Candidate Generation (High Recall)
  - Sparse BM25 / Keyword
  - Dense ANN (HNSW / IVF)
  - Filters: time, tags, ACL
  - Output: top-N docs
- Stage 2: Reranker (High Precision)
  - Cross-Encoder score(q, d)
  - Late Interaction (e.g., ColBERT)
  - Diversification (MMR)
  - Output: top-k contexts, passed to the Generator (LLM)
Figure 7 – Two-stage retrieval: recall-first candidate generation followed by precision reranking.
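As a rough sketch of Stage 1 in Figure 7, the snippet below merges a sparse and a dense result list into one candidate set. Everything named here is an assumption made for illustration: the `Hit` record, min-max normalization, the weighted sum controlled by `sparse_weight`, and the `allowed_ids` set standing in for time, tag, and ACL filters. It also assumes dense scores are similarities (higher is better) and leaves out shard handling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    score: float  # raw retriever score; BM25 and ANN scales are not comparable


def min_max_normalize(hits: list[Hit]) -> dict[str, float]:
    """Rescale one retriever's scores to [0, 1], keeping the best score per doc_id."""
    best: dict[str, float] = {}
    for hit in hits:
        if hit.doc_id not in best or hit.score > best[hit.doc_id]:
            best[hit.doc_id] = hit.score
    if not best:
        return {}
    lo, hi = min(best.values()), max(best.values())
    if hi == lo:
        # All scores equal: treat every document as equally strong.
        return {doc_id: 1.0 for doc_id in best}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in best.items()}


def generate_candidates(
    sparse_hits: list[Hit],
    dense_hits: list[Hit],
    allowed_ids: set[str],
    top_n: int = 100,
    sparse_weight: float = 0.5,
) -> list[tuple[str, float]]:
    """Filter, normalize, union, and de-duplicate two result lists; return top-N."""
    # Apply visibility filters first so hidden documents never influence scores.
    sparse = min_max_normalize([h for h in sparse_hits if h.doc_id in allowed_ids])
    dense = min_max_normalize([h for h in dense_hits if h.doc_id in allowed_ids])

    # Union by doc_id; a document missing from one list contributes 0 for it.
    fused = {
        doc_id: sparse_weight * sparse.get(doc_id, 0.0)
        + (1.0 - sparse_weight) * dense.get(doc_id, 0.0)
        for doc_id in sparse.keys() | dense.keys()
    }
    # Sort by fused score, breaking ties by doc_id for deterministic output.
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


sparse_results = [Hit("a", 12.0), Hit("b", 8.0), Hit("c", 4.0)]
dense_results = [Hit("b", 0.9), Hit("d", 0.7), Hit("a", 0.5), Hit("x", 0.95)]
candidates = generate_candidates(
    sparse_results, dense_results, allowed_ids={"a", "b", "c", "d"}, top_n=3
)
print([doc_id for doc_id, _ in candidates])
```

A weighted sum of normalized scores is only one fusion choice; rank-based schemes such as reciprocal rank fusion avoid score normalization altogether and may suit retrievers whose score distributions differ widely.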
Why multi-stage. A single retriever must trade off recall against precision. Multi-stage designs separate the two concerns: use broad, cheap retrieval to avoid missing relevant evidence, then apply expensive scoring to the narrowed set. Because the expensive scorer sees only a bounded number of candidates, latency stays under control, and because fewer, more relevant passages reach the generator, token cost drops and groundedness improves.
Candidate generation. Combine sparse (BM25) and dense ANN indexes with permissive filters to yield N = 100 to 1,000 candidates. Normalize the scores so the two retrievers are comparable, union the results, and de-duplicate by document ID and shard. Time and ACL filters restrict which documents are visible to the query.
Reranking. Apply a cross-encoder f(q, d) that jointly attends to the query and the passage, or a late-interaction model that computes max-sim over token embeddings (for each query token, the maximum similarity to any passage token, summed over the query tokens). Include diversification (e.g., MMR) to reduce redundancy and improve coverage of subtopics.
Practical tips. Tune the fan-out (N) and the final top-k per domain, and log recall@k against labeled sets. Cache cross-encoder scores, and use approximate rerankers for speed-sensitive tiers. Monitor how many tokens the retrieved contexts add to the generation prompt.
When to use. Choose multi-stage retrieval when corpora are large, heterogeneous, or noisy; when baseline dense-only systems miss edge cases; or when precision is critical (support escalations, legal, medical).
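The diversification step and the recall@k logging mentioned above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `relevance` is assumed to hold reranker scores (for example from a cross-encoder) already scaled to roughly [0, 1], `embeddings` is a hypothetical mapping from document ID to a passage vector, and `lam` is the usual MMR trade-off weight between relevance and redundancy.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(
    relevance: dict[str, float],
    embeddings: dict[str, list[float]],
    k: int,
    lam: float = 0.7,
) -> list[str]:
    """Greedy Maximal Marginal Relevance: pick k documents one at a time.

    Each step picks the document maximizing
    lam * relevance - (1 - lam) * (max similarity to anything already picked).
    """
    selected: list[str] = []
    remaining = set(relevance)

    def mmr_score(doc_id: str) -> float:
        redundancy = max(
            (cosine(embeddings[doc_id], embeddings[s]) for s in selected),
            default=0.0,  # nothing selected yet, so no redundancy penalty
        )
        return lam * relevance[doc_id] - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        # Iterate in sorted order so ties resolve deterministically by doc_id.
        best = max(sorted(remaining), key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the labeled relevant documents found in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


# Hypothetical data: "a" and "b" are near-duplicates, "c" covers another subtopic.
scores = {"a": 0.9, "b": 0.85, "c": 0.6}
vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
contexts = mmr_select(scores, vectors, k=2)
print(contexts, recall_at_k(contexts, {"a", "c"}, k=2))
```

Setting `lam` to 1.0 reduces the selection to a plain sort by reranker score, which gives a convenient baseline when measuring whether diversification helps or hurts recall@k on a labeled set.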


