In RAG, Retrieval Quality Beats Generation Quality Per Dollar
Upgrading the LLM is the most visible decision in a RAG pipeline. Upgrading retrieval usually moves output quality more, and for less money.
A mediocre retriever is worse than no retriever at all. In the Cuconasu et al. (2024) experiments, inserting random documents into the LLM context improved accuracy by up to 35% in some settings, while inserting the retriever's confident-but-wrong passages degraded it. That finding inverts the usual mental model of RAG. Retrieval is not a neutral preprocessor you upgrade later; it bounds everything downstream, and does the most damage when mediocre. The argument across the hybrid search chapters is that retrieval, not the LLM, is where each engineering dollar moves output quality the most.
The evidence
Coupling a neural retriever with a sequence-to-sequence generator produces more factual, specific, and diverse outputs than parametric-only baselines. In the original RAG fact-verification evaluation, a gold evidence article appeared in the top 10 retrieved articles in 90% of cases, and generation quality tracked retrieval accuracy at that rate (Lewis et al., 2020). In a study of three real-world RAG deployments across research, education, and biomedical domains, four of seven identified failure points were retrieval-related: missing content in the index, relevant documents not reaching the top ranks, failure to consolidate across passages, and extraction errors from retrieved text. Only three originated in the generation stage (Barnett et al., 2024).
The asymmetry is structural. A generation model can only work with the context it receives. If retrieval surfaces the wrong passages, no amount of prompt engineering can recover the answer; if it surfaces the right ones, a mid-tier LLM often does fine.
Retrieved-but-irrelevant is worse than random
The Cuconasu finding reframes the RAG retrieval objective, but not as a simple shift to precision. The LLM treats confidently retrieved passages as authoritative, which amplifies any precision failure. At the same time, any answer-bearing passage that never enters the candidate set cannot be recovered downstream. Recall@100 at the first stage sets an absolute ceiling on generation quality.
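That ceiling is easy to make concrete. Recall@k is just the fraction of answer-bearing passages that make it into the top-k candidate pool; any passage missing from the pool is unrecoverable no matter how good the reranker or the LLM is. A minimal sketch (toy passage IDs, not from any real corpus):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of answer-bearing passages present in the top-k candidates."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Toy example: 3 answer-bearing passages exist, only 2 entered the pool.
retrieved = ["p7", "p2", "p9", "p4", "p1"]
relevant = {"p2", "p1", "p8"}
print(recall_at_k(retrieved, relevant, k=5))
```

Here p8 never entered the candidate set, so whatever the downstream stages do, at most two thirds of the evidence can reach the LLM.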
The resolution is a two-stage pipeline. Stage one, hybrid retrieval, is explicitly recall-first: pull 50 to 100 candidates from BM25 and vector legs in parallel so that every passage needed to answer the query is somewhere in the pool, even at the cost of including irrelevant ones. Stage two, cross-encoder reranking, is where precision is recovered: aggressively filter that pool down to 5 to 10 passages, pushing hard negatives (topically related but factually irrelevant passages) below the cutoff so they never reach the LLM. A high-recall retriever with a loose cross-encoder reranker produces worse RAG output than the same retriever paired with a stricter one. You need both halves.
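The shape of that pipeline fits in a few lines. The sketch below shows only the control flow; `bm25_search`, `vector_search`, and `cross_encoder_score` are hypothetical stand-ins for real backends, passed in as callables so the recall-first union and the precision-second cut are visible on their own:

```python
def hybrid_candidates(query, bm25_search, vector_search, pool_size=100):
    """Stage one, recall-first: union both legs into one candidate pool.
    Irrelevant candidates are acceptable here; missing ones are not."""
    pool = {}
    for doc_id, score in bm25_search(query, k=pool_size):
        pool[doc_id] = max(pool.get(doc_id, 0.0), score)
    for doc_id, score in vector_search(query, k=pool_size):
        pool[doc_id] = max(pool.get(doc_id, 0.0), score)
    return list(pool)

def rerank(query, candidates, cross_encoder_score, keep=8):
    """Stage two, precision-second: aggressively cut the pool so hard
    negatives fall below the cutoff and never reach the LLM context."""
    ranked = sorted(candidates,
                    key=lambda doc_id: cross_encoder_score(query, doc_id),
                    reverse=True)
    return ranked[:keep]
```

The asymmetry in the comments is the whole design: stage one is allowed to be sloppy in one direction only (extra candidates), stage two in the other (aggressive cuts), and neither stage compensates for the other's failure mode.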
What hybrid adds in the RAG setting
RAG retrieval is usually pitched as a pure-vector problem. That view understates the class of queries only hybrid retrieval handles cleanly. BM25 catches exact entity matches, numerical constraints, versioned identifiers, and other tokens where substitution is not acceptable, precisely the queries where the exact-match degradation covered in the post on vector search failure modes shows up. Vector retrieval catches semantic paraphrases, synonym variation, and rewordings where the surface tokens never align. Running only one leg risks missing a substantial fraction of relevant passages the other leg would have caught, and those misses are absorbed into the recall ceiling before reranking ever runs.
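Combining the two legs does not require comparing their raw scores, which live on different scales. Reciprocal rank fusion, one common choice (the document IDs below are illustrative), merges the ranked lists using only rank positions:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1/(k + rank) per
    document, so raw BM25 and cosine scores never need to be compared."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

bm25_leg = ["sku-4411", "faq-2", "guide-7"]   # exact identifier match wins
vector_leg = ["guide-7", "blog-3", "faq-2"]   # paraphrase match wins
print(rrf_fuse([bm25_leg, vector_leg]))       # both-leg documents rise
```

Documents found by both legs (guide-7, faq-2) accumulate two reciprocal-rank terms and float to the top, while each leg's exclusive finds stay in the pool, which is exactly the recall-first behavior stage one needs.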
The unit of retrieval in RAG is also different: chunks, not documents. Chunking size and overlap interact with lexical and dense signals in different ways. BM25 depends on term frequency and length normalization, so short chunks amplify each term occurrence while long chunks dilute it. Vector retrieval compresses a chunk into a single embedding regardless of length. The same chunk granularity can produce very different recall on each leg, so chunking has to be tuned against both.
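The length-normalization effect is visible directly in the Okapi BM25 term formula. The numbers below are illustrative (a 200-token average chunk length, default k1 and b), but the direction of the effect is the formula's own:

```python
def bm25_term_score(tf, doc_len, avg_len, idf=1.0, k1=1.2, b=0.75):
    """Okapi BM25 contribution of a single term to a chunk's score."""
    length_norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)

# One occurrence of the same term, 50-token chunk vs 400-token chunk,
# with an average chunk length of 200 tokens.
short_chunk = bm25_term_score(tf=1, doc_len=50, avg_len=200)
long_chunk = bm25_term_score(tf=1, doc_len=400, avg_len=200)
print(short_chunk, long_chunk)  # the short chunk scores roughly 2x higher
```

An embedding model sees no analogous penalty: the 400-token chunk compresses into one vector just like the 50-token one. The same chunking decision therefore shifts BM25 recall and vector recall in different directions, which is why the two have to be tuned together.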
The unresolved tension
The diagnosis is now clear: retrieval quality bounds RAG output, and hard negatives are not just unhelpful, they are actively harmful. What is not yet clear is how to turn that diagnosis into a design. A recall-first hybrid retriever followed by an aggressive cross-encoder reranker is the right shape, but shape is not enough. Chunking interacts differently with BM25 and vector retrieval, and no controlled study has isolated the effect across both legs simultaneously. LLMs attend in a U-shaped pattern across the context window, which means a passage at rank 2 and one at rank 19 may be read equally well while a passage at rank 10 is ignored, a property no standard ranking metric captures. Faithfulness and distractor penalty, the two dimensions that matter most in RAG, are invisible to NDCG and MAP, which is why frameworks like RAGAS, RAGChecker, and metrics like UDCG exist at all.
So the question is not whether retrieval matters. The question is what a retrieve-rerank-stuff pipeline looks like when every stage is tuned for a consumer that is not a human, evaluated against metrics your search team does not already run. That is where the design work begins.
Related chapter
Chapter 18: Hybrid Search for RAG Pipelines
Choosing an LLM matters far less in a RAG system than ensuring the retrieval layer surfaces the right passages at the right granularity. This chapter adapts the general hybrid retrieval architecture to the RAG setting, where the retrieval unit becomes a chunk rather than a full document, the downstream consumer is an LLM instead of a human skimming a result list, and the success bar moves from ranking quality toward generation faithfulness and groundedness.
Laszlo Csontos
Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.