In RAG, Retrieval Quality Beats Generation Quality Per Dollar
Upgrading the LLM is the most visible decision in a RAG pipeline. Upgrading retrieval usually moves output quality more, and for less money.
A mediocre retriever is worse than no retriever at all. In the Cuconasu et al. (2024) experiments, inserting random documents into the LLM context improved accuracy by up to 35% in some settings, while inserting the retriever's confident-but-wrong passages degraded it. That finding inverts the usual mental model of RAG. Retrieval is not a neutral preprocessor you upgrade later; it bounds everything downstream, and does the most damage when mediocre. The argument across the hybrid search chapters is that retrieval, not the LLM, is where each engineering dollar moves output quality the most.
The evidence
Coupling a neural retriever with a sequence-to-sequence generator produces more factual, specific, and diverse outputs than parametric-only baselines. In the original RAG fact-verification evaluation, a gold evidence article appeared in the top 10 retrieved articles in 90% of cases, and generation quality tracked retrieval accuracy at that rate (Lewis et al., 2020). In a study of three real-world RAG deployments across research, education, and biomedical domains, four of seven identified failure points were retrieval-related: missing content in the index, relevant documents not reaching the top ranks, failure to consolidate across passages, and extraction errors from retrieved text. Only three originated in the generation stage (Barnett et al., 2024).
The asymmetry is structural. A generation model can only work with the context it receives. If retrieval surfaces the wrong passages, no amount of prompt engineering can recover the answer; if it surfaces the right ones, a mid-tier LLM often does fine.
Retrieved-but-irrelevant is worse than random
The Cuconasu finding reframes the RAG retrieval objective, but not as a simple shift to precision. The LLM treats confidently retrieved passages as authoritative, which amplifies any precision failure. At the same time, any answer-bearing passage that never enters the candidate set cannot be recovered downstream. Recall@100 at the first stage sets an absolute ceiling on generation quality.
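That ceiling is easy to make concrete. Recall@k is just the fraction of answer-bearing passages that make it into the top-k candidate pool; any passage missing from the pool is unrecoverable no matter how good the reranker or the LLM is. A minimal sketch (toy passage IDs, not from any real corpus):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of answer-bearing passages present in the top-k candidates."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Toy example: 3 answer-bearing passages exist, only 2 entered the pool.
retrieved = ["p7", "p2", "p9", "p4", "p1"]
relevant = {"p2", "p1", "p8"}
print(recall_at_k(retrieved, relevant, k=5))
```

Here p8 never entered the candidate set, so whatever the downstream stages do, at most two thirds of the evidence can reach the LLM.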
The resolution is a two-stage pipeline. Stage one, hybrid retrieval, is explicitly recall-first: pull 50 to 100 candidates from BM25 and vector legs in parallel so that every passage needed to answer the query is somewhere in the pool, even at the cost of including irrelevant ones. Stage two, cross-encoder reranking, is where precision is recovered: aggressively filter that pool down to 5 to 10 passages, pushing hard negatives (topically related but factually irrelevant passages) below the cutoff so they never reach the LLM. A high-recall retriever with a loose cross-encoder reranker produces worse RAG output than the same retriever paired with a stricter one. You need both halves.
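The shape of that pipeline fits in a few lines. The sketch below shows only the control flow; `bm25_search`, `vector_search`, and `cross_encoder_score` are hypothetical stand-ins for real backends, passed in as callables so the recall-first union and the precision-second cut are visible on their own:

```python
def hybrid_candidates(query, bm25_search, vector_search, pool_size=100):
    """Stage one, recall-first: union both legs into one candidate pool.
    Irrelevant candidates are acceptable here; missing ones are not."""
    pool = {}
    for doc_id, score in bm25_search(query, k=pool_size):
        pool[doc_id] = max(pool.get(doc_id, 0.0), score)
    for doc_id, score in vector_search(query, k=pool_size):
        pool[doc_id] = max(pool.get(doc_id, 0.0), score)
    return list(pool)

def rerank(query, candidates, cross_encoder_score, keep=8):
    """Stage two, precision-second: aggressively cut the pool so hard
    negatives fall below the cutoff and never reach the LLM context."""
    ranked = sorted(candidates,
                    key=lambda doc_id: cross_encoder_score(query, doc_id),
                    reverse=True)
    return ranked[:keep]
```

The asymmetry in the comments is the whole design: stage one is allowed to be sloppy in one direction only (extra candidates), stage two in the other (aggressive cuts), and neither stage compensates for the other's failure mode.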
What hybrid adds in the RAG setting
RAG retrieval is usually pitched as a pure-vector problem. That view understates the class of queries only hybrid retrieval handles cleanly. BM25 catches exact entity matches, numerical constraints, versioned identifiers, and other tokens where substitution is not acceptable, precisely the queries where the exact-match degradation covered in the post on vector search failure modes shows up. Vector retrieval catches semantic paraphrases, synonym variation, and rewordings where the surface tokens never align. Running only one leg risks missing a substantial fraction of relevant passages the other leg would have caught, and those misses are absorbed into the recall ceiling before reranking ever runs.
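Combining the two legs does not require comparing their raw scores, which live on different scales. Reciprocal rank fusion, one common choice (the document IDs below are illustrative), merges the ranked lists using only rank positions:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1/(k + rank) per
    document, so raw BM25 and cosine scores never need to be compared."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

bm25_leg = ["sku-4411", "faq-2", "guide-7"]   # exact identifier match wins
vector_leg = ["guide-7", "blog-3", "faq-2"]   # paraphrase match wins
print(rrf_fuse([bm25_leg, vector_leg]))       # both-leg documents rise
```

Documents found by both legs (guide-7, faq-2) accumulate two reciprocal-rank terms and float to the top, while each leg's exclusive finds stay in the pool, which is exactly the recall-first behavior stage one needs.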
The unit of retrieval in RAG is also different: chunks, not documents. Chunking size and overlap interact with lexical and dense signals in different ways. BM25 depends on term frequency and length normalization, so short chunks amplify each term occurrence while long chunks dilute it. Vector retrieval compresses a chunk into a single embedding regardless of length. The same chunk granularity can produce very different recall on each leg, so chunking has to be tuned against both.
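The length-normalization effect is visible directly in the Okapi BM25 term formula. The numbers below are illustrative (a 200-token average chunk length, default k1 and b), but the direction of the effect is the formula's own:

```python
def bm25_term_score(tf, doc_len, avg_len, idf=1.0, k1=1.2, b=0.75):
    """Okapi BM25 contribution of a single term to a chunk's score."""
    length_norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)

# One occurrence of the same term, 50-token chunk vs 400-token chunk,
# with an average chunk length of 200 tokens.
short_chunk = bm25_term_score(tf=1, doc_len=50, avg_len=200)
long_chunk = bm25_term_score(tf=1, doc_len=400, avg_len=200)
print(short_chunk, long_chunk)  # the short chunk scores roughly 2x higher
```

An embedding model sees no analogous penalty: the 400-token chunk compresses into one vector just like the 50-token one. The same chunking decision therefore shifts BM25 recall and vector recall in different directions, which is why the two have to be tuned together.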
The unresolved tension
The diagnosis is now clear: retrieval quality bounds RAG output, and hard negatives are not just unhelpful, they are actively harmful. What is not yet clear is how to turn that diagnosis into a design. A recall-first hybrid retriever followed by an aggressive cross-encoder reranker is the right shape, but shape is not enough. Chunking interacts differently with BM25 and vector retrieval, and no controlled study has isolated the effect across both legs simultaneously. LLMs attend in a U-shaped pattern across the context window, which means a passage at rank 2 and one at rank 19 may be read equally well while a passage at rank 10 is ignored, a property no standard ranking metric captures. Faithfulness and distractor penalty, the two dimensions that matter most in RAG, are invisible to NDCG and MAP, which is why frameworks like RAGAS, RAGChecker, and metrics like UDCG exist at all.
So the question is not whether retrieval matters. The question is what a retrieve-rerank-stuff pipeline looks like when every stage is tuned for a consumer that is not a human, evaluated against metrics your search team does not already run. That is where the design work begins.
Related chapter
Chapter 18: Hybrid Search for RAG Pipelines
Choosing an LLM matters far less in a RAG system than ensuring the retrieval layer surfaces the right passages at the right granularity. This chapter adapts the general hybrid retrieval architecture to the RAG setting, where the retrieval unit becomes a chunk rather than a full document, the downstream consumer is an LLM instead of a human skimming a result list, and the success bar moves from ranking quality toward generation faithfulness and groundedness.
Laszlo Csontos
Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.