The Vocabulary Mismatch Problem: Why BM25 Fails Silently
Users pick the same word for the same concept less than 20% of the time. Keyword search cannot bridge that gap, and most systems never measure it.
A search system can look perfectly healthy from the outside while failing silently on a substantial fraction of its queries. The reason is not a bug; it is a thirty-year-old assumption: that users and authors will pick the same words for the same ideas. They do not, and that mismatch is the single most important reason hybrid search exists. The rest of this piece lays out the case for combining lexical and dense retrieval, starting from that one observation.
Users and authors rarely agree on words
The foundational study of this phenomenon is still the one everyone cites. When two people were asked to name the same object, they chose the same term under 20% of the time (Furnas et al., 1987), with an earlier, more technical version of the same work reporting agreement of only 10-20% across similar tasks (Furnas et al., 1983). Put differently, if a designer picks one label for a concept, most users searching for that concept will type something else entirely. An inverted index that matches on literal tokens is therefore guaranteed to miss a large fraction of relevant documents, not because the ranking is wrong, but because the query and the document never share a token in the first place.
The effect compounds inside real collections. A downstream retrieval study found that an average query term fails to appear in 30-80% of the documents that are actually relevant to the query (Zhao et al., 2010). When every term must match, as in conjunctive (AND) retrieval, the probability of missing a relevant document grows multiplicatively with each additional term.
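The arithmetic behind that compounding is worth making concrete. The sketch below is a back-of-envelope model, not a measurement: it assumes each term misses independently, and the 30% per-term miss rate is taken from the optimistic end of the range above.

```python
# Toy model: assume each query term independently fails to appear in a
# relevant document with probability `miss_per_term`. A conjunctive (AND)
# query retrieves the document only if every term matches.
def retrieval_chance(miss_per_term: float, n_terms: int) -> float:
    """Probability that all n_terms appear in a given relevant document."""
    return (1.0 - miss_per_term) ** n_terms

# Even at a 30% per-term miss rate, a four-term query already fails to
# retrieve roughly three out of four relevant documents.
for n in (1, 2, 3, 4):
    print(f"{n} terms -> {retrieval_chance(0.3, n):.2f} chance of matching")
```

Under this model the match probability drops from 0.70 at one term to about 0.24 at four, which is why long natural-language queries are disproportionately hurt.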
This is not a historical artifact. The BEIR benchmark, which evaluates retrievers across 18 heterogeneous datasets, confirmed that BM25 still underperforms on semantically complex tasks like duplicate question detection and open-domain QA, even as it remains competitive on entity retrieval and biomedical search (Thakur et al., 2021). The problem has shifted in shape, not in nature: whenever query vocabulary diverges from document vocabulary, BM25 has nothing to match on.
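The failure mode is easy to reproduce on a toy corpus. The following is a minimal sketch of an Okapi BM25 scorer (the two-document corpus, the query, and the non-negative IDF variant are illustrative choices, not a production implementation): a relevant document that shares no token with the query scores exactly zero, while an irrelevant one wins on a single shared token.

```python
import math
from collections import Counter

def bm25(query, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue  # term appears nowhere in the corpus: no contribution
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "affordable automobile low monthly payments".split(),  # relevant
    "cheap flights and hotel deals".split(),               # irrelevant
]
query = "cheap car".split()

# The relevant document shares no token with the query, so BM25 scores it
# exactly zero; the irrelevant one wins on the single shared token "cheap".
print(bm25(query, corpus[0], corpus))  # 0.0
print(bm25(query, corpus[1], corpus))  # > 0
```

No tuning of k1 or b changes the outcome: both parameters only reweight tokens that already match, so a zero-overlap document stays at zero.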
Why the traditional patches are not enough
Search teams have tried to close the gap for decades with synonym dictionaries, stemming, spell correction, and query expansion. Each of these helps. None of them solves the underlying problem. Synonym lists are labor-intensive and always incomplete. Stemming folds too aggressively on some tokens and too weakly on others. Classical query expansion via pseudo-relevance feedback can drift off-topic entirely when the top results are already wrong.
The most comprehensive survey of automatic query expansion puts this bluntly: current techniques "are optimized to perform well on average, but are unstable and may cause degradation of search service for some queries" (Carpineto and Romano, 2012). That instability is precisely why major commercial engines have not adopted blind query expansion, even though it has been studied for almost half a century. The deeper issue is that these techniques stay inside the lexical paradigm. They try to predict which alternative tokens the user might have chosen, rather than matching on meaning. A thesaurus can list "car, automobile, vehicle"; it cannot know that "something to get the kids to school" belongs in the same cluster.
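The limitation is visible in a few lines. A naive expansion step (the one-entry dictionary below is obviously a toy) can only help when the query already contains a key it knows about; a paraphrase passes through untouched.

```python
# Hand-built synonym dictionary: labor-intensive and, inevitably, incomplete.
SYNONYMS = {"car": {"automobile", "vehicle"}}

def expand(tokens):
    """Union each query token with its synonym-dictionary entry, if any."""
    expanded = set(tokens)
    for t in tokens:
        expanded |= SYNONYMS.get(t, set())
    return expanded

# A query containing a known key gains alternatives...
print(expand("cheap car".split()))
# ...but a paraphrase contains no dictionary key, so expansion is a no-op:
print(expand("something to get the kids to school".split()))
```

The second query comes back unchanged: expansion operates on tokens, and no token in the paraphrase is a key, which is the lexical-paradigm ceiling in miniature.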
What this means for your system
If you run a production search system, the question is not whether vocabulary mismatch is hitting you. It is whether you are measuring it. Zero-result rate is the first proxy, but it hides the more insidious failure: queries that return something, just not the right thing. Segmented metrics on head versus tail queries, or on entity-rich versus natural-language queries, expose the gap that aggregate precision hides.
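Instrumenting this takes little code. The sketch below assumes a hypothetical query log of (query, result_count) pairs; the log format, the toy data, and the 10% head cutoff are all illustrative assumptions, but the shape of the measurement carries over.

```python
from collections import Counter

def zero_result_rates(log, head_fraction=0.1):
    """Overall, head, and tail zero-result rates for a (query, n_results) log.

    Head queries are the top `head_fraction` of distinct queries by frequency;
    everything else is the tail.
    """
    freq = Counter(q for q, _ in log)
    ranked = [q for q, _ in freq.most_common()]
    head = set(ranked[: max(1, int(len(ranked) * head_fraction))])

    def rate(pairs):
        return sum(1 for _, n in pairs if n == 0) / len(pairs) if pairs else 0.0

    head_log = [(q, n) for q, n in log if q in head]
    tail_log = [(q, n) for q, n in log if q not in head]
    return rate(log), rate(head_log), rate(tail_log)

# Toy log: the head query looks healthy; the tail hides every failure.
log = [("iphone 15 case", 40)] * 8 + [
    ("something to get the kids to school", 0),
    ("car that fits five and a dog", 0),
]
overall, head, tail = zero_result_rates(log)
print(overall, head, tail)  # 0.2 0.0 1.0
```

An aggregate 20% zero-result rate already looks bad, but the segmented view is the actionable one: the head is at 0% and the tail at 100%, which is exactly the pattern vocabulary mismatch produces.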
The uncomfortable part is that the lexical fixes cannot close this gap on their own. Decades of synonym expansion, stemming, and relevance feedback have moved the average and left the tail roughly where it was. That leaves a concrete question: how large is the measurable ceiling of pure lexical retrieval, and on which slices of traffic does BM25 still outrun every alternative? Any honest answer has to hold both of those facts at once before reaching for a remedy.
Related chapter
Chapter 1: The Limits of Keyword Search
For three decades, every mainstream open-source search engine has relied on the same core idea: match the terms in a query against an inverted index and rank by a scoring function like BM25. This chapter explains how BM25 works, why that approach has been so durable, and the specific, well-documented ways it fails silently on a substantial fraction of real user queries.
You will receive the introduction and the first two chapters in PDF.
Laszlo Csontos
Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.