Glossary

A practitioner's reference for the terms used throughout the book and blog. Each entry links to the chapter or post where the concept appears in depth.

A

ANN (Approximate Nearest Neighbor)

A family of algorithms that trade a small amount of recall for large gains in query latency when searching vector indexes. Production vector search almost always uses ANN (HNSW, IVF, ScaNN, or a proprietary variant) because exact nearest-neighbor search does not scale past a few million vectors. The engineering question is never whether to use ANN, but which variant and how to tune its recall-latency knobs.

Learn more: HNSW Parameter Tuning: M, efConstruction, efSearch Explained (Chapter 15)

See also: HNSW, Vector index, Vector search

B

BEIR

A heterogeneous benchmark for zero-shot information retrieval, covering 18 public datasets across domains like biomedical search, fact-checking, and question answering. BEIR is the standard way to evaluate whether a retrieval model generalizes beyond the training distribution, and it was the first benchmark to show clearly that BM25 remains competitive with neural methods out of domain. Treat BEIR scores as a floor, not a ceiling: a model that does well on BEIR can still flop on your corpus.

Learn more: Embedding Model Selection: MTEB Rank Is Not Enough (Chapter 8)

See also: MTEB, Test collection, NDCG

Bi-encoder

An architecture that encodes the query and each document independently into fixed-length vectors, then ranks by a similarity function (usually cosine or dot product). Bi-encoders are cheap at query time because document vectors are computed once and indexed, but they cannot capture fine-grained query-document interactions the way a cross-encoder can. Most production embedding models are bi-encoders.

Learn more: Cross-Encoder Reranking: The Highest-Leverage Stage in Hybrid Search (Chapter 6)

See also: Cross-encoder, Embedding, Dense retrieval, Late interaction

BM25

A probabilistic ranking function used by virtually every mainstream lexical search engine (Lucene, Elasticsearch, OpenSearch, Solr, Tantivy). BM25 scores documents by how well their term frequencies, scaled by inverse document frequency and a length normalizer, match the query. It is a remarkably strong baseline that survives most attempts to replace it, and it remains a mandatory component of hybrid retrieval.

Learn more: The Vocabulary Mismatch Problem: Why BM25 Fails Silently (Chapter 1)

See also: TF-IDF, Inverted index, Lexical search, Sparse retrieval
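The scoring formula is short enough to sketch directly. The corpus and query below are toy data, and k1 = 1.2, b = 0.75 are the conventional defaults; real engines also differ slightly in their IDF variant (this sketch uses a floor-at-zero form):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with BM25.

    corpus is a list of tokenized documents, used for IDF and
    average document length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # non-negative IDF variant
        tf = doc_terms.count(term)                       # term frequency in this doc
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

corpus = [
    ["error", "code", "e501", "line", "too", "long"],
    ["semantic", "search", "with", "embeddings"],
    ["bm25", "ranking", "for", "lexical", "search"],
]
print(bm25_score(["lexical", "search"], corpus[2], corpus))
```

Note how a document containing neither query term scores exactly zero — the "fails silently" behavior discussed in Chapter 1.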

C

Chunking

The process of splitting source documents into smaller passages before indexing, so retrieval returns units that are useful to a downstream consumer (usually an LLM in a RAG pipeline). Chunk size, overlap, and boundary strategy (fixed tokens, sentence splitter, semantic segmentation) all affect recall and the groundedness of generated answers. Chunking is often the highest-leverage knob in a RAG system and is under-invested in by teams who treat retrieval as a solved problem.

Learn more: In RAG, Retrieval Quality Beats Generation Quality Per Dollar (Chapter 18)

See also: RAG (Retrieval-Augmented Generation), Embedding, Dense retrieval
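A minimal fixed-token chunker with overlap, using whitespace tokens as a stand-in for a real tokenizer. The sizes are illustrative, not recommendations:

```python
def chunk_tokens(text, chunk_size=200, overlap=40):
    """Split text into fixed-size, overlapping token windows."""
    tokens = text.split()
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end of the document
    return chunks

doc = " ".join(f"tok{i}" for i in range(500))
chunks = chunk_tokens(doc, chunk_size=200, overlap=40)
print(len(chunks))  # windows start at tokens 0, 160, 320
```

Overlap exists so that a fact straddling a chunk boundary still appears whole in at least one chunk; sentence-aware and semantic splitters refine the boundaries, not the basic shape.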

ColBERT

A late-interaction retrieval model that encodes queries and documents as sets of token-level vectors and scores them with a MaxSim operator, rather than collapsing each side to a single vector. ColBERT captures finer lexical signal than a bi-encoder while staying cheaper than a full cross-encoder, which makes it a popular choice for high-quality reranking. The trade-off is much larger index storage, because every token vector must be persisted.

Learn more: Cross-Encoder Reranking: The Highest-Leverage Stage in Hybrid Search (Chapter 6)

See also: Late interaction, Bi-encoder, Cross-encoder, Reranking
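The MaxSim operator is simple to state in code. A sketch with toy two-dimensional token vectors; a real system scores normalized model outputs with batched matrix multiplies:

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query token vector,
    take its maximum dot product over all document token vectors,
    then sum those maxima over the query tokens."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]   # two query token vectors
doc_a = [[0.9, 0.1], [0.2, 0.8]]   # has a close match for each query token
doc_b = [[0.5, 0.5], [0.5, 0.5]]   # matches neither token strongly
print(maxsim_score(query, doc_a))  # 0.9 + 0.8 = 1.7
print(maxsim_score(query, doc_b))  # 0.5 + 0.5 = 1.0
```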

Contrastive loss

A training objective that pulls semantically related pairs (a query and a relevant document) together in vector space while pushing unrelated pairs apart. InfoNCE and triplet loss are the common variants. Contrastive training is the dominant recipe for producing sentence and document embeddings, and its quality depends heavily on the quality of the negative examples fed to it.

Learn more: The Negatives You Train On Decide Your Embedding Model's Ceiling (Chapter 9)

See also: Embedding, Bi-encoder

Cosine similarity

A similarity measure computed as the dot product of two vectors divided by the product of their magnitudes, so it depends only on the angle between vectors and ignores scale. Most retrieval embedding models are trained to produce unit-normalized vectors so that cosine similarity reduces to a dot product. When vectors are normalized, cosine similarity and dot product rank results identically.

Learn more: Embedding Model Selection: MTEB Rank Is Not Enough (Chapter 8)

See also: Dot product, Embedding, Vector search
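A quick check of the equivalence, with hand-picked vectors:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a, b = [3.0, 4.0], [4.0, 3.0]
an, bn = normalize(a), normalize(b)
dot_normed = sum(x * y for x, y in zip(an, bn))
print(cosine(a, b), dot_normed)  # both ~0.96: identical after normalization
```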

Cross-encoder

A ranking model that jointly encodes a query and a candidate document and emits a single relevance score, rather than encoding them independently. Cross-encoders capture term-level interactions that bi-encoders miss, but they cannot be precomputed at index time, so their cost scales with the number of query-document pairs scored. They are the workhorse of the reranking stage in hybrid pipelines.

Learn more: Cross-Encoder Reranking: The Highest-Leverage Stage in Hybrid Search (Chapter 6)

See also: Bi-encoder, Reranking, Late interaction, ColBERT

D

Dense retrieval

Retrieval that ranks documents by the similarity of their dense embedding vectors to a dense query embedding, produced by a neural encoder. Dense retrieval handles semantic matches that lexical methods miss, at the cost of a fixed-size representation that can lose exact-match signal. It is half of a hybrid retrieval system; the other half is lexical.

Learn more: Vector Search Always Returns Something, Even When It Should Not (Chapter 2)

See also: Sparse retrieval, Embedding, Vector search, Semantic search

Dimensionality reduction

Techniques for compressing a high-dimensional vector representation into a smaller one with minimal loss of signal. In hybrid search the relevant instances are Matryoshka-style truncation, PCA, and learned projection heads used to reduce embedding storage and speed up ANN queries. Dimensionality reduction always costs some recall; the question is whether that cost is small enough to justify the savings.

Learn more: Vector Cost Optimization: Matryoshka and Quantization Without the Hype (Chapter 17)

See also: Matryoshka embeddings, Product quantization, Scalar quantization

Distillation

Training a smaller student model to imitate a larger teacher model, usually by matching the teacher's output distribution or intermediate representations. In hybrid search, distillation is the standard way to compress cross-encoder rerankers into bi-encoders, or to shrink a large reranker into one that fits a tight latency budget. A well-distilled student can recover most of the teacher's ranking quality at a fraction of the cost.

Learn more: Reranker Distillation: Cross-Encoder Quality at a Fraction of the Latency (Chapter 10)

See also: Cross-encoder, Reranking, Bi-encoder

Dot product

The sum of the element-wise products of two vectors. Many ANN backends are optimized for dot-product similarity because it maps cleanly to SIMD and matrix multiplication. When embedding vectors are unit-normalized, the dot product ranks results identically to cosine similarity, which is why normalized dot product is the default similarity measure in most production vector indexes.

Learn more: Embedding Model Selection: MTEB Rank Is Not Enough (Chapter 8)

See also: Cosine similarity, Embedding, Vector search

E

Embedding

A learned fixed-length vector that represents a piece of text (or image, or other modality) so that semantically similar inputs produce nearby vectors. Embeddings are the core primitive of dense retrieval: documents are embedded offline and indexed, queries are embedded at runtime, and similarity in vector space proxies for relevance. Choice of embedding model has more blast radius than almost any other decision in a search stack because switching models means reindexing the corpus.

Learn more: Embedding Model Selection: MTEB Rank Is Not Enough (Chapter 8)

See also: Bi-encoder, Dense retrieval, Sentence embeddings, Vector search

Embedding drift

The gradual divergence between the distribution of query or document embeddings in production and the distribution the embedding model was trained or evaluated on. Drift shows up as slow, silent relevance degradation: no single query fails loudly, but the aggregate quality of top-k results slips. Detecting drift requires monitoring distributional statistics of embeddings in production, not just query latency and click-through rate.

Learn more: Embedding Drift Monitoring: Search-Specific Model Degradation (Chapter 16)

See also: Embedding, Dense retrieval

Enterprise search

Search over a company's internal document corpora: wikis, ticket systems, code, design docs, email, shared drives, HR policies. The defining constraints are heterogeneous content, per-document access control, compliance and audit requirements, and connector plumbing to dozens of source systems. Retrieval quality is only half the problem; the other half is making the content reachable and filterable in the first place.

Learn more: Enterprise Search Access Control: Decide at Index Time, Not Query Time (Chapter 20)

See also: Vector index, RAG (Retrieval-Augmented Generation)

Entity recognition

Identifying mentions of entities (people, organizations, products, SKUs, locations) in a query or document. In hybrid search, entity recognition feeds query understanding: once the system knows a token is a product code, it can route the query to exact-match retrieval instead of semantic search. Entity confusion is one of the characteristic failure modes of dense retrieval, so explicit entity handling is often the fix.

Learn more: Query Understanding: Why One Retrieval Path Is Never Enough (Chapter 5)

See also: Query expansion, Query routing, Lexical search

F

Fusion (search)

The step in a hybrid pipeline that combines ranked result lists from multiple retrievers (typically one lexical, one dense) into a single merged ranking. The three common approaches are Reciprocal Rank Fusion, weighted linear score interpolation, and learned fusion via a small model trained on labeled query-document pairs. Fusion papers over retriever-specific score-calibration differences, and it is usually the first place to look when a hybrid ranking behaves strangely.

Learn more: Hybrid Search vs Vector Search: Where Each Actually Wins (Chapter 3)

See also: Reciprocal Rank Fusion (RRF), Hybrid search, Reranking

G

GraphRAG

A family of retrieval approaches that build a knowledge graph over the corpus and use graph structure (entities, relations, communities) to improve the retrieval step in RAG. GraphRAG helps with queries that require connecting facts across documents, where flat vector retrieval picks semantically similar passages but misses the structural relationship. It is expensive to build and maintain, and is worth the investment only when graph-shaped questions dominate the workload.

Learn more: In RAG, Retrieval Quality Beats Generation Quality Per Dollar (Chapter 18)

See also: RAG (Retrieval-Augmented Generation), Dense retrieval, Entity recognition

H

HNSW

Hierarchical Navigable Small World, a graph-based ANN index that is the de-facto default for vector search. HNSW builds a multi-layer proximity graph at index time and greedily navigates it at query time. Its three key parameters (M, efConstruction, efSearch) trade off memory, build time, and the recall-latency curve; most production tuning effort on vector indexes is spent on these three knobs.

Learn more: HNSW Parameter Tuning: M, efConstruction, efSearch Explained (Chapter 15)

See also: ANN (Approximate Nearest Neighbor), Vector index, Vector search

Hybrid search

Retrieval that combines a lexical method (BM25 or equivalent) with a dense method (vector search over embeddings), then fuses the two result lists into a single ranking. Hybrid search consistently beats either approach alone on standard benchmarks because lexical and dense retrieval fail on largely disjoint sets of queries. The engineering question is never whether hybrid works, but whether the operational complexity is justified for the specific workload.

Learn more: Hybrid Search vs Vector Search: Where Each Actually Wins (Chapter 3)

See also: Fusion (search), Reciprocal Rank Fusion (RRF), Dense retrieval, Sparse retrieval

I

Interleaving experiment

An online evaluation method that mixes results from two ranking systems into a single result list for each user, then infers which system is better from per-query click wins. Interleaving is dramatically more sensitive than A/B testing for detecting ranking differences, because it controls for query-level variance rather than averaging over it. The trade-off is implementation complexity and the fact that interleaving only measures ranking quality, not downstream business metrics.

Learn more: Interleaving vs A/B Tests: Why Ranking Experiments Are Different (Chapter 13)

See also: NDCG, MRR (Mean Reciprocal Rank), Test collection

Inverted index

A data structure that maps each term in the vocabulary to a posting list of documents containing that term, along with per-occurrence metadata like position and term frequency. Every mainstream lexical search engine is built on an inverted index, and BM25 scoring walks posting lists at query time. Inverted indexes prefer a steady stream of small writes, which puts them at odds with ANN indexes that prefer infrequent large rebuilds.

Learn more: The Vocabulary Mismatch Problem: Why BM25 Fails Silently (Chapter 1)

See also: BM25, TF-IDF, Lexical search, Sparse retrieval
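A toy positional inverted index makes the structure concrete; real engines add posting-list compression, skip lists, and segment merging on top of the same shape:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, positions).

    docs is a list of pre-tokenized documents; doc_id is the list index.
    """
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(tokens):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return index

docs = [
    ["hybrid", "search", "combines", "lexical", "and", "dense", "search"],
    ["dense", "retrieval", "uses", "embeddings"],
]
index = build_inverted_index(docs)
print(index["search"])  # [(0, [1, 6])] -- term frequency and positions
print(index["dense"])   # [(0, [5]), (1, [0])]
```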

L

Late interaction

A retrieval architecture that defers the query-document interaction until scoring time, using per-token vectors on both sides, rather than collapsing each side to a single vector. ColBERT is the canonical example. Late interaction captures finer signal than a bi-encoder while staying cheaper than a full cross-encoder, at the cost of much larger index storage.

Learn more: Cross-Encoder Reranking: The Highest-Leverage Stage in Hybrid Search (Chapter 6)

See also: ColBERT, Bi-encoder, Cross-encoder

Lexical search

Retrieval that matches surface-form tokens in the query against surface-form tokens in documents, typically via an inverted index scored by BM25 or a similar function. Lexical search is fast, cheap, interpretable, and excellent at exact-term matches (product codes, error strings, named entities). Its main weakness is vocabulary mismatch: users and authors rarely choose the same words for the same concept.

Learn more: The Vocabulary Mismatch Problem: Why BM25 Fails Silently (Chapter 1)

See also: BM25, Inverted index, Sparse retrieval, TF-IDF

M

Matryoshka embeddings

Embeddings trained so that truncating the vector to a shorter prefix still yields a useful representation, just at reduced quality. Matryoshka training lets a single model serve multiple cost-quality trade-offs: the full vector for offline reranking, a short prefix for a fast ANN first pass. In production, Matryoshka truncation combined with scalar quantization can cut vector index cost by 50-75% with recall loss under 2%.

Learn more: Vector Cost Optimization: Matryoshka and Quantization Without the Hype (Chapter 17)

See also: Dimensionality reduction, Embedding, Scalar quantization, Product quantization
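Truncation itself is trivial, as the sketch below shows; the catch is that arbitrary prefixes of an ordinary embedding are not useful, so this only works when the model was trained with a Matryoshka objective (the vector here is synthetic):

```python
import math

def truncate_and_renormalize(vec, dims):
    """Keep the first `dims` components of a Matryoshka-style embedding
    and re-normalize to unit length so dot-product search still works."""
    prefix = vec[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

full = [0.5, 0.5, 0.5, 0.5]  # pretend 4-dim embedding, already unit norm
short = truncate_and_renormalize(full, 2)
print(short)                        # unit-length 2-dim prefix
print(sum(x * x for x in short))    # squared norm is 1 again
```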

MRR (Mean Reciprocal Rank)

The average of 1/rank over a set of queries, where rank is the position of the first relevant result. MRR is sensitive only to the top result, which makes it the right metric for tasks with a single correct answer (navigational queries, known-item search) and the wrong metric when users benefit from multiple relevant results. It is often reported alongside NDCG because they can move in opposite directions.

Learn more: Search Quality Metrics: Optimizing One Number Is Dangerous (Chapter 11)

See also: NDCG, Recall@k, Test collection
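A direct implementation over toy ranked lists:

```python
def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank over queries: 1/rank of the first relevant
    result per query, contributing 0 if nothing relevant is returned."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

ranked = [["d3", "d1", "d2"], ["d5", "d6", "d7"]]
relevant = [{"d1"}, {"d9"}]   # the second query finds nothing relevant
print(mrr(ranked, relevant))  # (1/2 + 0) / 2 = 0.25
```

Note that the second relevant doc in a list contributes nothing, which is exactly why MRR is wrong for tasks where users want several results.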

MTEB

The Massive Text Embedding Benchmark, a large public leaderboard that scores embedding models across dozens of tasks (retrieval, reranking, classification, clustering, STS). MTEB is useful as a sanity check and as a way to filter down candidate models, but leaderboard position is a poor predictor of performance on your own corpus. Treat MTEB rank as a starting shortlist, not a verdict.

Learn more: Embedding Model Selection: MTEB Rank Is Not Enough (Chapter 8)

See also: BEIR, Embedding, NDCG

N

NDCG

Normalized Discounted Cumulative Gain, a ranking metric that rewards placing highly relevant documents near the top of the result list, with a logarithmic position discount. NDCG is the standard retrieval evaluation metric because it handles graded relevance and penalizes poor ordering, not just missing results. Report NDCG@k for a k that matches how many results users actually see.

Learn more: Search Quality Metrics: Optimizing One Number Is Dangerous (Chapter 11)

See also: MRR (Mean Reciprocal Rank), Recall@k, Test collection
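A sketch using the exponential-gain variant (2^rel - 1); some toolkits use linear gain, so check which convention your reported numbers assume. For simplicity the list below covers every judged document for the query:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with exponential gain 2^rel - 1."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_rels, k):
    """ranked_rels: graded relevance of each result, in rank order."""
    ideal_dcg = dcg(sorted(ranked_rels, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(ranked_rels[:k]) / ideal_dcg

# Putting the grade-2 doc at rank 3 instead of rank 1 costs real NDCG:
print(ndcg_at_k([0, 1, 2], k=3))  # ~0.59
print(ndcg_at_k([2, 1, 0], k=3))  # ideal ordering -> 1.0
```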

P

Product quantization

A vector compression technique that splits each embedding into sub-vectors, clusters each subspace independently, and stores the cluster id per sub-vector instead of the raw values. Product quantization can cut vector memory by 8-32x with a measurable but often acceptable recall hit. It is the workhorse behind large-scale vector indexes that would not fit in memory at full precision.

Learn more: Vector Cost Optimization: Matryoshka and Quantization Without the Hype (Chapter 17)

See also: Scalar quantization, Matryoshka embeddings, Vector index

Q

Query expansion

Rewriting the user's query to include synonyms, related terms, or generated paraphrases before retrieval. Classic query expansion uses thesauri or pseudo-relevance feedback; modern variants use an LLM to produce multiple reformulations or a HyDE-style hypothetical answer that is embedded and retrieved against. Expansion improves recall but can hurt precision if not paired with a downstream reranker.

Learn more: Query Understanding: Why One Retrieval Path Is Never Enough (Chapter 5)

See also: Query routing, Entity recognition, RAG (Retrieval-Augmented Generation)

Query routing

The query-understanding stage that classifies an incoming query and dispatches it to the right retrieval path. A short product code goes to exact-match lexical; a long natural-language question goes to dense retrieval or a full hybrid pipeline; an out-of-domain query is rejected. Routing is a cheap way to avoid spending expensive model inference on queries that do not need it.

Learn more: Query Understanding: Why One Retrieval Path Is Never Enough (Chapter 5)

See also: Query expansion, Entity recognition, Hybrid search

R

RAG (Retrieval-Augmented Generation)

A pattern where a retrieval system fetches relevant passages from a corpus and hands them to an LLM as grounding context for generation. RAG is the dominant way to ship LLM applications over private or freshly-updated data, and retrieval quality, not the choice of LLM, is usually the bottleneck on output quality. Hybrid retrieval and careful chunking tend to matter more than model size.

Learn more: In RAG, Retrieval Quality Beats Generation Quality Per Dollar (Chapter 18)

See also: Chunking, Dense retrieval, Hybrid search, GraphRAG

Recall@k

The fraction of all relevant documents that appear in the top k results, averaged over queries. Recall@k is the right metric for the first retrieval stage in a hybrid pipeline, because the reranker downstream can only promote documents that retrieval actually surfaced. A low recall@100 puts a hard ceiling on end-to-end quality that no reranker can raise.

Learn more: Search Quality Metrics: Optimizing One Number Is Dangerous (Chapter 11)

See also: NDCG, MRR (Mean Reciprocal Rank), Test collection
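The computation itself, over toy judged data for a single query:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant set that appears in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

retrieved = ["d7", "d2", "d9", "d4", "d1"]  # ranked doc ids from retrieval
relevant = {"d2", "d4", "d8"}               # judged relevant for this query
print(recall_at_k(retrieved, relevant, k=3))  # only d2 found -> 1/3
print(recall_at_k(retrieved, relevant, k=5))  # d2 and d4 -> 2/3
```

Here d8 is never retrieved at any cutoff shown, which is the ceiling effect the entry describes: no reranker can recover it.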

Reciprocal Rank Fusion (RRF)

A fusion method that combines ranked result lists by summing 1/(k + rank_i) across lists for each document, where k is a small constant (usually 60). RRF ignores raw scores and depends only on rank positions, which makes it robust to the score-calibration differences between lexical and dense retrievers. It is the default fusion method in most production hybrid systems because it works well out of the box with almost no tuning.

Learn more: Hybrid Search vs Vector Search: Where Each Actually Wins (Chapter 3)

See also: Fusion (search), Hybrid search
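The whole method fits in a few lines; k = 60 is the constant from the original RRF paper, and the doc ids below are toy data:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists of doc ids by summing 1 / (k + rank).

    Ranks are 1-based; raw retriever scores are ignored entirely.
    """
    scores = defaultdict(float)
    for ranked in result_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d1", "d2", "d3"]  # BM25 ranking
dense = ["d1", "d3", "d4"]    # vector search ranking
print(reciprocal_rank_fusion([lexical, dense]))
```

d1 wins because it ranks high in both lists; d3 beats d2 because appearing in two lists outweighs a single better rank. That rank-only behavior is what makes RRF immune to score-calibration mismatches.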

Reranking

A second-stage scoring step that takes the top-N results from first-stage retrieval and reorders them with a heavier, higher-quality model, typically a cross-encoder. Reranking is the single highest-leverage stage in a hybrid pipeline because it operates on a shortlist and can use per-pair computation that is too expensive for first-stage retrieval. Stacking multiple rerankers rarely pays off under a fixed latency budget.

Learn more: Cross-Encoder Reranking: The Highest-Leverage Stage in Hybrid Search (Chapter 6)

See also: Cross-encoder, Bi-encoder, Late interaction, ColBERT

S

Scalar quantization

A vector compression technique that maps each float32 dimension to a smaller integer type (int8 is common). Scalar quantization cuts memory 4x with negligible recall loss when the embedding distribution is well-behaved, and is often the first quantization step applied in production vector indexes. It composes cleanly with Matryoshka truncation and product quantization for additional savings.

Learn more: Vector Cost Optimization: Matryoshka and Quantization Without the Hype (Chapter 17)

See also: Product quantization, Matryoshka embeddings, Vector index
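A per-vector symmetric int8 sketch; production indexes usually calibrate the scale over a sample of the corpus rather than per vector:

```python
def quantize_int8(vec):
    """Map float components to the int8 range with a per-vector scale."""
    scale = max(abs(x) for x in vec) / 127.0
    return [round(x / scale) for x in vec], scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 codes."""
    return [x * scale for x in q]

vec = [0.12, -0.98, 0.45, 0.03]
q, scale = quantize_int8(vec)
approx = dequantize(q, scale)
print(q)                               # small integers, 1 byte each
print([round(x, 3) for x in approx])   # close to the original floats
```

The 4x saving comes from storing one byte per dimension instead of four; the reconstruction error above is on the order of the scale, which is why well-behaved (bounded, roughly symmetric) embedding distributions lose almost no recall.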

Semantic search

Retrieval that ranks by semantic similarity between query and documents, typically implemented via dense embeddings and vector search. Semantic search solves the vocabulary mismatch problem that plagues lexical retrieval, at the cost of a new class of failures (entity confusion, hallucinated similarity, blindness to negation). It is a useful marketing term and a fuzzy technical one; in practice, semantic search means dense retrieval.

Learn more: Vector Search Always Returns Something, Even When It Should Not (Chapter 2)

See also: Dense retrieval, Vector search, Embedding, Vocabulary mismatch

Sentence embeddings

Embeddings produced by models trained specifically to map whole sentences or short passages into a shared vector space where similarity corresponds to semantic relatedness. Sentence-BERT kicked off the modern line of this work; current best-in-class models are trained with contrastive loss on massive web-scale pairs. Most retrieval embedding models are sentence-embedding models under the hood.

Learn more: Embedding Model Selection: MTEB Rank Is Not Enough (Chapter 8)

See also: Embedding, Bi-encoder, Contrastive loss, Dense retrieval

Sparse retrieval

Retrieval based on sparse representations where most dimensions are zero, historically via inverted-index lookups on term weights (TF-IDF, BM25) and more recently via learned sparse models like SPLADE that use a language model to predict term weights over the vocabulary. Sparse retrieval preserves exact-match signal that dense embeddings lose and plays a complementary role in hybrid pipelines. Learned sparse models try to close the gap with dense retrieval while keeping the operational advantages of an inverted index.

Learn more: The Vocabulary Mismatch Problem: Why BM25 Fails Silently (Chapter 1)

See also: BM25, TF-IDF, Inverted index, Dense retrieval, Lexical search

T

Test collection

A dataset of queries paired with relevance judgments over a document corpus, used to compute offline retrieval metrics. The Cranfield and TREC traditions produced the test collections that still anchor IR evaluation today. A good internal test collection, even a small one, is the single most important tool for preventing silent regressions in a search system.

Learn more: LLM-as-Judge for Search: Scalable, Biased, Calibratable (Chapter 12)

See also: NDCG, MRR (Mean Reciprocal Rank), Recall@k, BEIR

TF-IDF

Term-frequency times inverse-document-frequency, the classic sparse term weighting scheme that scores a document by how often a query term appears, scaled down by how common the term is across the corpus. BM25 is a refinement of TF-IDF with saturation and length normalization. TF-IDF is rarely used directly in production anymore, but the idea underlies most lexical ranking functions.

Learn more: The Vocabulary Mismatch Problem: Why BM25 Fails Silently (Chapter 1)

See also: BM25, Inverted index, Lexical search, Sparse retrieval

V

Vector index

A data structure that stores embeddings and supports fast approximate nearest-neighbor queries over them. HNSW is the most common implementation; IVF, ScaNN, and DiskANN are the other common backends. Vector indexes prefer infrequent large rebuilds, which puts them at odds operationally with inverted indexes that want a steady stream of small writes.

Learn more: HNSW Parameter Tuning: M, efConstruction, efSearch Explained (Chapter 15)

See also: HNSW, ANN (Approximate Nearest Neighbor), Vector search, Product quantization

Vector search

Retrieval by approximate nearest neighbor over a vector index of dense embeddings. Vector search handles semantic matches that lexical retrieval misses, but it always returns results (the nearest neighbor always exists), which is why it can be confidently wrong on queries that lexical search would simply fail to match. Pairing it with a lexical retriever is the whole point of hybrid search.

Learn more: Vector Search Always Returns Something, Even When It Should Not (Chapter 2)

See also: ANN (Approximate Nearest Neighbor), HNSW, Vector index, Dense retrieval, Semantic search

Vocabulary mismatch

The empirical observation that two people rarely choose the same term for the same concept, so users and document authors routinely describe the same idea with different words. Furnas et al. (1987) measured agreement below 20% across domains. Vocabulary mismatch is the core reason lexical retrieval fails silently on a substantial fraction of real queries, and it is the problem dense retrieval was designed to solve.

Learn more: The Vocabulary Mismatch Problem: Why BM25 Fails Silently (Chapter 1)

See also: BM25, Lexical search, Semantic search, Dense retrieval