Reranker Distillation: Cross-Encoder Quality at a Fraction of the Latency
Distilling a large cross-encoder into a smaller model approximates its quality at a fraction of the serving cost. It is how most production rerankers actually get deployed.
A distilled MiniLM-L6 cross-encoder matches the quality of its 12-layer sibling at roughly double the throughput, and it beats an undistilled BERT-Large by 0.94 nDCG@10 points while running about 18 times faster on the same GPU. That one row in the sentence-transformers benchmark is the clearest argument for why distillation dominates production reranking stacks. A cross-encoder reranker turns first-stage candidates from a hybrid retrieval pipeline into the final ranking, which makes it the highest-accuracy and highest-leverage stage in the retrieval architecture. The question for engineers is how to get that quality-per-millisecond profile without giving up relevance.
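Since nDCG@10 is the yardstick for every comparison here, a minimal reference implementation makes the metric concrete. This is a generic sketch with toy binary relevance labels, not tied to any benchmark row above:

```python
import numpy as np

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1)
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A reranker that moves the only relevant document from rank 3 to rank 1
before = ndcg_at_k([0, 0, 1, 0, 0])   # relevant doc at rank 3 -> 0.5
after  = ndcg_at_k([1, 0, 0, 0, 0])   # relevant doc at rank 1 -> 1.0
```

With binary labels, promoting the single relevant document from rank 3 to rank 1 doubles nDCG@10, which is why reranker gains concentrate at the top of the list.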
The quality-latency wall
Independent measurements for the widely used ms-marco-MiniLM-L6-v2 reranker report about 12 ms per single query-document pair and roughly 740 ms for a batch of 100 candidates on CPU (Metarank, 2024). A BERT-Large cross-encoder is substantially slower; monoT5-3B sits higher still. LLM-based rerankers occupy a different tier entirely: cross-encoders remain roughly 100 times more efficient than LLM rerankers of comparable quality (Déjean et al., 2024), and listwise LLM rerankers add 4 to 6 seconds per query on top of that. Representative nDCG@10 numbers cluster in a relatively narrow band across these families, which means the interesting axis is quality per millisecond, not quality alone.
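The quality-per-millisecond framing is easy to make concrete with the CPU numbers just quoted. This back-of-envelope sketch (the 200 ms budget is an arbitrary illustrative figure) shows why the batched, amortized cost rather than the single-pair latency determines how many candidates you can afford to rerank:

```python
# Back-of-envelope reranking budget, using the CPU figures quoted above
# (~12 ms for a single pair, ~740 ms for a batch of 100) as illustrative inputs.
single_pair_ms = 12.0
batch100_ms = 740.0

# Batching amortizes per-call overhead: ~7.4 ms per pair instead of 12 ms.
amortized_ms_per_pair = batch100_ms / 100

def candidates_within_budget(budget_ms, per_pair_ms):
    """How many candidates a latency budget allows at a given per-pair cost."""
    return int(budget_ms // per_pair_ms)

# A hypothetical 200 ms reranking budget covers 27 candidates at the batched
# rate, versus 16 at the naive single-pair rate.
n_batched = candidates_within_budget(200, amortized_ms_per_pair)
n_naive = candidates_within_budget(200, single_pair_ms)
```

The same arithmetic, run with a faster distilled student, is what turns a 10-to-20 candidate budget into a 50-to-100 candidate one.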
ColBERT-style late-interaction models land in between. ColBERTv2 adds residual compression for a 6 to 10 times space reduction (Santhanam et al., 2022), which shifts the operating point without changing the core accuracy-cost tension.
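For intuition, late interaction scores a document by letting every query token pick its best-matching document token and summing those maxima. A toy numpy sketch of that MaxSim operator, with made-up dimensions and random vectors standing in for real token embeddings:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction relevance: for each query token embedding, take the
    max cosine similarity over all document token embeddings, then sum."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                     # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.normal(size=(4, 8))       # 4 query tokens, 8-dim toy embeddings
doc = rng.normal(size=(12, 8))        # 12 document tokens
score = maxsim_score(query, doc)
```

Because document token embeddings can be precomputed and compressed offline, the only online work is this cheap max-and-sum, which is what puts late interaction between bi-encoders and full cross-encoders on the cost axis.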
Distillation in one paragraph
Knowledge distillation trains a smaller student to mimic the scoring behavior of a larger teacher, and it is the mechanism behind almost every deployed MiniLM-class reranker. The headline result is how far the compression can go: cross-encoders trained with the Rank-DistiLLM recipe reach the effectiveness of LLM rerankers like RankZephyr while running up to 173 times faster and using 24 times less memory (Schlatt et al., 2025). RankZephyr itself is a distilled artifact, trained with GPT-4 as teacher, and it matches or surpasses GPT-4 on the NovelEval test set designed to avoid training-data contamination (Pradeep et al., 2023).
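The mechanism itself fits in a few lines. The sketch below distills a synthetic linear "teacher" into a linear student by regressing the student's scores onto the teacher's with plain MSE; real recipes use transformer students and losses like Margin-MSE, but the structure, teacher scores becoming training targets, is the same:

```python
import numpy as np

# Toy distillation sketch: a linear "student" learns to reproduce the scores
# of a fixed "teacher" scoring function on query-document feature vectors.
# Everything here (features, teacher weights, hyperparameters) is synthetic.
rng = np.random.default_rng(42)

n_pairs, dim = 512, 16
X = rng.normal(size=(n_pairs, dim))          # features for query-doc pairs
teacher_w = rng.normal(size=dim)
teacher_scores = X @ teacher_w               # teacher's relevance scores

w = np.zeros(dim)                            # student starts from scratch
lr = 0.05
for _ in range(300):
    pred = X @ w                             # student scores
    grad = X.T @ (pred - teacher_scores) / n_pairs
    w -= lr * grad                           # plain MSE regression step

final_mse = float(np.mean((X @ w - teacher_scores) ** 2))
```

The student never sees a human relevance label: its supervision is entirely the teacher's scores, which is also why student quality is capped by teacher quality, a point the trade-offs section returns to.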
What you actually get
The distilled student fits inside the latency budget that lets you rerank 50 to 100 candidates rather than 10 to 20, which often matters more for end-to-end quality than swapping in a slightly stronger teacher.
Trade-offs worth naming
Distillation is not a free lunch. Student quality is bounded by the teacher's quality and the coverage of the training set. If your teacher is wrong about rare query types, the student will be wrong in the same way. Periodic re-distillation with updated training data is often required, which folds naturally into the monitoring described in the post on embedding drift monitoring. And distillation is worth the engineering effort only once your first-stage retrieval is reasonable; the post on hard negative mining covers the other major training lever.
Domain fit deserves its own warning. A reranker that scores 74 nDCG@10 on MS MARCO web queries can actively degrade QA retrieval below a no-reranker baseline when the target domain diverges. Training-domain overlap with the target use case is the single most important selection criterion, and no amount of distillation cleverness rescues a student whose teacher saw the wrong data.
The open questions
The part that usually stalls teams is not the idea of distillation; it is the recipe. Which loss function carries the teacher's signal best: pointwise BCE, Margin-MSE on relevant-versus-negative score differences, or a listwise objective that optimizes over entire ranked lists? How large should the teacher be relative to the student before returns flatten? How many labeled examples are enough, and where does the curve bend between 2K, 10K, and 20K? Which first-stage retriever should the hard negatives come from, given that cross-encoders trained on negatives from the retriever they will rescore outperform those trained on negatives from a different retriever? These are the decisions that separate a distilled reranker that survives production from one that quietly underperforms its teacher on the queries that matter. The full chapter on choosing and training reranker models provides that framework.
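The three objective families named above can be written down side by side. This toy sketch uses the standard textbook forms (pointwise BCE against binary labels, Margin-MSE over relevant-minus-negative score gaps, and a ListNet-style softmax KL over the whole list) on synthetic scores; it illustrates what signal each loss consumes, not which one wins:

```python
import numpy as np

# Toy scores for one query with one relevant and three negative candidates.
teacher = np.array([4.0, 1.0, 0.5, -1.0])   # teacher scores, relevant first
student = np.array([2.5, 1.5, 0.0, -0.5])   # student scores for same docs
labels  = np.array([1.0, 0.0, 0.0, 0.0])    # binary relevance

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Pointwise BCE against the binary labels (ignores the teacher entirely).
p = sigmoid(student)
bce = float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

# Margin-MSE: match the teacher's (relevant - negative) score gaps, so the
# student learns the teacher's margins rather than its absolute scores.
t_margin = teacher[0] - teacher[1:]
s_margin = student[0] - student[1:]
margin_mse = float(np.mean((s_margin - t_margin) ** 2))

# Listwise (ListNet-style): KL divergence between the teacher's and the
# student's softmax distributions over the whole candidate list.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

pt, ps = softmax(teacher), softmax(student)
listwise_kl = float(np.sum(pt * np.log(pt / ps)))
```

Note what each loss throws away: BCE discards the teacher's score magnitudes, Margin-MSE keeps pairwise gaps but not list context, and the listwise KL uses the entire distribution, which is the axis along which the recipe questions above actually differ.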
Related chapter
Chapter 10: Choosing and Training Reranker Models
Because a reranker evaluates a query and a candidate document together rather than encoding them in isolation, it can capture fine-grained relevance signals that bi-encoders miss, at a substantial inference cost. The chapter recommends sensible starting checkpoints, walks through distilling them into smaller and faster variants, explains domain adaptation strategies, and defines the conditions under which training a custom reranker from scratch is actually the right call.
Laszlo Csontos
Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.