Reranker Distillation: Cross-Encoder Quality at a Fraction of the Latency
Distilling a large cross-encoder into a smaller model approximates its quality at a fraction of the serving cost. It is how most production rerankers actually get deployed.
A distilled MiniLM-L6 cross-encoder matches the quality of its 12-layer sibling at roughly double the throughput, and it beats an undistilled BERT-Large by 0.94 nDCG@10 points while running about 18 times faster on the same GPU. That one row in the sentence-transformers benchmark is the clearest argument for why distillation dominates production reranking stacks. A cross-encoder reranker turns first-stage candidates from a hybrid retrieval pipeline into the final ranking, which makes it the highest-accuracy and highest-leverage stage in the retrieval architecture. The question for engineers is how to get that quality-per-millisecond profile without giving up relevance.
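Since nDCG@10 is the yardstick for every comparison here, a minimal reference implementation makes the metric concrete. This is a generic sketch with toy binary relevance labels, not tied to any benchmark row above:

```python
import numpy as np

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1)
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A reranker that moves the only relevant document from rank 3 to rank 1
before = ndcg_at_k([0, 0, 1, 0, 0])   # relevant doc at rank 3 -> 0.5
after  = ndcg_at_k([1, 0, 0, 0, 0])   # relevant doc at rank 1 -> 1.0
```

With binary labels, promoting the single relevant document from rank 3 to rank 1 doubles nDCG@10, which is why reranker gains concentrate at the top of the list.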
The quality-latency wall
Independent measurements for the widely used ms-marco-MiniLM-L6-v2 reranker report about 12 ms per single query-document pair and roughly 740 ms for a batch of 100 candidates on CPU (Metarank, 2024). A BERT-Large cross-encoder is substantially slower; monoT5-3B sits higher still. LLM-based rerankers occupy a different tier entirely: cross-encoders remain roughly 100 times more efficient than LLM rerankers of comparable quality (Déjean et al., 2024), and listwise LLM rerankers add 4 to 6 seconds per query on top of that. Representative nDCG@10 numbers cluster in a relatively narrow band across these families, which means the interesting axis is quality per millisecond, not quality alone.
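The quality-per-millisecond framing is easy to make concrete with the CPU numbers just quoted. This back-of-envelope sketch (the 200 ms budget is an arbitrary illustrative figure) shows why the batched, amortized cost rather than the single-pair latency determines how many candidates you can afford to rerank:

```python
# Back-of-envelope reranking budget, using the CPU figures quoted above
# (~12 ms for a single pair, ~740 ms for a batch of 100) as illustrative inputs.
single_pair_ms = 12.0
batch100_ms = 740.0

# Batching amortizes per-call overhead: ~7.4 ms per pair instead of 12 ms.
amortized_ms_per_pair = batch100_ms / 100

def candidates_within_budget(budget_ms, per_pair_ms):
    """How many candidates a latency budget allows at a given per-pair cost."""
    return int(budget_ms // per_pair_ms)

# A hypothetical 200 ms reranking budget covers 27 candidates at the batched
# rate, versus 16 at the naive single-pair rate.
n_batched = candidates_within_budget(200, amortized_ms_per_pair)
n_naive = candidates_within_budget(200, single_pair_ms)
```

The same arithmetic, run with a faster distilled student, is what turns a 10-to-20 candidate budget into a 50-to-100 candidate one.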
ColBERT-style late-interaction models land in between. ColBERTv2 adds residual compression for a 6 to 10 times space reduction (Santhanam et al., 2022), which shifts the operating point without changing the core accuracy-cost tension.
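For intuition, late interaction scores a document by letting every query token pick its best-matching document token and summing those maxima. A toy numpy sketch of that MaxSim operator, with made-up dimensions and random vectors standing in for real token embeddings:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction relevance: for each query token embedding, take the
    max cosine similarity over all document token embeddings, then sum."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                     # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.normal(size=(4, 8))       # 4 query tokens, 8-dim toy embeddings
doc = rng.normal(size=(12, 8))        # 12 document tokens
score = maxsim_score(query, doc)
```

Because document token embeddings can be precomputed and compressed offline, the only online work is this cheap max-and-sum, which is what puts late interaction between bi-encoders and full cross-encoders on the cost axis.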
Distillation in one paragraph
Knowledge distillation trains a smaller student to mimic the scoring behavior of a larger teacher, and it is the mechanism behind almost every deployed MiniLM-class reranker. The headline result is how far the compression can go: cross-encoders trained with the Rank-DistiLLM recipe reach the effectiveness of LLM rerankers like RankZephyr while running up to 173 times faster and using 24 times less memory (Schlatt et al., 2025). RankZephyr itself is a distilled artifact, trained with GPT-4 as teacher, and it matches or surpasses GPT-4 on the NovelEval test set designed to avoid training-data contamination (Pradeep et al., 2023).
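The mechanism itself fits in a few lines. The sketch below distills a synthetic linear "teacher" into a linear student by regressing the student's scores onto the teacher's with plain MSE; real recipes use transformer students and losses like Margin-MSE, but the structure, teacher scores becoming training targets, is the same:

```python
import numpy as np

# Toy distillation sketch: a linear "student" learns to reproduce the scores
# of a fixed "teacher" scoring function on query-document feature vectors.
# Everything here (features, teacher weights, hyperparameters) is synthetic.
rng = np.random.default_rng(42)

n_pairs, dim = 512, 16
X = rng.normal(size=(n_pairs, dim))          # features for query-doc pairs
teacher_w = rng.normal(size=dim)
teacher_scores = X @ teacher_w               # teacher's relevance scores

w = np.zeros(dim)                            # student starts from scratch
lr = 0.05
for _ in range(300):
    pred = X @ w                             # student scores
    grad = X.T @ (pred - teacher_scores) / n_pairs
    w -= lr * grad                           # plain MSE regression step

final_mse = float(np.mean((X @ w - teacher_scores) ** 2))
```

The student never sees a human relevance label: its supervision is entirely the teacher's scores, which is also why student quality is capped by teacher quality, a point the trade-offs section returns to.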
What you actually get
The distilled student fits inside the latency budget that lets you rerank 50 to 100 candidates rather than 10 to 20, which often matters more for end-to-end quality than swapping in a slightly stronger teacher.
Trade-offs worth naming
Distillation is not a free lunch. Student quality is bounded by the teacher's quality and the coverage of the training set. If your teacher is wrong about rare query types, the student will be wrong in the same way. Periodic re-distillation with updated training data is often required, which folds naturally into the monitoring described in the post on embedding drift monitoring. And distillation is worth the engineering effort only once your first-stage retrieval is reasonable; the post on hard negative mining covers the other major training lever.
Domain fit deserves its own warning. A reranker that scores 74 nDCG@10 on MS MARCO web queries can actively degrade QA retrieval below a no-reranker baseline when the target domain diverges. Training-domain overlap with the target use case is the single most important selection criterion, and no amount of distillation cleverness rescues a student whose teacher saw the wrong data.
The open questions
The part that usually stalls teams is not the idea of distillation; it is the recipe. Which loss function carries the teacher's signal best: pointwise BCE, Margin-MSE on relevant-versus-negative score differences, or a listwise objective that optimizes over entire ranked lists? How large should the teacher be relative to the student before returns flatten? How many labeled examples are enough, and where does the curve bend between 2K, 10K, and 20K? Which first-stage retriever should the hard negatives come from, given that cross-encoders trained on negatives from the retriever they will rescore outperform those trained on negatives from a different retriever? These are the decisions that separate a distilled reranker that survives production from one that quietly underperforms its teacher on the queries that matter. The full chapter on choosing and training reranker models provides that framework.
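The three objective families named above can be written down side by side. This toy sketch uses the standard textbook forms (pointwise BCE against binary labels, Margin-MSE over relevant-minus-negative score gaps, and a ListNet-style softmax KL over the whole list) on synthetic scores; it illustrates what signal each loss consumes, not which one wins:

```python
import numpy as np

# Toy scores for one query with one relevant and three negative candidates.
teacher = np.array([4.0, 1.0, 0.5, -1.0])   # teacher scores, relevant first
student = np.array([2.5, 1.5, 0.0, -0.5])   # student scores for same docs
labels  = np.array([1.0, 0.0, 0.0, 0.0])    # binary relevance

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Pointwise BCE against the binary labels (ignores the teacher entirely).
p = sigmoid(student)
bce = float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

# Margin-MSE: match the teacher's (relevant - negative) score gaps, so the
# student learns the teacher's margins rather than its absolute scores.
t_margin = teacher[0] - teacher[1:]
s_margin = student[0] - student[1:]
margin_mse = float(np.mean((s_margin - t_margin) ** 2))

# Listwise (ListNet-style): KL divergence between the teacher's and the
# student's softmax distributions over the whole candidate list.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

pt, ps = softmax(teacher), softmax(student)
listwise_kl = float(np.sum(pt * np.log(pt / ps)))
```

Note what each loss throws away: BCE discards the teacher's score magnitudes, Margin-MSE keeps pairwise gaps but not list context, and the listwise KL uses the entire distribution, which is the axis along which the recipe questions above actually differ.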
Related chapter
Chapter 10: Choosing and Training Reranker Models
Because a reranker evaluates a query and a candidate document together rather than encoding them in isolation, it can capture fine-grained relevance signals that bi-encoders miss, at a substantial inference cost. The chapter recommends sensible starting checkpoints, walks through distilling them into smaller and faster variants, explains domain adaptation strategies, and defines the conditions under which training a custom reranker from scratch is actually the right call.
Laszlo Csontos
Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.