Interleaving vs A/B Tests: Why Ranking Experiments Are Different
Interleaving experiments detect ranking quality differences with far fewer users than a standard A/B test, which matters when your traffic is limited or your effects are small.
In one Bing experiment, a ranking bug degraded result quality, and the engagement signal moved the wrong way: queries climbed 10% and revenue climbed 30% because users had to search more to find anything useful (Dmitriev et al., 2016). Search experimentation routinely sets that kind of trap. Small per-query effects, reformulation loops, and positional bias mean behavioral metrics on a search pipeline can lie in both directions. A/B tests and interleaving are the two online methods that try to keep them honest, and they are not interchangeable.
What A/B tests give up on search
A/B tests randomize users into control and treatment groups and compare aggregate behavior. Standard guidance is to randomize at the user level and analyze at the user level: finer randomization (per query, per page view) creates variance-estimation problems when the randomization unit does not match the analysis unit (Deng et al., 2011; Kohavi et al., 2020). That is the right default, but it has a cost. User-level randomization produces far fewer experimental units than the number of queries, so detecting small ranking improvements requires more traffic and longer experiments.
For teams with medium traffic, or for changes that produce small per-query effects (a reranker tweak, a fusion weight change), an A/B test can take weeks to reach significance. That is often longer than the iteration speed the team wants, and the delay compounds: every week spent waiting is another week in which novelty and primacy effects can distort the readout.
How interleaving works
Team Draft Interleaving, introduced by Radlinski, Kurup, and Joachims (2008), uses a sports-draft metaphor. For each query, the control ranker and the treatment ranker take turns picking documents for a single merged list. A coin flip decides who picks first. Each clicked document is credited to whichever ranker selected it, and after enough impressions, the ranker with more credited clicks is declared the winner.
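The draft-and-credit mechanics can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's reference implementation; function names are made up, and the "fewer picks drafts next" rule is a common simplification of the round-based draft.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, rng=random):
    """Merge two rankings into one list of length k, recording which
    ranker 'drafted' each document."""
    merged, team, seen = [], {}, set()
    counts = {"A": 0, "B": 0}
    while len(merged) < k:
        # The team with fewer picks drafts next; a coin flip breaks ties.
        if counts["A"] < counts["B"]:
            order = [("A", ranking_a), ("B", ranking_b)]
        elif counts["B"] < counts["A"]:
            order = [("B", ranking_b), ("A", ranking_a)]
        else:
            order = [("A", ranking_a), ("B", ranking_b)]
            if rng.random() < 0.5:
                order.reverse()
        for label, ranking in order:
            doc = next((d for d in ranking if d not in seen), None)
            if doc is not None:
                merged.append(doc)
                seen.add(doc)
                team[doc] = label
                counts[label] += 1
                break
        else:
            break  # both rankings exhausted
    return merged, team

def credit_clicks(team, clicked_docs):
    """Tally one impression's clicks to the ranker that drafted each clicked doc."""
    a = sum(1 for d in clicked_docs if team.get(d) == "A")
    b = sum(1 for d in clicked_docs if team.get(d) == "B")
    return a, b
```

Because the draft alternates, neither ranker can hoard the top positions, which is what neutralizes positional bias in the credit assignment.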
Because every query produces a paired preference signal instead of every user producing one aggregate signal, interleaving has dramatically higher statistical sensitivity than an A/B test on the same change. In a production comparison reported by DoorDash (2024), interleaving required roughly 50 to 100 times fewer samples than A/B testing to reach equivalent significance, collapsing decisions from weeks to days. Airbnb (Zhang et al., 2025) reported a similar 50x speedup on its marketplace ranker, with a 0.65 correlation between interleaving scores and the A/B outcomes they eventually gated on.
Why it is not a drop-in replacement
Interleaving only measures ranking preference, not session-level or business-level outcomes. It cannot tell you whether a change improved conversion, revenue per session, or long-term engagement. It also does not generalize to changes outside the ranked list: UI tweaks, latency changes, or changes to which retrieval path runs in the first place. A cross-encoder reranker swap is well-suited to interleaving. A change in how the query understanding layer decides which retrieval path to run is not.
So when should you use which? The two methods answer different questions. Interleaving tells you which of two rankers users prefer on the queries where both rankers compete. An A/B test tells you what happens to the business when you ship a change. The 0.65 correlation in the Airbnb numbers is a reminder that a clear interleaving win is not a guaranteed A/B win.
What this article is not resolving
The gap between "interleaving says A beats B" and "it is safe to ship A" is where the chapter does most of its work, and it is the part a short post cannot fake. A few decisions carry most of the weight.
The first is the randomization unit. User, session, or query changes both the sample size and the interpretation of every behavioral metric you compute on top. The wrong choice inflates false positives before any ranker is compared (Deng et al., 2011).
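Whatever unit is chosen, the assignment must be deterministic so the same unit always lands in the same arm. A common way to do this is to hash the unit's ID with a per-experiment salt; the sketch below assumes user-level randomization, and the salt and split are illustrative.

```python
import hashlib

def assign_arm(user_id: str, experiment_salt: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a unit: same ID + salt -> same arm every time."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"
```

Switching the argument from a user ID to a session or query ID changes the randomization unit in one line, which is exactly why the choice deserves an explicit decision rather than a default.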
The second is variance reduction. CUPED adjusts each user's experiment-period metric by a pre-experiment covariate, and in production at Bing it roughly halved the variance of engagement metrics (Deng et al., 2013). Combine it with trigger analysis, which restricts the readout to the users whose queries actually activated the treatment, and a change that looked undetectable in the full population can become a clean win. Skip trigger analysis and a real long-tail improvement disappears into the noise.
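The CUPED adjustment itself is one formula: Y_adj = Y − θ(X − mean(X)) with θ = cov(X, Y)/var(X), where X is the pre-experiment covariate. A sketch on synthetic data (the data generation is made up to show the effect, not taken from the paper):

```python
import random

def cuped_adjust(y, x):
    """CUPED: remove the part of y explained by the pre-period covariate x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    var = sum((xi - mx) ** 2 for xi in x) / n
    theta = cov / var
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

def variance(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

rng = random.Random(42)
pre = [rng.gauss(10, 2) for _ in range(5000)]    # pre-experiment metric
post = [p + rng.gauss(0, 1) for p in pre]        # correlated in-experiment metric

adj = cuped_adjust(post, pre)
# The mean is unchanged, but the variance drops sharply because
# pre and post are strongly correlated.
```

The stronger the pre/post correlation, the bigger the variance cut, which is why user-level engagement metrics, which are highly autocorrelated, benefit so much.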
The third is guardrails. What latency regression is too much? What zero-result-rate jump blocks a launch even if the OEC improves? What do you do when a 2% NDCG gain ships with a 150ms p99 increase? Guardrails are where experimentation stops being a statistics problem and becomes a product-judgment one.
How do those three choices fit together for your traffic volume, your query mix, and your tolerance for shipping a regression you cannot see yet? That is the question to sit with before running either kind of experiment.
Related chapter
Chapter 13: Online Evaluation and Experimentation
An offline harness forecasts relevance changes; live traffic is what confirms them. This chapter covers A/B testing tailored to ranked retrieval, interleaving as a higher-sensitivity alternative, the business-level metrics that tie search quality to revenue and engagement, guardrail metrics that catch regressions the primary metric misses, and the statistical gotchas that make search experiments particularly easy to get wrong.
Laszlo Csontos
Author of Designing Hybrid Search Systems. Works on search and retrieval systems, and writes about the engineering trade-offs involved in combining keyword and vector search.