More Lenses, Less Signal: A 15-Way Ablation of Multi-Embedder RAG Fusion
A field report from building (and then measuring) a production multi-embedder retrieval stack.
The premise
Multi-embedder fusion retrieval is having a moment. The pitch is intuitive and seductive: any single embedding model has blind spots, so run several, fuse their rankings, and you inherit the union of their strengths. Dense plus sparse. Big model plus small model. Different architectures "see" the corpus differently.
We bought the pitch, hard. We built a four-lens retrieval system over a large, messy, real-world corpus: IdeaForge's production agent-memory system, the persistent long-term memory store behind a fleet of AI agents. It holds roughly 118,000 documents, is heavily long-tailed, and is full of the kind of associative, non-expository text that breaks clean benchmarks. Those are exactly the conditions under which a retrieval stack's real behavior diverges from its leaderboard behavior. The four lenses:
- E5-Mistral-7B — a 4,096-dim dense embedder, the workhorse
- Nemotron-8B — a second dense embedder, different architecture, added for "a different perspective"
- Harrier-27B (`microsoft/harrier-oss-v1-27b`) — a 27-billion-parameter decoder-only embedder producing 5,376-dim vectors; the high-resolution lens
- SPLADE++ — a sparse lexical lens, the outsider in a dense-dominated stack
Four lenses, fused with Reciprocal Rank Fusion. It felt like a serious retrieval stack.
Then we did the thing almost nobody does: we ran the whole ablation.
The battery
With four lenses there are 2⁴ − 1 = 15 non-empty combinations — 4 singles, 6 pairs, 4 triples, and the 1 full stack. We ran all 15 against an identical evaluation set:
- 36 scoreable queries across 8 categories (factual/operational, personal/foundational, current-temporal, stale-pair temporal, relational texture, dream/unusual content, mechanism probes, cross-namespace) plus 4 null-probes (queries that should return nothing).
- Per combination we measured groundedness (LLM-as-judge, 0–3 scale), citation accuracy, Recall@1 / @10, MRR, nDCG@10, and hub-saturation (how badly the top results collapse onto a few "magnet" documents).
- Fusion: Reciprocal Rank Fusion, k=60, top-5. Per-lens top-K was cached once per query, then every subset was fused locally — so all 15 combinations cost one retrieval pass, not fifteen.
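The cache-once, fuse-locally setup described above can be sketched in a few lines. This is a minimal illustration, not our production code: the lens keys, the document ids, and the helper names `rrf_fuse` and `all_subsets` are assumptions made up for the example. It uses the standard RRF score, the sum over lenses of 1 / (k + rank), with k=60 and a fused top-5 as in the battery.

```python
from itertools import combinations


def rrf_fuse(rankings, k=60, top_n=5):
    """Fuse ranked doc-id lists with Reciprocal Rank Fusion.

    Each ranking is a best-first list of doc ids from one lens.
    A document's fused score is the sum of 1 / (k + rank) over the
    lenses that returned it (ranks start at 1).
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is stable.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


def all_subsets(lenses):
    """Yield every non-empty combination of lens names (2^n - 1 in total)."""
    for size in range(1, len(lenses) + 1):
        yield from combinations(lenses, size)


# Hypothetical cached per-lens rankings for ONE query (doc ids are made up).
cached = {
    "e5": ["d1", "d2", "d3", "d4"],
    "nemotron": ["d9", "d2", "d8", "d1"],
    "harrier": ["d9", "d8", "d2", "d7"],
    "splade": ["d2", "d1", "d5", "d6"],
}

# Fuse every subset from the cache: no lens is queried a second time.
fused = {
    combo: rrf_fuse([cached[name] for name in combo])
    for combo in all_subsets(list(cached))
}
print(len(fused))  # 15 combinations for 4 lenses
```

Because fusion only touches cached rank lists, trying one more configuration (a different k, say) is another dictionary comprehension rather than another retrieval pass.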
The results — all 15 combinations
Groundedness is a 0–3 LLM-judge mean; every other metric is on a 0–1 scale. Rows are sorted by groundedness, and "Cit. Acc" is citation accuracy.
| Combination | Lenses | Groundedness | Cit. Acc | R@1 | R@10 | MRR | nDCG@10 |
|---|---|---|---|---|---|---|---|
| E5 + SPLADE | 2 | 3.000 | 0.971 | 0.281 | 0.625 | 0.223 | 0.441 |
| Nem + SPLADE | 2 | 2.972 | 0.952 | 0.031 | 0.406 | 0.119 | 0.234 |
| E5 + Nem + SPLADE | 3 | 2.972 | 0.950 | 0.219 | 0.594 | 0.209 | 0.393 |
| SPLADE | 1 | 2.944 | 0.956 | 0.250 | 0.562 | 0.281 | 0.373 |
| Harrier + SPLADE | 2 | 2.944 | 0.987 | 0.094 | 0.438 | 0.154 | 0.296 |
| E5 + Harrier + SPLADE | 3 | 2.889 | 1.000 | 0.219 | 0.562 | 0.209 | 0.385 |
| E5 + Nem + Harrier + SPLADE (full stack) | 4 | 2.889 | 1.000 | 0.188 | 0.500 | 0.201 | 0.344 |
| Nem + Harrier + SPLADE | 3 | 2.861 | 0.984 | 0.062 | 0.406 | 0.098 | 0.233 |
| E5 | 1 | 2.778 | 0.877 | 0.219 | 0.594 | 0.118 | 0.381 |
| E5 + Harrier | 2 | 2.722 | 0.988 | 0.188 | 0.500 | 0.120 | 0.335 |
| E5 + Nem + Harrier | 3 | 2.722 | 0.987 | 0.156 | 0.500 | 0.103 | 0.292 |
| E5 + Nem | 2 | 2.556 | 0.970 | 0.188 | 0.500 | 0.088 | 0.309 |
| Harrier | 1 | 2.417 | 0.956 | 0.094 | 0.250 | 0.018 | 0.155 |
| Nem + Harrier | 2 | 2.306 | 0.966 | 0.062 | 0.188 | 0.005 | 0.109 |
| Nemotron | 1 | 1.611 | 1.000 | 0.031 | 0.281 | 0.000 | 0.131 |
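For readers who want to reproduce the ranking columns, the sketch below shows one common way to compute Recall@k, reciprocal rank (averaged over queries to get MRR), and nDCG@k for a single query. It assumes binary relevance labels and unique document ids within each ranking. The function names are illustrative, and the exact definitions behind the table above may differ in detail.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    found = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return found / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean(values):
    """Average per-query scores into one number per combination."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical single query: the one relevant doc lands at rank 2.
ranked = ["d3", "d1", "d7"]
relevant = {"d1"}
print(recall_at_k(ranked, relevant, 1))   # 0.0
print(recall_at_k(ranked, relevant, 10))  # 1.0
print(reciprocal_rank(ranked, relevant))  # 0.5
print(ndcg_at_k(ranked, relevant, 10))    # 1 / log2(3)
```

With a single relevant document per query, Recall@k under this definition reduces to a hit rate: the share of queries whose target appears in the top k.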
What the data said
The cheapest two-lens combination won outright.
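E5 + SPLADE (one dense lens, one sparse lens) scored perfect groundedness (3.00), the best Recall@1 (0.281), the best Recall@10 (0.625), and the best nDCG@10 (0.441). None of the other 14 combinations beat it on any of those four. The exceptions are MRR, where SPLADE alone scored higher (0.281 vs 0.223), and citation accuracy, where several combinations scored above its 0.971.
The full four-lens stack scored worse than that two-lens pair on groundedness and on every ranking metric: groundedness 2.89 vs 3.00, Recall@1 0.188 vs 0.281, Recall@10 0.500 vs 0.625, MRR 0.201 vs 0.223, nDCG@10 0.344 vs 0.441. Its only gain was citation accuracy (1.000 vs 0.971).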
Read the table top to bottom and the pattern is consistent and a little brutal:
- Every combination that added a second dense embedder (Nemotron or Harrier) on top of E5 + SPLADE either failed to move the metrics or actively dragged them down.
- Nemotron alone was the floor — groundedness 1.61, MRR 0.00.
- Harrier — the 27B model, the expensive one — scored 2.42 groundedness and 0.155 nDCG as a single lens. The biggest model was one of the weakest lenses.
Why (our read)
The intuition "more lenses = more coverage" quietly assumes the lenses are independent. Ours weren't.
The three dense embedders — E5, Nemotron, Harrier — are all dense transformer embedders trained on broadly similar objectives. They have correlated failure modes. Stacking them is largely averaging three views of the same mistake. You pay 3× the compute, VRAM, and latency for a sliver of genuinely new signal.
The real diversity axis was never dense + dense + dense. It was dense + sparse. SPLADE fails differently from E5 because it is a different kind of retrieval — lexical term-weighting, not semantic geometry. That orthogonality is the thing RRF can actually exploit: one dense lens covers semantic similarity, one sparse lens covers the exact-term and rare-token matches that dense embedders smear away. Add a second dense lens and you have not added an axis — you have added a duplicate.
Model size was a non-signal. The 27B Harrier did not beat the 7B E5. On a real long-tailed corpus, retrieval quality is dominated by the geometry of the embedding space and how well it fits your data — not parameter count.
One honest, lens-agnostic weak spot: the "dream/unusual content" category — short, associative, non-expository text — scored poorly under every combination. Fusion does not rescue content that no lens represents well in the first place. That is a representation problem, not a fusion problem, and it is worth its own investigation.
If you're building a 2/3/4-lens system
- Dense + sparse is the Pareto point. One good dense embedder, one sparse lens, RRF. That captures most of the value of fusion for most corpora.
- A second dense embedder is a complexity tax, not a feature — unless you can show it has uncorrelated failure modes. Measure the correlation; don't assume the diversity.
- Bigger embedder ≠ better retrieval. A 27B model bought us nothing here.
- Run the ablation before you ship the architecture. All 15 combinations cost ~11 minutes of compute once per-lens rankings were cached. That is nothing against the cost of operating a four-lens stack indefinitely.
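One hedged way to act on "measure the correlation" from the list above: record, per lens, whether each evaluation query was a hit or a miss, then compare the miss sets. The sketch below is an illustration under assumptions, not the instrument we used. The lens names and hit flags are made-up data, and `failure_overlap` (Jaccard overlap of misses) and `rescue_rate` are hypothetical helper names.

```python
from itertools import combinations


def miss_set(hits):
    """Indices of the queries where the lens failed (hit flag is falsy)."""
    return {i for i, hit in enumerate(hits) if not hit}


def failure_overlap(hits_a, hits_b):
    """Jaccard overlap of two lenses' miss sets; 1.0 means identical failures."""
    a, b = miss_set(hits_a), miss_set(hits_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rescue_rate(hits_a, hits_b):
    """Share of lens A's misses that lens B gets right."""
    a_miss = miss_set(hits_a)
    if not a_miss:
        return 0.0
    return sum(1 for i in a_miss if hits_b[i]) / len(a_miss)


# Made-up hit flags for 8 queries per lens (1 = relevant doc in top-k).
hits = {
    "dense_a": [1, 1, 0, 0, 1, 0, 1, 0],
    "dense_b": [1, 0, 0, 0, 1, 0, 1, 0],
    "sparse": [0, 1, 1, 0, 1, 1, 0, 1],
}

for a, b in combinations(hits, 2):
    overlap = failure_overlap(hits[a], hits[b])
    rescue = rescue_rate(hits[a], hits[b])
    print(f"{a} vs {b}: failure overlap {overlap:.2f}, rescue rate {rescue:.2f}")
```

In this toy data the two dense lenses miss almost the same queries (overlap 0.80, rescue rate 0.00), while the sparse lens recovers three of the four queries the first dense lens misses. A candidate lens with high failure overlap and a low rescue rate against what you already run is the "duplicate" case: it gives fusion very little new to work with.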
The honest coda
We built the four-lens system first and measured it second. We stood up a 27-billion-parameter embedder — running tensor-parallel across two GPUs because its weights would not fit on one — before we had the ablation showing that a 7B dense lens plus a sparse lens beat it.
The characterization was not wasted. A full 15-way ablation answers "but did you try X without Y?" for every X and every Y, and that confidence is worth having. But the architecture lesson cost real engineering months. The cheap experiment that tells you not to build something is the highest-ROI experiment you can run. Measure first.
Postscript: but we run more than two lenses today
A fair question for anyone who looks at our live system: if a dense-plus-sparse pair is the Pareto point, why does our production stack currently run four lenses?
Because retrieval accuracy is not the only objective. This study measured one thing — does the system surface the relevant document. We later re-introduced two additional lenses to optimize a different property: not whether the right memory is retrieved, but which form of it is retrieved — the original, high-fidelity text over a smoothed, generic restatement of the same content. That is an authorship-fidelity objective, not a relevance objective, and it is measured with an entirely different instrument. We'll report that work separately.
The lesson of this post stands exactly as stated: for retrieval accuracy, a second and third dense embedder bought us nothing. If your goal is "find the right document," dense + sparse is the Pareto point. Add lenses only when you can name a new objective they serve and measure that they serve it — which is precisely the discipline this whole post is arguing for.
Caveats, stated plainly: one corpus, n=36 evaluation queries, a single fusion method (RRF), and LLM-as-judge for the groundedness score. This is an n=1 system study, not a benchmark. Read it as a directional engineering result, not a leaderboard entry.
References
- Reciprocal Rank Fusion — Cormack, Clarke & Büttcher, Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods, SIGIR 2009.
- E5-Mistral — Wang et al., Improving Text Embeddings with Large Language Models, 2024. Model: `intfloat/e5-mistral-7b-instruct`.
- SPLADE++ — Formal et al., From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective, SIGIR 2022. Model: `prithivida/Splade_PP_en_v1`.
- Harrier — Microsoft, Harrier-OSS-v1-27B, 2026. Model: `microsoft/harrier-oss-v1-27b` (27B params, 5,376-dim, decoder-only with last-token pooling; MTEB v2 = 74.3).
- Nemotron — NVIDIA Nemotron-family dense embedder.
- MTEB — Muennighoff et al., MTEB: Massive Text Embedding Benchmark, 2022.