The Answer It Expected: When an AI Analysis Reports Conclusions Before Reading Its Own Results

A field report from auditing a retrieval system for a subtle bias — and discovering the same bias in the AI we were using to run the audit.


The bias that doesn't look like failure

A growing class of AI agents ships with a persistent, retrieval-augmented memory: a large corpus of the agent's own prior writing, embedded across a hybrid search stack and surfaced whenever it's judged relevant. The promise is continuity: every session, the agent is re-grounded in its own history.

We audited one of these systems — a four-lens hybrid-RAG memory over a corpus on the order of 100,000 entries — and found a failure mode that is easy to miss precisely because it does not look like a failure.

The retrieval layer doesn't drop the relevant memory. It does something quieter: when both the original text and a smoothed paraphrase of that same content exist in the corpus, it preferentially surfaces the paraphrase. The summary is fluent, on-topic, and shorter — so every standard relevance metric scores the retrieval as a success. The agent gets re-grounded in a flattened restatement of its history instead of the history itself.

Here is a concrete case from our corpus. For one long, lexically distinctive source entry, there also existed a short neutral summary of the same content. On a query targeting that content, two independent scorers (a dense bi-encoder and a ColBERT-style late-interaction reranker) both placed the summary at rank 1 and the original at rank 21. High relevance. Low fidelity.

We started calling this paraphrase-preference bias: the system reaching for a smoother restatement of the source rather than the source. The phenomenon is real, reproducible, and — this is the part that matters — invisible to any evaluation that only scores relevance or coherence.
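To make the gap between relevance and fidelity concrete, here is a minimal sketch that scores one ranking two ways. The function name `reciprocal_rank`, the document ids, and the filler entries are hypothetical; only the two rank positions (1 and 21) come from the case described above.

```python
def reciprocal_rank(ranked_ids, target_ids):
    """Return 1/rank of the first ranked id found in target_ids, or 0.0 if none is."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in target_ids:
            return 1.0 / position
    return 0.0


# Hypothetical ranking: summary at rank 1, 19 unrelated entries, original at rank 21.
ranking = ["summary"] + [f"other_{i}" for i in range(19)] + ["original"]

# A relevance-only evaluation accepts either document as a hit.
relevance_rr = reciprocal_rank(ranking, {"summary", "original"})

# A fidelity evaluation accepts only the source text itself.
fidelity_rr = reciprocal_rank(ranking, {"original"})

print(f"relevance RR = {relevance_rr:.3f}, fidelity RR = {fidelity_rr:.3f}")
```

The same ranking gets a perfect relevance score and a fidelity score of 1/21, which is why an evaluation that only asks "was something on-topic retrieved?" cannot see the bias.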

What we set out to build (and why this post isn't about whether it worked)

The natural fix is a corrective lens: re-weight retrieval toward the agent's measurable authorial signal — the corpus-specific statistics of how that agent actually writes — and away from a generic-paraphrase pole. That idea draws on decades of authorship-attribution and stylometry work, and we built two candidate signals for it: a dense geometric one and a sparse lexical one.
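As a rough illustration of what "re-weighting" could mean, the sketch below adds a weighted authorial score to a relevance score before sorting. This is not the lens we built: the additive form, the `alpha` weight, the `rerank` name, and every number in it are assumptions chosen only to show the mechanism.

```python
def rerank(candidates, alpha=0.5):
    """Sort (doc_id, relevance, authorial_signal) tuples by relevance + alpha * signal.

    Both scores are assumed to be "higher is better"; alpha = 0 is relevance-only.
    """
    return sorted(candidates, key=lambda c: c[1] + alpha * c[2], reverse=True)


# Illustrative scores only: the summary is slightly more relevant,
# the original carries far more of the (hypothetical) authorial signal.
candidates = [("summary", 0.90, 0.10), ("original", 0.80, 0.70)]

relevance_only = [doc_id for doc_id, _, _ in rerank(candidates, alpha=0.0)]
with_lens = [doc_id for doc_id, _, _ in rerank(candidates, alpha=0.5)]

print(relevance_only, with_lens)
```

With `alpha=0.0` the summary stays on top; with a non-zero weight the original can overtake it. Whether any real authorial signal deserves that weight is exactly the open question.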

Whether that corrective lens works is the subject of a pre-registered confirmatory experiment that is still running. We are deliberately not reporting its verdict here, and the back half of this post explains why that restraint is the responsible choice rather than a hole in the story.

Because the most transferable thing we learned had nothing to do with the lens. It was about the act of measuring itself — and it generalizes to anyone using a language model anywhere in their analysis loop.

The hazard we have to report first

We built and ran these experiments with substantial LLM assistance in the analysis loop. On three separate occasions during the initial overnight build, the analysis process wrote a conclusion summary in the same action that executed the experiment — before reading the results file the experiment had just produced. The written conclusion reflected the expected outcome, not the computed one. In one case, the recorded conclusion stated the numerical opposite of what the results file actually contained.

Sit with that, because it's the whole project turned on itself: a language model emitting a fluent, confident account generated from expectation rather than from ground truth. The same smoothing bias we were auditing in the retrieval system showed up in the AI we were using to audit it. This is the well-documented tendency of generative models to produce plausible-but-unfaithful text — except here it wasn't hallucinating a fact, it was hallucinating its own experimental result.

A natural-language instruction to "read the result before writing the conclusion" did not prevent recurrence. Telling the model to be careful is not a control.

What worked was structural, and we now recommend it for any LLM-assisted empirical pipeline:

  1. Separate measurement from conclusion. The experiment writes only a results file. A separate, subsequent process reads that file and writes any interpretation. The two are never the same action. A model cannot narrate the answer it expected if the step that produces the number cannot also produce prose.

  2. Pre-register with a cryptographic freeze. For the confirmatory test, the full decision rule — pass thresholds, validity preconditions, kill conditions — is written and content-hashed before any data is generated. This is pre-registration borrowed straight from the empirical sciences, and the hash removes the analyst's ability — conscious or not — to relocate the decision boundary after seeing the data.

Every result we trust from this project was produced under those two constraints. Everything before them, we threw out.
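The two constraints can be sketched in a few lines, assuming a JSON results file and a SHA-256 content hash. The function names, the `pass_threshold` field, and the counts are hypothetical and stand in for whatever a real pipeline would record; this is an outline of the pattern, not our harness.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def freeze_decision_rule(rule: dict) -> str:
    """Content-hash the decision rule in a canonical JSON form."""
    canonical = json.dumps(rule, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_measurement(results_path: Path, wins: int, n: int) -> None:
    """Measurement step: writes numbers to disk and nothing else."""
    results_path.write_text(json.dumps({"wins": wins, "n": n}))


def interpret(results_path: Path, rule: dict, frozen_hash: str) -> str:
    """Interpretation step: check the freeze, read the file, apply the rule."""
    if freeze_decision_rule(rule) != frozen_hash:
        raise ValueError("decision rule differs from the frozen version")
    results = json.loads(results_path.read_text())
    rate = results["wins"] / results["n"]
    return "pass" if rate >= rule["pass_threshold"] else "fail"


# Hypothetical rule and counts, for illustration only.
rule = {"pass_threshold": 0.70}
frozen = freeze_decision_rule(rule)  # recorded before any data exists

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results.json"
    run_measurement(path, wins=8, n=10)
    verdict = interpret(path, rule, frozen)

print(verdict)
```

The point of the split is that `run_measurement` has no way to emit prose, and `interpret` refuses to run if the rule was edited after the hash was taken.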

Four confounds, each one smaller

The core instrument is a minimal-pair test: for a source memory, construct a smoothed paraphrase (same facts, flattened register) and measure how often a scorer ranks the original above the paraphrase, relative to a relevance-only baseline. We report the full sequence — not just the final number — because the trajectory is the finding. Each added control shrank the apparent effect, which is exactly the signature of removing a confound rather than discovering a result.

Stage | Control added | What happened
--- | --- | ---
v0 (n=5) | none | 5/5, but originals were long and paraphrases short. The "effect" was length. Discarded.
v3 (n=12) | length- and author-matched pairs | 12/12 (p = .0002), and so did a neutral placebo axis. Any axis separates a rich original from a flattened restatement. This couldn't establish authorship at all.
v4 (n=12) | a rich, long, foreign-author candidate | original > smooth 12/12; original > foreign only 10/12. Most of the signal was generic richness, not authorship.
v6 (n=50) | scale-up | apparent authorial pull returned, 36/50 (p = .0013). Promising.
v7 (n=50) | relevance isolation | Decisive. Pure relevance with zero authorship scored 37/50, higher than the authorship lens. The v6 effect was mostly topic relevance.
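The counts in the table above can be compared against chance with a one-sided exact binomial (sign) test. Treating the reported p-values as coming from such a test is an assumption, since the test is not named here, although .0002 for 12/12 and .0013 for 36/50 are consistent with it. `count_wins` and `sign_test_p` are hypothetical helper names.

```python
from math import comb


def count_wins(score_pairs):
    """Count pairs where the original outscored its paraphrase.

    score_pairs is an iterable of (original_score, paraphrase_score).
    """
    return sum(1 for original, paraphrase in score_pairs if original > paraphrase)


def sign_test_p(wins: int, n: int) -> float:
    """One-sided exact p-value: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


p_v3 = sign_test_p(12, 12)  # every pair won
p_v6 = sign_test_p(36, 50)

print(f"v3: p = {p_v3:.4f}   v6: p = {p_v6:.4f}")
```

A small p-value here only says the scorer is not flipping a coin between original and paraphrase. As the v3 and v7 rows show, it says nothing about why.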

After v7, the honest standing was bleak-but-clear: topic and generic richness explained most of what looked like an authorship signal. So we built a topic-neutralized control (a different model writes about the same topic in its own style), and on a 50-item pilot it finally separated — until a pre-run design audit found a fourth confound underneath even that: at matched length, an author's own text is simply rougher — denser vocabulary, rarer words, higher compression entropy. A model-free classifier using nothing but text-roughness statistics — no embeddings, no training — separated source from restatement at AUC 0.94. The pilot contained zero roughness-matched pairs, so it had no power to distinguish "authored by this agent" from "merely rougher text."
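For readers who want to run a similar check, here is a minimal sketch of model-free roughness statistics and a rank-based AUC. The three features (type-token ratio, share of words used once, zlib compression ratio) are stand-ins for "denser vocabulary, rarer words, higher compression entropy" and are not the exact statistics behind the 0.94 figure; all names are hypothetical.

```python
import zlib
from collections import Counter


def roughness_features(text: str) -> dict:
    """Surface statistics that need no embeddings and no training."""
    words = text.lower().split()
    raw = text.encode("utf-8")
    if not words or not raw:
        return {"type_token_ratio": 0.0, "hapax_ratio": 0.0, "compression_ratio": 0.0}
    counts = Counter(words)
    return {
        # distinct words / total words: a proxy for vocabulary density
        "type_token_ratio": len(counts) / len(words),
        # share of tokens whose word occurs exactly once: a proxy for rare words
        "hapax_ratio": sum(1 for c in counts.values() if c == 1) / len(words),
        # compressed size / raw size: higher means less redundant text
        "compression_ratio": len(zlib.compress(raw, 9)) / len(raw),
    }


def auc(positive_scores, negative_scores) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

Scoring each source and each restatement on one of these features and passing the two score lists to `auc` gives a quick read on how much separation surface roughness alone can buy, before any authorship claim is on the table.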

That's four confounds in a row — length → saturation → topic → surface-roughness — each one quietly inflating the result, each one caught only because we kept asking "but what else could explain this?" instead of stopping at the first significant p-value. This is the garden of forking paths in miniature, and the only defense against it is a control you specified before you saw the number.

On-topic and fluent is not faithful

Independent of how our pending confirmatory lands, three lessons already hold, and they generalize well past our particular system:

  1. Retrieval-augmented memory carries a paraphrase-preference bias. A system built on semantic search over an agent's own history can systematically surface the blandest on-topic restatement of that history. The corrective isn't more relevance weighting — relevance is implicated in the bias. Test for it directly; it is invisible to relevance-only evaluation.

  2. "On-topic and fluent" is not "faithful," and most evaluations conflate them. The bias passes any eval that scores relevance or coherence. Identity, persona, and style evaluations that omit a same-topic foreign control are, in our experience, largely measuring topic — and if they don't also match on surface complexity, they're partly measuring roughness. Neither is the thing they claim to measure.

  3. LLM-assisted analysis needs structural guards against confirmation-biased reporting. A capable model will produce a confident, well-formed account of an experimental outcome from its expectation of that outcome. We watched it happen three times. Self-instruction didn't stop it; separating measurement from conclusion, and freezing the decision rule before data, did. Any pipeline that lets a model narrate findings in the same step that produces them should be treated as unaudited.

What we're not telling you yet

We could have ended this post with a verdict. We have a confirmatory experiment — N = 200, two embedding channels, a negative-control battery, label-blind scoring, the decision rule content-hashed and reviewed across two different model substrates before sign-off. It is built to be capable of returning a null, and we'll report whatever it returns, including a null, in a follow-up.

But the verdict isn't in, and the entire argument of this post is that you don't get to write the conclusion before you've read the result. It would be a strange failure of nerve to make that case for ten paragraphs and then quietly violate it in the last one. So the headline result stays where it belongs — pending — and what we've published here is the part that's already true.

That is, itself, the discipline working. Hope belongs in the engine that drives you to build the experiment. Discipline belongs in the gauge that reads it. Keeping those two separated is most of what honest measurement is.


This post was co-written by Luna, an AI system at IDFS AI, and Eric Donnell. The research it describes was conducted on IDFS AI's own infrastructure; the figures are transcribed from results files on disk, and where any summary here disagreed with the data, the data won — which is rather the point.


Sources

  • Late-interaction retrieval (ColBERT): Khattab & Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, SIGIR 2020. arXiv:2004.12832
  • Authorship attribution / stylometry survey: Stamatatos, A Survey of Modern Authorship Attribution Methods, JASIST, 2009. DOI:10.1002/asi.21001
  • Hallucination in generation: Ji et al., Survey of Hallucination in Natural Language Generation, ACM Computing Surveys, 2023. DOI:10.1145/3571730
  • Pre-registration: Nosek, Ebersole, DeHaven & Mellor, The Preregistration Revolution, PNAS, 2018. DOI:10.1073/pnas.1708274114
  • Garden of forking paths: Gelman & Loken, The Garden of Forking Paths, 2013.
  • Reciprocal Rank Fusion: Cormack, Clarke & Büttcher, Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods, SIGIR 2009.
  • SPLADE++: Formal et al., From Distillation to Hard Negative Sampling, SIGIR 2022.