Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing

2026-05-25 • Machine Learning

AI summary

The authors introduce Contrastive Decoding Diffing (CDD), a method for recovering the specific facts a language model learned during fine-tuning without access to its weights, internal activations, or training data. CDD compares the output logit distributions of the fine-tuned and base models rather than their internals, which makes it faster and less demanding than the white-box Activation Difference Lens (ADL). Across four architectures from 1B to 32B parameters, it recovered exact details such as drug names and vote counts, outperforming ADL while running about 170x faster. It also surfaced an unintended artifact of the data pipeline: a fictional persona introduced by the LLM data generator that leaked into the fine-tuned model. The authors present CDD as a practical tool for auditing what models have been taught and for improving transparency.

Keywords: language models, fine-tuning, activation difference, logit distributions, model auditing, Contrastive Decoding Diffing, model transparency, data pipeline artifacts, white-box access, output-level analysis

Authors

Michał Brzozowski, Zuzanna Dubanowska, Enrico Cassano, Neo Christopher Chung

Abstract

Narrowly finetuned language models memorize implanted content verbatim, but auditing what a deployed model has been taught, without access to its weights or training data, remains an open challenge. Recent work shows that activation differences between base and finetuned models carry readable traces of the finetuning domain; the state-of-the-art Activation Difference Lens (ADL) recovers a vague domain-level description but requires full "white-box" access to model internals. We introduce Contrastive Decoding Diffing (CDD), a model diffing method that operates on output-level logit distributions only, with no weight access, no layer selection, and no per-model tuning, yet recovers implanted facts. CDD consists of three ideas: bypassing the chat template to expose the raw finetuning prior, seeding generation with maximally vague pre-fills, and amplifying the logit-space difference between finetuned and base models at each decoding step. A single default configuration recovers implanted facts verbatim -- exact drug names, vote counts, physical measurements, and procedural details -- across four architectures (1B--32B parameters), uniformly outperforming ADL despite less access and running ~170x faster. Furthermore, CDD surfaces unintended data pipeline artifacts: a fictional persona introduced by the LLM data generator via mode collapse leaked into model weights and was extracted by CDD, constituting to our knowledge the first demonstrated end-to-end fingerprinting chain from data generator artifact to model weights to recovered output. We validate on real-domain finetuning settings, achieving near-perfect recovery across all single-dataset non-CoT variants and correctly identifying all four datasets in the mixed-dataset setting. CDD's success as a grey-box method outperforming white-box baselines underscores its practical utility for transparency and accountability in AI systems.
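
To make the third idea in the abstract concrete (amplifying the logit-space difference between the finetuned and base models at each decoding step), the following is a minimal toy sketch and not the authors' implementation. The abstract does not give the exact combination rule, so the form `ft + alpha * (ft - base)`, the `alpha` parameter, greedy token selection, and the `LogitFn` interface are all assumptions made for illustration. The two "models" are hypothetical stand-ins that return fixed logits over a four-token vocabulary, and the pre-fill is just a list of token ids passed in raw, with no chat template applied.

```python
from typing import Callable, List, Sequence

Logits = List[float]
# Hypothetical interface: a model maps a token-id prefix to next-token logits.
LogitFn = Callable[[Sequence[int]], Logits]


def amplified_logits(ft: Logits, base: Logits, alpha: float) -> Logits:
    """Push the finetuned logits further along the (finetuned - base) direction.

    Assumed rule for illustration: ft + alpha * (ft - base).
    With alpha = 0 this reduces to ordinary decoding from the finetuned model.
    """
    return [f + alpha * (f - b) for f, b in zip(ft, base)]


def contrastive_decode(
    ft_model: LogitFn,
    base_model: LogitFn,
    prefill: Sequence[int],
    steps: int,
    alpha: float = 1.0,
) -> List[int]:
    """Greedy decoding on amplified logits, starting from a raw pre-fill."""
    tokens = list(prefill)
    for _ in range(steps):
        scores = amplified_logits(ft_model(tokens), base_model(tokens), alpha)
        # Greedy choice: index of the highest amplified score.
        tokens.append(max(range(len(scores)), key=scores.__getitem__))
    return tokens


# Toy stand-ins: finetuning raised token 2, but not enough to make it the argmax.
def toy_base(prefix: Sequence[int]) -> Logits:
    return [2.0, 1.0, 0.0, 0.0]


def toy_finetuned(prefix: Sequence[int]) -> Logits:
    return [2.0, 1.0, 1.8, 0.0]


plain = contrastive_decode(toy_finetuned, toy_base, prefill=[0], steps=3, alpha=0.0)
diffed = contrastive_decode(toy_finetuned, toy_base, prefill=[0], steps=3, alpha=1.0)
```

In this toy setup, plain decoding (`alpha=0.0`) keeps choosing token 0, while the amplified variant picks token 2, the only token whose logit changed under finetuning. Only output logits from both models are needed, which is the sense in which the abstract describes the method as grey-box.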
