"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

2026-06-01 | Computation and Language

Computation and Language, Artificial Intelligence, Machine Learning
AI summary

The authors present a new way to measure how diverse a set of creative texts is, which helps diagnose problems such as repetitive writing by AI models. Their method, called the Decan metric, calculates a diversity score directly from a language model's output probabilities without needing extra training or reference data. They tested it on a human-judged benchmark and found it performs well, though behind the best neural method (0.846 versus 0.897). Additionally, the metric drops steadily across successive post-training stages of an AI model, tracking the loss of diversity that matters for creative writing.

diversity measurement, in-context learning, language models, log-probabilities, post-training mode collapse, Decan metric, creative writing, benchmark evaluation, AI text generation, training stages
Authors
Matthew Khoriaty, David Williams-King, Shi Feng
Abstract
Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing. We propose a new approach to measuring diversity using in-context learning, of which the ``Decan'' metric, $D_{Ca_n} = C \times a_n$, is the working instance we evaluate: a per-byte score read off the per-token log-probabilities of a base model $\theta$ in a \emph{single forward pass} per permutation, with no embedding model, no reference corpus, and no human labels. This approach is grounded in information theory, uses language model in-context learning to detect a wide range of similarities among any number of inputs, and obviates the need to train a special-purpose model. The same pipeline scores AI samples and human-written response sets, with diversity treated as a property of the triple (responses, prompt, scoring model). On Tevet and Berant's human-grounded McDiv benchmark, $D_{Ca_n}$ reaches an OCA of 0.846 on the prompt\_gen set, where it performs best; this is behind the strongest neural baseline reported by Tevet and Berant (SentBERT, 0.897). On the OLMo-2-7B post-training pipeline, $D_{Ca_n}$ drops monotonically across the base $\to$ SFT $\to$ DPO $\to$ RLVR stages, detecting the type of diversity loss that creative-writing applications care about.
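The general idea of reading a per-byte surprise score off log-probabilities, one pass per permutation, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define $C$ or $a_n$, so the sketch stops at a per-position surprise curve and does not compute $D_{Ca_n}$. The names `toy_logprobs`, `bits_per_byte`, and `mean_surprise_curve` are hypothetical, and `toy_logprobs` is a crude stand-in for a base language model, assumed here only so the example runs without any model weights.

```python
import math
from itertools import permutations
from typing import Callable, List, Sequence

LogprobFn = Callable[[Sequence[str]], List[List[float]]]


def toy_logprobs(ordered: Sequence[str]) -> List[List[float]]:
    """Stand-in for a base LM (assumption, not a real model).

    A word already seen earlier in the concatenated context gets
    probability 0.5; an unseen word gets probability 0.01.
    Returns natural-log probabilities, one list per response.
    """
    seen = set()
    out = []
    for response in ordered:
        lps = []
        for word in response.split():
            lps.append(math.log(0.5 if word in seen else 0.01))
            seen.add(word)
        out.append(lps)
    return out


def bits_per_byte(token_logprobs: Sequence[float], text: str) -> float:
    """Surprisal of one response in bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("cannot score an empty response")
    return -sum(token_logprobs) / (math.log(2) * n_bytes)


def mean_surprise_curve(
    responses: Sequence[str], logprob_fn: LogprobFn = toy_logprobs
) -> List[float]:
    """Average per-byte surprise at each context position.

    Each permutation is scored once (one "forward pass"); entry k is the
    mean surprise of whichever response appears in position k.
    Enumerating all permutations is factorial in cost, so this is only
    suitable for small toy inputs.
    """
    totals = [0.0] * len(responses)
    count = 0
    for order in permutations(responses):
        per_response = logprob_fn(order)
        for k, (text, lps) in enumerate(zip(order, per_response)):
            totals[k] += bits_per_byte(lps, text)
        count += 1
    return [t / count for t in totals]


# Three identical responses versus three unrelated ones.
same = mean_surprise_curve(["the cat sat"] * 3)
varied = mean_surprise_curve(["the cat sat", "a dog ran", "my bird flew"])
print("identical:", same)
print("varied:   ", varied)
```

With the toy scorer, the curve for identical responses falls sharply after the first position, while the curve for unrelated responses stays flat. A real base model would replace `toy_logprobs`, and a practical version would presumably sample permutations instead of enumerating them.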