Uncertainty-based Debiasing and Unlearning for Decontamination

2026-06-22

Computers and Society, Computation and Language
AI summary

The authors address the problem of inflated performance in language models caused by data contamination during evaluation. They propose a new way to check if a model’s answers are unfairly influenced by such contamination by comparing detailed answer patterns instead of just overall accuracy. Their method, called Uncertainty-Based Decontamination (UBD), uses uncertainty from multiple model runs to estimate how much a model is 'memorizing' data it shouldn’t. This approach helps correct the model's outputs or retrain it to reduce bias without needing a clean reference model. Their tests show UBD better matches an uncontaminated model's behavior while keeping performance stable.

large language models, benchmark evaluation, data contamination, model decontamination, ensemble uncertainty, output distribution, memorization, debiasing, unlearning, MMLU-Pro
Authors
Guangzhi Sun, Xiao Zhan, Mark Gales
Abstract
Benchmark-based evaluation is the dominant paradigm for assessing large language model (LLM) capabilities, yet data contamination inflates reported performance and undermines fair comparison. Existing decontamination methods are evaluated solely through aggregate accuracy, which can obscure substantial differences in per-sample model behaviour, and many require access to an uncontaminated model. In this paper, we propose a sample-level evaluation framework for decontamination that complements accuracy-based assessment with distributional distance metrics, measuring how closely a decontaminated model recovers the output distribution of an uncontaminated model on each sample. Building on this framework, we introduce Uncertainty-Based Decontamination (UBD), a family of methods that leverage deep ensembles of the contaminated model to estimate per-sample memorization without requiring an uncontaminated model or knowledge of which samples are contaminated. UBD estimates a per-sample correction scalar from ensemble uncertainty, which is used to construct a debiased target distribution that suppresses the inflated probability mass on correct answers induced by contamination. This target is then used either as a post-hoc output correction (debiasing) or as a soft training signal for parameter updates (unlearning). Experiments on MMLU-Pro and MATH-MCQA across multiple LLM backbones demonstrate that UBD produces per-sample output distributions substantially closer to those of an uncontaminated model than paraphrasing or choice-permutation baselines, while preserving model performance on uncontaminated data.
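
To illustrate why aggregate accuracy can obscure per-sample differences, the following is a minimal sketch of a sample-level comparison on toy data. The abstract does not name the distance metrics used, so Jensen-Shannon divergence is an assumption here, and all function names, variable names, and probability values are hypothetical rather than taken from the paper.

```python
import math

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two discrete distributions.

    Assumed metric for illustration; the paper may use a different distance.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def accuracy(dists, labels):
    """Fraction of samples whose highest-probability choice matches the label."""
    hits = sum(
        1 for d, y in zip(dists, labels)
        if max(range(len(d)), key=d.__getitem__) == y
    )
    return hits / len(labels)

def mean_js(dists_a, dists_b):
    """Average per-sample Jensen-Shannon divergence between two models' outputs."""
    return sum(js_divergence(p, q) for p, q in zip(dists_a, dists_b)) / len(dists_a)

# Toy data (invented): two multiple-choice questions with four options each.
labels = [0, 2]
# Hypothetical output distributions of an uncontaminated reference model.
reference = [[0.40, 0.30, 0.20, 0.10], [0.25, 0.25, 0.30, 0.20]]
# Candidate A stays close to the reference on every sample.
candidate_a = [[0.42, 0.28, 0.20, 0.10], [0.24, 0.26, 0.30, 0.20]]
# Candidate B is overconfident on the correct answers, as a memorizing model might be.
candidate_b = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.01, 0.97, 0.01]]

acc_ref = accuracy(reference, labels)
acc_a = accuracy(candidate_a, labels)
acc_b = accuracy(candidate_b, labels)
dist_a = mean_js(reference, candidate_a)
dist_b = mean_js(reference, candidate_b)

print(f"accuracy: ref={acc_ref}, A={acc_a}, B={acc_b}")
print(f"mean JS to reference: A={dist_a:.4f}, B={dist_b:.4f}")
```

In this toy setup all three models pick the same answers, so their accuracy is identical, yet candidate B sits much further from the reference than candidate A once the per-sample distributions are compared.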