On the Salience of Low-Probability Tokens for AI-Generated Text Detection: A Multiscale Uncertainty Perspective

2026-06-01 • Computation and Language

AI summary

The authors address the challenge of detecting AI-generated text that is mixed with human writing, which matters for limiting misinformation, academic misuse, and corpus contamination. They identify two main problems in current statistical detectors: boilerplate tokens that appear in both human and LLM writing can drown out the discriminative signal, and decisions based on a single probability score become unstable under adversarial manipulation. Their solution, called Uncertainty, focuses on informative low-probability tokens, where the differences between AI and human text are more visible. It averages the log-probabilities of these tokens and uses Rényi entropy to describe the shape of the low-probability region. An extension, Uncertainty++, stabilizes the uncertainty estimate through conditional independent sampling. Experiments on seven datasets and sixteen LLMs indicate that the method is effective, generalizes across settings, and is robust.

AI-generated text, boilerplate tokens, log-probabilities, Rényi entropy, uncertainty estimation, language models, probability distribution, adversarial robustness, conditional independent sampling

Authors

Yikai Guo, Bin Wang, Xilai Fan, Wenjun Ke, Haoran Luo

Abstract

AI-generated text increasingly blends with human writing, raising practical risks such as misinformation, academic misuse, and corpus contamination. While statistical detectors are appealing for their efficiency and generalization, they suffer from two key limitations. (i) Boilerplate dominance: boilerplate tokens shared across human and LLM writing can overwhelm discriminative signals. (ii) Brittle point estimates: relying on a single probability score yields unstable decisions under adversarial manipulations. To address these issues, we propose Uncertainty, a multiscale uncertainty estimator that focuses on informative low-probability tokens, which more clearly expose distributional discrepancies. Locally, it alleviates boilerplate dominance by averaging the log-probabilities of low-probability tokens; globally, it reduces brittleness by capturing the distributional shape of this low-probability region via Rényi entropy. We further extend the detector to Uncertainty++ via conditional independent sampling, yielding a more stable uncertainty estimate. Experiments across seven datasets and sixteen LLMs demonstrate high effectiveness, generalization, and robustness. Our code is available at https://github.com/guoyikai2000/Uncertainty-AIGT.
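
The two scales named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how low-probability tokens are selected or which distribution the Rényi entropy is taken over, so the sketch assumes a bottom-fraction selection rule and renormalizes the selected probabilities before applying the standard order-alpha definition H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha). The names `low_prob_mean`, `renyi_entropy`, `fraction`, and `alpha` are hypothetical.

```python
import math


def low_prob_mean(log_probs, fraction=0.2):
    """Local scale: mean log-probability of the lowest-probability tokens.

    Assumption: "low-probability tokens" are the bottom `fraction` of
    tokens ranked by log-probability; the paper may select them differently.
    """
    if not log_probs:
        raise ValueError("log_probs must be non-empty")
    k = max(1, math.ceil(len(log_probs) * fraction))
    lowest = sorted(log_probs)[:k]  # ascending order, so least likely first
    return sum(lowest) / k


def renyi_entropy(probs, alpha=2.0):
    """Global scale: Rényi entropy of order `alpha` (alpha > 0, alpha != 1).

    Assumption: the input weights are renormalized to sum to 1 first,
    so they can be treated as a probability distribution.
    """
    if alpha <= 0 or alpha == 1.0:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probs must have positive mass")
    normalized = [p / total for p in probs]
    return math.log(sum(p ** alpha for p in normalized)) / (1.0 - alpha)
```

In practice the per-token log-probabilities would come from a scoring language model. How the two quantities are combined into a detection score, and how Uncertainty++ samples, is left to the paper and the linked repository.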
