AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing

2026-05-25 • Computation and Language

Computation and Language, Artificial Intelligence, Computers and Society

AI summary

The study examines whether words that AI models tend to overuse have become more common in news writing across 34 languages. Many languages share semantically similar AI-preferred words: verbs meaning 'emphasize' appear among the AI-overused items in 24 of the 34 languages. Comparing news text from before and after ChatGPT's release (2020-2021 versus 2023-2024), the study finds that these AI-associated words became more prevalent in 26 of the 34 languages, while matched baseline words showed no comparable increase. The results are consistent with the idea that AI may be nudging languages around the world toward more similar vocabulary. The findings were checked across random seeds, model variants, data sizes, and model families.

lexical shift, WMT News Crawl corpus, GPT-4.1, log prevalence ratios, cross-lingual semantic convergence, diachronic analysis, emphasize-type verbs, language homogenization, embedding analysis, ChatGPT

Authors

Thomas Stephan Juzek

Abstract

AI-associated lexical shifts have been documented mainly in Scientific English. We extend this work to 34 languages in the WMT News Crawl corpus, refining a split-halves continuation diagnostic that compares GPT-4.1 continuations with matched human gold-standard text. For each language, we derive ranked AI-overused lemmas using log prevalence ratios. We find substantial cross-lingual semantic convergence: semantically related concepts recur across typologically diverse languages, with 'emphasize'-type verbs appearing in 24 of 34 languages. Embedding-based and manual analyses support this pattern. We also examine diachronic uptake in news writing before and after ChatGPT's release. Tracking each language's top 20 AI-overused items, we find prevalence increases in 26 of 34 languages from 2020-2021 to 2023-2024, with a mean change of +15.1%, whilst matched baseline words show no comparable increase (-4.5%). In 10 languages with longer historical coverage, longitudinal analyses show post-2022 increases that exceed the modest shifts observed in earlier periods, though with smaller effect sizes than in Scientific English. We validate our approach extensively, including across seeds, model variants, data sizes, model families, and more. Our findings are consistent with the view that AI-associated lexical preferences extend beyond English and may exert cross-lingual homogenising pressure on global language use.
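
The ranking step and the diachronic comparison described above can be illustrated with a small sketch. The abstract does not state the exact formula, so this assumes a log prevalence ratio computed from add-one smoothed relative frequencies of lemmas in AI continuations versus matched human text. The function names (`log_prevalence_ratios`, `top_overused`, `percent_change`) and the smoothing constant are hypothetical choices for illustration, not the paper's implementation.

```python
import math
from collections import Counter


def log_prevalence_ratios(ai_lemmas, human_lemmas, smoothing=1.0):
    """Return log(p_ai / p_human) per lemma.

    Assumption: prevalence is a smoothed relative frequency. The paper's
    exact definition may differ (e.g. per-document prevalence).
    """
    ai_counts = Counter(ai_lemmas)
    human_counts = Counter(human_lemmas)
    vocab = set(ai_counts) | set(human_counts)

    # Smoothed denominators so every lemma has a non-zero probability.
    ai_total = len(ai_lemmas) + smoothing * len(vocab)
    human_total = len(human_lemmas) + smoothing * len(vocab)

    ratios = {}
    for lemma in vocab:
        p_ai = (ai_counts[lemma] + smoothing) / ai_total
        p_human = (human_counts[lemma] + smoothing) / human_total
        ratios[lemma] = math.log(p_ai / p_human)
    return ratios


def top_overused(ratios, k=20):
    """Lemmas with the highest ratios; ties broken alphabetically."""
    return sorted(ratios, key=lambda lemma: (-ratios[lemma], lemma))[:k]


def percent_change(before, after):
    """Relative change in prevalence between two periods, in percent."""
    return 100.0 * (after - before) / before


# Toy data, invented purely for illustration.
ai_text = ["emphasize"] * 3 + ["say"]
human_text = ["emphasize"] + ["say"] * 3
ratios = log_prevalence_ratios(ai_text, human_text)
overused = top_overused(ratios, k=1)
```

A positive ratio marks a lemma as more prevalent in the AI continuations than in the human text. Applying `percent_change` to a lemma's prevalence in an earlier and a later period gives the kind of figure the abstract reports as a mean change across tracked items.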
