The BD-LSC Dataset: Facilitating the Benchmarking of Models for Lexical Semantic Change Detection in Slang and Standard Usage

2026-06-15 • Computation and Language

AI summary

The authors study how word meanings change over time, especially when a word gains new meanings and loses old ones at the same time, which is hard to detect. They created two new datasets: one that tracks words gaining, losing, or keeping meanings across three time periods, and another that covers words with both slang and standard uses, with a meaning label for each individual occurrence. They tested several families of models on these datasets and found that few-shot GPT-4o performed best overall. However, identifying rare slang meanings remains very challenging for all models.

semantic change, lexical semantics, word sense disambiguation, slang, contextual embeddings, supervised learning, transformer models, large language models, benchmark dataset, few-shot learning

Authors

Afnan Aloraini, Viktor Schlegel, Goran Nenadic, Riza Batista-Navarro

Abstract

Automatic semantic change detection aims to identify how word meanings shift over time, offering insights into both linguistic and societal change. Despite recent progress in computational lexical semantic change (LSC), existing benchmarks and methods struggle to capture bi-directional semantic change, particularly cases where words simultaneously gain and lose senses. This problem is especially challenging for words that have both slang and standard meanings. To address these gaps, we introduce two complementary benchmark datasets. The Bi-Directional Lexical Semantic Change (BD-LSC) dataset captures sense gain, sense loss, and stability across three time periods, enabling the study of complex semantic trajectories. The SlangTrack Word Sense Disambiguation (ST-WSD) dataset provides fine-grained, instance-level sense annotations for words combining slang and standard usages, supporting systematic benchmarking of WSD and semantic change detection models. Using these benchmarks, we systematically evaluate models across different methodological families: unsupervised clustering using contextualised embeddings, supervised machine learning, transformer-based models, and state-of-the-art large language models. Among the evaluated systems, the few-shot GPT-4o model achieved the strongest aggregate performance on Exact Sense Match (ESM) and multi-label accuracy; however, Macro-F1 scores near 0.5 across all systems show that rare slang senses remain difficult, which we identify as the central open challenge.
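
The gap between strong aggregate accuracy and a Macro-F1 near 0.5 can be illustrated with a minimal sketch on toy data. Nothing below comes from BD-LSC or ST-WSD: the labels and counts are invented, the function names are hypothetical, and reading Exact Sense Match as "the predicted sense set equals the gold sense set" is an assumption, since the abstract does not define the metric.

```python
def exact_set_match(gold, pred):
    """Share of instances whose predicted sense set equals the gold set.

    Assumed reading of an exact-match style metric, not the paper's definition.
    """
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-sense F1, so a rare sense weighs as much as a frequent one."""
    labels = sorted(set().union(*gold, *pred))
    scores = []
    for label in labels:
        tp = sum(label in g and label in p for g, p in zip(gold, pred))
        fp = sum(label not in g and label in p for g, p in zip(gold, pred))
        fn = sum(label in g and label not in p for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (invented): nine standard-sense instances and one slang instance.
gold = [{"standard"}] * 9 + [{"slang"}]
# A hypothetical system that never predicts the rare slang sense.
pred = [{"standard"}] * 10

print(f"exact set match: {exact_set_match(gold, pred):.3f}")
print(f"macro-F1:        {macro_f1(gold, pred):.3f}")
```

In this toy case the system is right on nine of ten instances, yet the slang sense contributes an F1 of zero, which pulls the macro average below 0.5. This only shows how the two kinds of metric can diverge; it does not reproduce the paper's evaluation.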
