Accurate and Efficient Statistical Testing for Word Semantic Breadth

2026-05-08

Computation and Language
AI summary

The authors describe a new way to compare how widely different words are used across contexts, which matters for building dictionaries and thesauri. They found that previous methods could mistake differences in semantic direction for differences in actual spread, producing false positives. To fix this, the authors created a test that aligns the two words' mean directions before measuring their spread, making the comparison more accurate. Their approach also runs much faster on GPUs and reduces errors while still detecting real differences.

contextualized token embeddings, word meaning breadth, dispersion statistics, Householder reflection, permutation test, Type-I error, semantic direction, GPU acceleration, non-parametric p-values
Authors
Yo Ehara
Abstract
Measuring the breadth of a word's meaning, or its spread across contexts, has become feasible with contextualized token embeddings. A word type can be represented as a cloud of token vectors, with dispersion-based statistics serving as proxies for contextual diversity (Nagata and Tanaka-Ishii, ACL 2025). These measurements are useful for deciding appropriate sense distinctions when constructing thesauri and domain-specific dictionaries. However, when comparing the breadth of two word types, naive hypothesis testing on dispersion can be misleading: differences in semantic direction can masquerade as dispersion differences, inflating Type-I error and yielding "statistically significant" outcomes even when there is no true breadth difference. This is problematic because significance testing should distinguish genuine effects from incidental fluctuations in small-difference regimes. We propose a Householder-aligned permutation test to isolate dispersion differences from directional differences. Our method applies a single Householder reflection to align the mean directions of the two word types and then performs a permutation test on the aligned token clouds, yielding calibrated, non-parametric p-values. For practicality, we introduce a GPU-oriented implementation that batches permutations and linear algebra operations. Empirically, our alignment reduced Type-I error by 32.5% while preserving sensitivity to genuine breadth differences, and achieved a 23x speedup over the CPU baseline.
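The abstract does not include the authors' implementation. The following is a minimal sketch of the idea in Python/NumPy, under two assumptions of ours: the dispersion statistic is the mean Euclidean distance of token vectors to their centroid (the paper's exact statistic, following Nagata and Tanaka-Ishii, may differ), and all function names (householder_align, aligned_permutation_test, etc.) are hypothetical. The Householder reflection H = I - 2ww^T with w proportional to u - v, where u and v are the unit mean directions of the two clouds, maps v onto u, so applying H to the second cloud aligns its mean direction with the first's before the permutation test is run.

    # Sketch only, not the authors' code. Assumes dispersion = mean
    # distance to centroid; the paper's statistic may differ.
    import numpy as np

    def householder_align(A, B, eps=1e-12):
        """Reflect cloud B (m, d) so its mean direction matches cloud A's (n, d)."""
        u = A.mean(axis=0)
        v = B.mean(axis=0)
        u = u / (np.linalg.norm(u) + eps)
        v = v / (np.linalg.norm(v) + eps)
        w = u - v
        nw = np.linalg.norm(w)
        if nw < eps:                       # directions already aligned
            return B
        w = w / nw
        # Apply H = I - 2ww^T row-wise without forming the d x d matrix.
        return B - 2.0 * (B @ w)[:, None] * w[None, :]

    def dispersion(X):
        """Mean Euclidean distance to the centroid (assumed breadth proxy)."""
        return np.linalg.norm(X - X.mean(axis=0), axis=1).mean()

    def aligned_permutation_test(A, B, n_perm=10_000, seed=0):
        """Two-sided permutation p-value for a breadth difference
        after Householder alignment of the mean directions."""
        rng = np.random.default_rng(seed)
        B = householder_align(A, B)
        pooled = np.vstack([A, B])
        n = len(A)
        observed = abs(dispersion(A) - dispersion(B))
        hits = 0
        for _ in range(n_perm):
            idx = rng.permutation(len(pooled))
            stat = abs(dispersion(pooled[idx[:n]]) - dispersion(pooled[idx[n:]]))
            hits += stat >= observed
        return (hits + 1) / (n_perm + 1)   # add-one-smoothed p-value

The GPU-oriented implementation mentioned in the abstract batches the permutations; one plausible way to do that (again our sketch, not the authors' code) is to draw all permutation index vectors at once and evaluate the dispersion statistic over a (permutations, tokens, dimensions) tensor in PyTorch:

    import torch

    def batched_dispersion(X):
        # X: (P, n, d) -> mean distance to centroid per permutation, shape (P,)
        return (X - X.mean(dim=1, keepdim=True)).norm(dim=2).mean(dim=1)

    def gpu_permutation_pvalue(A, B, n_perm=10_000, device="cuda"):
        # Assumes B has already been Householder-aligned to A.
        A = torch.as_tensor(A, dtype=torch.float32, device=device)
        B = torch.as_tensor(B, dtype=torch.float32, device=device)
        pooled = torch.cat([A, B])          # (N, d)
        n, N = A.shape[0], pooled.shape[0]
        observed = (batched_dispersion(A[None]) - batched_dispersion(B[None])).abs()
        # Each row of idx is one random shuffle of 0..N-1; for very large
        # n_perm, process the permutations in chunks to bound memory.
        idx = torch.argsort(torch.rand(n_perm, N, device=device), dim=1)
        ga, gb = pooled[idx[:, :n]], pooled[idx[:, n:]]
        stats = (batched_dispersion(ga) - batched_dispersion(gb)).abs()
        return ((stats >= observed).sum().item() + 1) / (n_perm + 1)

Replacing the Python loop with a single sorted-random-indices tensor moves all permutation bookkeeping into batched GPU linear algebra, which is the kind of restructuring that could plausibly account for the reported speedup over a loop-based CPU baseline.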