Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
2026-04-02 • Computation and Language
Computation and Language · Artificial Intelligence · Machine Learning
AI summary
The authors study how language models handle new task-specific special tokens added to the vocabulary. They find that the common practice of initializing new tokens as the average of existing embeddings makes them all nearly identical, a degeneracy that subsequent training cannot fully repair. To address this, they propose Grounded Token Initialization (GTI), which places new tokens at distinct, semantically meaningful locations in the existing embedding space before training. GTI helps the model learn better representations, keeps new tokens distinct, and improves results on recommendation tasks.
language models, token initialization, vocabulary extension, embedding space, fine-tuning, semantic tokens, grounded initialization, generative recommendation, linguistic grounding, paired supervision
Authors
Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk, Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak
Abstract
Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
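To make the degeneracy concrete, the following is a minimal NumPy sketch (not the authors' implementation) contrasting mean initialization with a grounded alternative. The grounding rule used here, averaging the embeddings of a small set of linguistically paired vocabulary tokens per new token, is a hypothetical stand-in for GTI's paired supervision; the array sizes and `paired_ids` selection are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pretrained embedding table: vocab_size x dim.
vocab_size, dim, num_new = 1000, 64, 8
E = rng.normal(size=(vocab_size, dim))

# Mean initialization: every new token starts at the centroid of the
# existing vocabulary, so all new rows are identical.
mean_init = np.tile(E.mean(axis=0), (num_new, 1))

# Diagnostic: pairwise distances between the new tokens are all zero,
# i.e. they collapse into a single point (a degenerate subspace).
pairwise = np.linalg.norm(mean_init[:, None] - mean_init[None, :], axis=-1)
print(pairwise.max())  # 0.0

# Grounded alternative (hypothetical sketch): place each new token at the
# mean of the embeddings of vocabulary tokens it is paired with, yielding
# distinct, semantically anchored starting points.
paired_ids = [rng.choice(vocab_size, size=5, replace=False) for _ in range(num_new)]
grounded_init = np.stack([E[ids].mean(axis=0) for ids in paired_ids])
grounded_pairwise = np.linalg.norm(
    grounded_init[:, None] - grounded_init[None, :], axis=-1
)
print(grounded_pairwise.max() > 0)  # True: new tokens start distinct
```

Under mean initialization the diagnostic distance matrix is exactly zero, matching the paper's observation that inter-token distinctions are erased before fine-tuning even begins; any grounding rule that gives each new token its own anchor avoids this collapse at initialization.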