Decoupled Residual Quantization for Robust Semantic IDs in Recommendation

2026-06-01 • Information Retrieval

AI summary

The authors study why semantic IDs, which represent items as shared discrete token sequences, can fail: through codebook underutilization, unstable decision boundaries, or geometric distortion of the embedding space. They propose a framework that quantifies these failures with two measures: expected codeword overlap (how often codes are confused under retrieval-time perturbation) and effective codebook capacity (how many well-separated codes are actually usable). As a proof of concept, they introduce Decoupled Residual Quantization (DRQ), which separates continuous geometry reconstruction from discrete distribution matching. Experiments on one proprietary industrial dataset indicate that semantic ID quality is multi-objective, with symbolic robustness, reconstruction fidelity, and behavior-aware soft matching each stressing a different aspect of a tokenizer. The authors present these results as a case study rather than a universal benchmark.

Semantic IDs, Tokenizer, Codebook, Codeword confusion, Effective codebook capacity, Euclidean embedding space, Decoupled Residual Quantization, Symbolic robustness, Reconstruction fidelity, Soft matching

Authors

Xuesi Wang, Junjie Wang, Ziliang Wang, Weijie Bian, Guanxing Zhang

Abstract

Semantic IDs represent items as shared discrete token sequences and have become a practical tool for recommendation and retrieval. Yet it remains difficult to tell why a tokenizer fails: poor quality may come from codebook underutilization, unstable decision boundaries, or geometric distortion of the embedding space. This paper develops a quantitative framework for diagnosing these failures through expected codeword overlap and effective codebook capacity. The former measures expected codeword confusion under retrieval-time perturbation, while the latter converts that confusion into an effective number of usable, well-separated codes. The framework links semantic boundary confusion to both code usage imbalance and Euclidean geometric constraints. As a proof of concept, we present Decoupled Residual Quantization (DRQ), which separates continuous geometry reconstruction from discrete distribution matching. Experiments on a large-scale industrial dataset show that Semantic ID quality is multi-objective: symbolic robustness, reconstruction fidelity, and behavior-aware soft matching each stress different aspects of a tokenizer. These downstream observations are based on one proprietary industrial dataset, so they should be read as a case study rather than a universal benchmark claim.
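
The abstract does not give formulas, so the following is only a toy sketch of the kind of quantities it describes, not the paper's definitions and not DRQ itself. It assumes a generic greedy residual quantizer, uses the rate at which a first-level code flips under Gaussian noise as a stand-in for codeword confusion under perturbation, and uses the perplexity of code usage as a stand-in for an effective number of usable codes. All function names, the noise model, and the toy codebook are hypothetical.

```python
import math
import random


def nearest_code(x, codebook):
    """Index of the codeword closest to x in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(x, codebook[k])),
    )


def residual_quantize(x, codebooks):
    """Generic greedy residual quantization (an assumption, not DRQ):
    each level encodes the residual left by the previous levels."""
    residual = list(x)
    tokens = []
    for cb in codebooks:
        k = nearest_code(residual, cb)
        tokens.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return tokens


def flip_rate(points, codebook, sigma, trials, rng):
    """Monte Carlo share of Gaussian-perturbed points whose code changes.
    A proxy for confusion under perturbation, not the paper's metric."""
    flips = 0
    total = 0
    for x in points:
        base = nearest_code(x, codebook)
        for _ in range(trials):
            noisy = [v + rng.gauss(0.0, sigma) for v in x]
            flips += nearest_code(noisy, codebook) != base
            total += 1
    return flips / total


def usage_perplexity(assignments, num_codes):
    """exp(entropy) of empirical code usage, between 1 and num_codes.
    A proxy for 'effective' code count that ignores geometry."""
    counts = [0] * num_codes
    for k in assignments:
        counts[k] += 1
    n = len(assignments)
    entropy = -sum((c / n) * math.log(c / n) for c in counts if c > 0)
    return math.exp(entropy)


# Toy usage: four codewords on the unit square, 200 random 2-D "items".
rng = random.Random(0)
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
points = [[rng.random(), rng.random()] for _ in range(200)]
assignments = [nearest_code(p, codebook) for p in points]
print("usage perplexity:", usage_perplexity(assignments, len(codebook)))
print("flip rate:", flip_rate(points, codebook, sigma=0.05, trials=20, rng=rng))
```

In this toy setting, a perplexity well below the codebook size would point to usage imbalance, while a high flip rate would point to items sitting near decision boundaries. The two proxies can move independently, which is consistent with the abstract's claim that tokenizer quality is multi-objective.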
