On the Effect of Segmentation Width and Cluster Size on Speech Resynthesis and Continuation in Generative Spoken Language Models

2026-06-22 · Computation and Language

Computation and Language, Sound
AI summary

The authors studied a way to teach computers to understand and generate speech without using written text, by turning speech into coded sounds. They experimented with different levels of sound detail (bitrates) and found that the computer can still create clear and natural-sounding speech even at lower detail levels than usual. They also checked how well the computer can continue speaking smoothly, which stayed good across these levels. However, the authors noted that current automated ways to judge speech quality don't match well with human opinions, so better evaluation methods are needed.

Generative Spoken Language Modeling, discrete speech representations, bitrate, K-means clustering, speech synthesis, language models, speech continuation, automatic evaluation metrics, LLM-based metrics, speech intelligibility
Authors
Shunsuke Kando, Wataru Nakata, Shinnosuke Takamichi, Yusuke Miyao
Abstract
Generative Spoken Language Modeling (GSLM) enables text-free speech modeling by training language models (LMs) on discrete speech representations instead of text transcriptions. In this paper, we investigate the performance of GSLM on speech synthesis and continuation using discrete speech representations with varying bitrates. We segment speech representations with fixed widths and train K-means models with multiple cluster sizes, resulting in various bitrate settings. We demonstrate that intelligible and natural speech can be synthesized at lower bitrate settings than the baseline. Furthermore, speech continuation quality remains stable at lower bitrates across multiple metrics, suggesting that the conventional GSLM setting may be redundant for effective speech generation. Although LLM-based metrics correlate more strongly with human subjective scores than conventional metrics do, the correlation remains low, highlighting the need for more stable automatic evaluation methods.
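
To make the relationship between segmentation width, cluster size, and bitrate concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each fixed-width segment is reduced to one vector by mean pooling, that each segment is mapped to its nearest K-means centroid, and that the bitrate is counted as units per second times log2 of the cluster size. The function names, the pooling choice, and the 50 Hz frame rate used in the usage note are illustrative assumptions, not details taken from the abstract.

```python
import math

import numpy as np


def bitrate_bps(frame_rate_hz, segment_width, num_clusters):
    """Upper-bound bitrate of a unit stream: units per second times bits per unit."""
    units_per_second = frame_rate_hz / segment_width
    return units_per_second * math.log2(num_clusters)


def segment_mean_pool(features, segment_width):
    """Average consecutive frames in fixed-width segments.

    Assumption: mean pooling; trailing frames that do not fill a segment are dropped.
    features has shape (num_frames, feature_dim).
    """
    num_segments = features.shape[0] // segment_width
    trimmed = features[: num_segments * segment_width]
    return trimmed.reshape(num_segments, segment_width, -1).mean(axis=1)


def assign_units(segments, centroids):
    """Map each segment vector to the index of its nearest centroid (squared Euclidean)."""
    diffs = segments[:, None, :] - centroids[None, :, :]
    return (diffs ** 2).sum(axis=-1).argmin(axis=1)
```

Under these assumptions, a 50 Hz feature stream with width 1 and 128 clusters gives `bitrate_bps(50, 1, 128)` = 350 bits per second, while width 2 with 256 clusters gives `bitrate_bps(50, 2, 256)` = 200 bits per second, so widening the segments can lower the bitrate even when the cluster size grows.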