TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech
2026-06-08 • Sound
Sound, Artificial Intelligence
AI summary
The authors describe TLDR, a method for speeding up autoregressive text-to-speech (TTS) models, which usually process speech as very long sequences of small tokens. Instead of handling each token one by one, TLDR groups consecutive tokens into larger patches, so the main model works on a much shorter sequence. This reduces the memory needed and makes inference faster while reusing the existing pretrained components. With a patch size of 4, the authors report a 1.8x inference speedup and up to 75% less KV-cache memory, and their experiments indicate that this patch-level method can lower the computing cost while maintaining quality.
text-to-speech, autoregressive models, codec tokens, causal modeling, KV cache, latent patches, LoRA, inference speed, speech tokenization, machine learning compression
Authors
Yejin Lee, Junwon Moon, Hyoeun Kim, Hyunjin Choi, Heeseung Kim, Kyuhong Shim
Abstract
Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting the causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.
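The link between patch size and the reported KV-cache saving can be illustrated with a short sketch. It is not the authors' implementation: the function names and the padding scheme are hypothetical, the learned compressor is replaced by plain grouping, and the cache is assumed to hold one entry per global position, ignoring text-prompt tokens and any state kept by the extractor.

```python
import math


def group_into_patches(tokens, patch_size, pad_id=0):
    """Split a flat codec-token sequence into consecutive fixed-size patches.

    Assumption: the last patch is right-padded with `pad_id`; the abstract
    does not say how incomplete patches are handled.
    """
    padded = list(tokens) + [pad_id] * (-len(tokens) % patch_size)
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def patch_sequence_length(num_tokens, patch_size):
    """Number of positions the global AR backbone sees after patching."""
    return math.ceil(num_tokens / patch_size)


def kv_cache_reduction(num_tokens, patch_size):
    """Fraction of global KV-cache entries saved relative to token-level AR.

    Assumption: cache size scales linearly with the number of positions.
    """
    return 1.0 - patch_sequence_length(num_tokens, patch_size) / num_tokens


if __name__ == "__main__":
    print(group_into_patches([11, 12, 13, 14, 15], patch_size=4))
    print(patch_sequence_length(1000, patch_size=4))
    print(kv_cache_reduction(1000, patch_size=4))
```

Under these assumptions, a patch size of 4 cuts the number of global positions to one quarter, which corresponds to the "up to 75%" figure; the saving is slightly smaller when the token count is not a multiple of the patch size, because the final patch still occupies a full position.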