AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation
2026-06-22 • Sound
AI summary
The authors introduce AudioCALM, a single model that generates speech, sound, and music. Instead of predicting discrete tokens with a softmax, the model uses a lightweight flow-matching head to predict continuous audio latents autoregressively. Speech transcripts align to specific time spans, whereas sound and music captions describe only overall content, so the authors add a dedicated expert for speech that adds no inference overhead for non-speech inputs. Their experiments show that AudioCALM matches the best modality-specific models and outperforms previous unified models.
autoregressive model, audio generation, flow matching, continuous audio latents, Asymmetric Mixture-of-Modality-Experts, time-aligned attention, speech synthesis, sound generation, music generation
Authors
Huadai Liu, Kaicheng Luo, Wen Wang, Qian Chen, Bin Ma, Xiangang Li, Wei Xue
Abstract
Unifying speech, sound, and music generation in one model is hindered by tradeoffs between fidelity, end-to-end training, in-context conditioning, and variable-length synthesis that no current paradigm fully resolves. To address this challenge, we present AudioCALM, a universal audio generation framework that extends autoregressive (AR) next-token prediction from discrete tokens to continuous audio latents: a thin flow-matching head replaces the softmax to predict rectified-flow velocities at each position, and a block-causal AR-Flow attention pattern produces arbitrary-length output. Joint training of multiple audio generation tasks faces an asymmetric text-audio mismatch: speech transcripts align to specific time spans and demand tight, time-aligned attention, whereas sound and music captions describe only overall semantics and rely on diffuse, holistic attention; mixing the two disproportionately degrades sound and music generation. We address this asymmetry at two levels: a data reformulation strategy that unifies all three tasks under a single description-style conditioning interface, and a novel architecture, Asymmetric Mixture-of-Modality-Experts (A-MoME), which adds a dedicated residual expert for speech while sound and music share the backbone, incurring no inference overhead on non-speech inputs. Experimental results demonstrate that AudioCALM matches modality-specific state-of-the-art and outperforms prior unified baselines on speech, sound, and music generation benchmarks.
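The abstract names three mechanisms without giving their details: a block-causal attention pattern, a head that predicts rectified-flow velocities, and a residual speech expert. The sketch below is a minimal, pure-Python illustration of how such components are commonly built, not the authors' implementation. Every function name, the block size, the noise-to-data path convention, and the boolean routing flag are assumptions introduced here for illustration.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def block_causal_mask(num_positions: int, block_size: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed rule (not taken from the paper): positions attend freely inside
    their own block and causally to all earlier blocks.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(num_positions)]
        for i in range(num_positions)
    ]


def rectified_flow_pair(noise: Vector, latent: Vector, t: float) -> Tuple[Vector, Vector]:
    """Return (x_t, velocity target) on the straight path from noise to latent.

    Assumed convention: x_t = (1 - t) * noise + t * latent, so the velocity
    target is the constant vector latent - noise.
    """
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, latent)]
    velocity = [x - n for n, x in zip(noise, latent)]
    return x_t, velocity


def euler_sample(
    noise: Vector,
    velocity_fn: Callable[[Vector, float], Vector],
    steps: int,
) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def residual_expert_layer(
    hidden: Vector,
    shared_fn: Callable[[Vector], Vector],
    speech_fn: Callable[[Vector], Vector],
    is_speech: bool,
) -> Vector:
    """Shared path for every input; the speech expert is added as a residual.

    Non-speech inputs never call speech_fn, which is one way to read
    "no inference overhead on non-speech inputs".
    """
    out = shared_fn(hidden)
    if is_speech:
        extra = speech_fn(hidden)
        out = [o + e for o, e in zip(out, extra)]
    return out
```

In a trained model, `velocity_fn` would be the learned flow-matching head conditioned on the autoregressive context. Substituting the exact target velocity shows that Euler integration moves a noise sample onto the data latent along a straight line. The mask uses the convention that `True` means attention is allowed, and a real attention implementation may expect the opposite or an additive mask.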