TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs

2026-04-09 • Artificial Intelligence

AI summary

The authors developed TASU2, a new method to help speech models learn from text without needing lots of paired audio data. TASU2 can control how much error appears in simulated speech signals, making the training process smoother and more effective. Their approach improves how well models recognize speech in different conditions, doing better than previous methods that also try to learn from text or use synthetic speech. TASU2 also avoids hurting performance on the original speech data.

Speech LLM, CTC simulation, Cross-modal alignment, WER (Word Error Rate), Post-training curriculum, Text-only supervision, Low-resource adaptation, TTS (Text-to-Speech), Speech recognition, Domain adaptation

Authors

Jing Peng, Chenghao Wang, Yi Yang, Lirong Qian, Junjie Li, Yu Xi, Shuai Wang, Kai Yu

Abstract

Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose TASU2, a controllable CTC simulation framework that simulates CTC posterior distributions under a specified WER range, producing text-derived supervision that better matches the acoustic decoding interface. This enables principled post-training curricula that smoothly vary supervision difficulty without TTS. Across multiple source-to-target adaptation settings, TASU2 improves in-domain and out-of-domain recognition over TASU, and consistently outperforms strong baselines including text-only fine-tuning and TTS-based augmentation, while mitigating source-domain performance degradation.
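
The abstract does not describe the simulation procedure itself, so the following is only an illustrative sketch of what "simulating CTC posteriors under a specified WER range" could look like, not the authors' algorithm. It assumes token ids in `1..vocab_size-1`, a blank index of 0, a per-token edit probability drawn uniformly from the requested range as a rough proxy for WER, and a fixed `peak` probability on the chosen label in each frame. All function names and parameters (`corrupt`, `to_ctc_path`, `simulate_posteriors`, `greedy_decode`, `peak`, `max_repeat`) are hypothetical.

```python
import random

BLANK = 0  # assumed CTC blank index


def corrupt(tokens, error_rate, vocab_size, rng):
    """Edit each token with probability error_rate (substitute, delete, or insert)."""
    out = []
    for tok in tokens:
        if rng.random() < error_rate:
            op = rng.choice(("sub", "del", "ins"))
            if op == "sub":
                # Draw a non-blank label guaranteed to differ from tok.
                new = rng.randrange(1, vocab_size - 1)
                out.append(new + 1 if new >= tok else new)
            elif op == "ins":
                out.extend([tok, rng.randrange(1, vocab_size)])
            # "del": the token is simply dropped.
        else:
            out.append(tok)
    return out


def to_ctc_path(tokens, rng, max_repeat=3):
    """Expand a label sequence into a frame-level CTC path with repeats and blanks."""
    path, prev = [], None
    for tok in tokens:
        # A blank is mandatory between identical neighbours, optional otherwise.
        if tok == prev or rng.random() < 0.5:
            path.append(BLANK)
        path.extend([tok] * rng.randint(1, max_repeat))
        prev = tok
    path.append(BLANK)  # trailing blank keeps the path non-empty
    return path


def simulate_posteriors(tokens, wer_range, vocab_size, peak=0.9, seed=None):
    """Return one probability vector per frame, peaked on a noisy CTC path."""
    rng = random.Random(seed)
    error_rate = rng.uniform(*wer_range)  # target difficulty for this sample
    path = to_ctc_path(corrupt(tokens, error_rate, vocab_size, rng), rng)
    off = (1.0 - peak) / (vocab_size - 1)  # mass shared by the other labels
    return [[peak if v == label else off for v in range(vocab_size)]
            for label in path]


def greedy_decode(posteriors):
    """Standard CTC greedy decoding: argmax per frame, merge repeats, drop blanks."""
    out, prev = [], BLANK
    for frame in posteriors:
        label = max(range(len(frame)), key=frame.__getitem__)
        if label != BLANK and label != prev:
            out.append(label)
        prev = label
    return out
```

Under these assumptions, a curriculum could start with a narrow low range such as `(0.0, 0.05)` and widen or shift it over training. With `wer_range=(0.0, 0.0)` no edits are applied, so greedy decoding of the simulated posteriors recovers the input transcript exactly.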
