MOSS-Audio Technical Report

2026-06-01 • Sound

Sound, Artificial Intelligence

AI summary

The authors created MOSS-Audio, a single AI model that can understand different kinds of sounds like speech, music, and noises from the environment. It combines special parts to process sound over time and convert it into text that describes or answers questions about the audio. They introduced new ways to give the model detailed time information and layered sound features to improve understanding. The model was trained on lots of labeled audio data and improved further for tasks like transcribing speech with timestamps and following instructions. Overall, MOSS-Audio shows good performance in many audio-related tasks, making it useful for future voice-based applications.

audio-language model, audio encoder, temporal representation, autoregressive text generation, event-preserving annotation, timestamped transcription, instruction fine-tuning, audio-grounded reasoning, speech captioning, pretraining

Authors

Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu, Jingqi Chen, Ke Chen, Wenxuan Wang, Yang Wang, Yaozhou Jiang, Yi Jiang, Zhengyuan Lin, Ziqi Chen, Zhaoye Fei, Chenghao Liu, Jun Zhan, Kang Yu, Kexin Huang, Mingshu Chen, Qinyuan Cheng, Ruixiao Li, Shimin Li, Songlin Wang, Yang Gao, Yiyang Zhang, Xipeng Qiu

Abstract

MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.
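As a rough illustration of the time-marker idea, the sketch below interleaves timestamp markers with a sequence of encoder frames at the 12.5 Hz rate stated in the abstract (one frame per 80 ms). The abstract does not give the marker interval, the marker format, or how markers are embedded, so the 2-second interval, the `<t=...s>` string format, and the function name `insert_time_markers` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Union

FRAME_RATE_HZ = 12.5  # encoder output rate stated in the abstract (80 ms per frame)


def insert_time_markers(
    audio_tokens: List[int],
    interval_s: float = 2.0,  # assumed interval; not specified in the abstract
) -> List[Union[int, str]]:
    """Interleave timestamp markers with audio tokens.

    A marker is placed before every block of `interval_s` seconds of frames.
    The marker format "<t=X.Xs>" is hypothetical.
    """
    # 2.0 s * 12.5 Hz = 25 frames, so the default interval lands on whole frames.
    frames_per_marker = int(round(interval_s * FRAME_RATE_HZ))
    if frames_per_marker <= 0:
        raise ValueError("interval_s must cover at least one frame")

    stream: List[Union[int, str]] = []
    for index, token in enumerate(audio_tokens):
        if index % frames_per_marker == 0:
            seconds = index / FRAME_RATE_HZ  # elapsed time at this frame
            stream.append(f"<t={seconds:.1f}s>")
        stream.append(token)
    return stream


if __name__ == "__main__":
    # 60 placeholder frames correspond to 4.8 s of audio.
    frames = list(range(60))
    marked = insert_time_markers(frames)
    print([item for item in marked if isinstance(item, str)])
```

With markers of this kind in the stream, the decoder can in principle read off an approximate position for any audio frame from the nearest preceding marker, which is the sort of explicit temporal cue that timestamped transcription and time-aware question answering would rely on.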
