USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

2026-06-04

Computation and Language, Sound
AI summary

The authors extend USAD, a universal audio encoder, by distilling knowledge from both self-supervised and supervised foundation models so that a single model can handle different types of audio such as speech and music. USAD 2.0 introduces domain-aware distillation to address the mismatch between teachers that specialize in different audio domains, extends coverage to music, and adds a second, supervised distillation stage aimed at downstream use. The model is also scaled to one billion parameters by increasing its depth. In the reported experiments, USAD 2.0 achieves strong or state-of-the-art results in both probing and LLM-based evaluations.

audio encoder, self-supervised learning, supervised learning, domain-aware distillation, foundation models, multi-domain audio, model scaling, speech recognition, music representation, large language models
Authors
Heng-Jui Chang, Alexander H. Liu, Saurabhchand Bhati, Mrudula Athi, Anton Ratnarajah, Amit Chhetri, James Glass
Abstract
Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.
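The abstract does not say how domain-aware distillation is computed. As a rough illustration only, one plausible reading is that each training clip carries a domain label, and the student is penalized only against the teachers matched to that domain. The sketch below assumes that reading. The routing table `TEACHERS_BY_DOMAIN`, the teacher names, the per-teacher prediction heads, and the use of mean squared error are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical routing table: which teachers supervise which domain.
# Names and assignments are illustrative assumptions, not from the paper.
TEACHERS_BY_DOMAIN: Dict[str, List[str]] = {
    "speech": ["speech_teacher"],
    "music": ["music_teacher"],
    "mixed": ["speech_teacher", "music_teacher"],
}


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def domain_aware_loss(
    student_preds: Dict[str, List[float]],
    teacher_feats: Dict[str, List[float]],
    domain: str,
) -> float:
    """Average the per-teacher MSE over only the teachers matched to `domain`.

    `student_preds[t]` is the student's prediction of teacher `t`'s features
    (assumed here to come from one prediction head per teacher).
    """
    teachers = TEACHERS_BY_DOMAIN[domain]
    losses = [mse(student_preds[t], teacher_feats[t]) for t in teachers]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    student = {"speech_teacher": [1.0, 2.0], "music_teacher": [0.0, 0.0]}
    teacher = {"speech_teacher": [1.0, 4.0], "music_teacher": [3.0, 3.0]}
    # A speech clip ignores the music teacher entirely.
    print(domain_aware_loss(student, teacher, "speech"))
    print(domain_aware_loss(student, teacher, "music"))
```

Under this assumed routing, a speech clip contributes no gradient toward the music teacher's targets, which is one simple way to avoid training the student on teacher features from a domain the teacher was not built for.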