Scaling Audio Models Efficiently: A Joint Study of Compute Constraints and Optimization Behavior

2026-06-22 • Sound

Sound • Artificial Intelligence

AI summary

The authors studied how best to spend a fixed compute budget when building models that recognize speech and the emotions expressed in speech. They looked at three factors: how big the model is, how long the audio input is, and how detailed the audio representation is. Their experiments showed that making models bigger helps, but with diminishing returns. They also found that about 4 seconds of audio works best for recognizing emotions, and that lowering audio detail can save computation without much loss in accuracy. Finally, they showed a way to fine-tune models efficiently with little drop in performance.

Automatic Speech Recognition (ASR) • Speech Emotion Recognition (SER) • Model size • Input length • Representation resolution • Compute budget • Scaling behavior • Word Error Rate (WER) • LoRA adaptation • Inference cost

Authors

Vyom Agarwal, Mokshda Gangrade, Siddharth Pal, Jerry Wu

Abstract

In this paper, we investigate the tradeoffs between compute allocation and model performance for two speech processing tasks: Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER). We propose a unified framework that analyzes three fundamental compute dimensions: model size ($x_N$), input length ($x_T$), and representation resolution ($x_V$). Motivated by recent advances in compute-optimal scaling for multimodal models, we systematically vary these dimensions to examine their influence on task performance under fixed computational budgets. Our study provides insights into how compute resources can be optimally distributed across model capacity, temporal context, and representational granularity, offering practical guidelines for the design of efficient speech models. Through experiments on the LibriSpeech and CREMA-D datasets, we demonstrate non-linear scaling behavior and identify optimal operating points. Our results show that (1) increasing model size yields diminishing returns: scaling from Tiny (39M) to Small (244M) reduces WER by 8.22%, whereas scaling from Small to Medium (769M) reduces WER by only 2.35%; (2) an optimal audio duration of approximately 4 seconds exists for SER; and (3) reducing encoder token resolution is an effective mechanism for lowering inference cost: Large-v3 (1540M) requires 2572 GFLOPs with 750 frames versus 5228 GFLOPs with 1500 frames, with less than a 3% relative increase in WER. Additionally, LoRA-based adaptation enables efficient fine-tuning with minimal performance degradation.
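To make the figures quoted in the abstract easier to compare, the short Python sketch below works through the arithmetic. It is an illustration only, not the authors' analysis: the helper names are hypothetical, and normalizing the WER reduction by the number of parameter doublings is an assumption introduced here to put the two scaling steps on a common footing. The abstract does not state whether the WER reductions are absolute or relative, so they are used as reported.

```python
import math

# Figures quoted in the abstract (Large-v3 encoder cost per input).
GFLOPS_1500_FRAMES = 5228.0
GFLOPS_750_FRAMES = 2572.0

# Model sizes in millions of parameters, as quoted in the abstract.
PARAMS_M = {"tiny": 39.0, "small": 244.0, "medium": 769.0}

# WER reductions (in the units reported by the abstract) for each scaling step.
WER_REDUCTION = {("tiny", "small"): 8.22, ("small", "medium"): 2.35}


def relative_saving(full_cost: float, reduced_cost: float) -> float:
    """Fraction of compute saved when moving from full_cost to reduced_cost."""
    return 1.0 - reduced_cost / full_cost


def reduction_per_doubling(src: str, dst: str) -> float:
    """WER reduction divided by the number of parameter doublings.

    This normalization is an assumption made for illustration; it is not
    a metric defined in the paper.
    """
    doublings = math.log2(PARAMS_M[dst] / PARAMS_M[src])
    return WER_REDUCTION[(src, dst)] / doublings


saving = relative_saving(GFLOPS_1500_FRAMES, GFLOPS_750_FRAMES)
tiny_to_small = reduction_per_doubling("tiny", "small")
small_to_medium = reduction_per_doubling("small", "medium")

print(f"Compute saved by halving frames: {saving:.1%}")
print(f"WER reduction per doubling, Tiny -> Small:   {tiny_to_small:.2f}")
print(f"WER reduction per doubling, Small -> Medium: {small_to_medium:.2f}")
```

Halving the frame count from 1500 to 750 cuts the quoted cost by roughly 51%, and under the per-doubling normalization assumed here the Tiny-to-Small step yields more WER reduction per doubling than the Small-to-Medium step, which is consistent with the diminishing returns the abstract describes.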
