Probing Low Frame Rate Degradation in Neural Audio Codecs

2026-06-15

Sound, Artificial Intelligence
AI summary

The authors studied why neural audio codecs, which compress speech for synthesis, lose quality at very low frame rates (the number of codec frames produced per second of audio). They found that the previously reported sharp drop in quality at 6.25 Hz was not caused by an inherent technical limit but by the training setup: clips of fixed duration yield too few tokens at low frame rates, leaving the decoder with too little context. After this training issue was corrected, quality degraded gradually rather than abruptly, down to 3.1 Hz and 1.6 Hz. This suggests that generating speech at very low frame rates is more feasible than previously thought.

neural audio codec, frame rate, autoregressive synthesis, phonemic collision, codebook saturation, training configuration, word error rate (WER), clip duration, inference cost, decoder context
Authors
Alex Gichamba, Moise Busogi
Abstract
Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate degradation remain insufficiently understood. We investigate these mechanisms through a controlled frame rate ablation. We reproduce a quality cliff at 6.25 Hz reported in prior work and evaluate two candidate explanations, phonemic collisions and codebook saturation, neither of which shows evidence of a fundamental barrier. The cliff is instead caused by a suboptimal training configuration: a fixed clip duration during training yields too few tokens at low frame rates, starving the decoder of inter-token context. Once this is corrected, WER degrades smoothly with phonemic load down to 3.1 Hz and 1.6 Hz, suggesting that the inference-time efficiency gains of low frame rate codecs are more accessible than previously assumed.
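
The arithmetic behind the training-configuration argument can be sketched in a few lines. This is an illustration only, not the authors' code: the 2-second clip length and the 25-token budget are hypothetical values chosen for the example, and the helper names are made up here. The abstract does not state how the configuration was corrected; the second helper simply assumes the goal is to keep the token count per training clip roughly constant.

```python
import math

# Frame rates mentioned in the abstract, in Hz.
FRAME_RATES_HZ = [12.5, 6.25, 3.1, 1.6]

# Hypothetical values, not taken from the paper.
CLIP_SECONDS = 2.0
TARGET_TOKENS = 25


def tokens_per_clip(clip_seconds: float, frame_rate_hz: float) -> int:
    """Whole codec frames that fit in one training clip of fixed duration."""
    return math.floor(clip_seconds * frame_rate_hz)


def clip_seconds_for_tokens(target_tokens: int, frame_rate_hz: float) -> float:
    """Clip duration needed to reach a fixed number of frames per clip."""
    return target_tokens / frame_rate_hz


if __name__ == "__main__":
    for rate in FRAME_RATES_HZ:
        fixed = tokens_per_clip(CLIP_SECONDS, rate)
        needed = clip_seconds_for_tokens(TARGET_TOKENS, rate)
        print(f"{rate:>5} Hz: {fixed:>2} frames in a {CLIP_SECONDS:.0f} s clip; "
              f"{needed:.1f} s needed for {TARGET_TOKENS} frames")
```

With a fixed clip duration, the frame count per clip shrinks in proportion to the frame rate, so the decoder sees progressively less inter-token context during training. The same proportionality is what makes low frame rates attractive at inference time, since autoregressive generation cost scales linearly with sequence length.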