The Anatomy of the CTC Oracle Gap: Acoustic Exhaustion and Linguistic Recovery

2026-06-22

Computation and Language, Machine Learning
AI summary

The authors tested whether rescoring the N-best hypotheses of a CTC speech recognizer with the model's own confidence and acoustic-feature scores can improve on greedy decoding. They found no statistically significant gains, and the scores became less informative as more hypotheses were considered. They conclude that the limiting factor is linguistic rather than acoustic information, and show that adding external language knowledge from a RoBERTa model through MBR decoding reduces word error rate from 5.96% to 5.42% on LibriSpeech test-other. Applied without retuning, the same recipe gave significant gains in 11 of 13 conditions spanning datasets, model architectures, and noise levels. Sequence-level fine-tuning (MWER) at near-converged checkpoints did not help: all four configurations on CR-CTC degraded WER by 6.18 to 8.90 pp.

CTC (Connectionist Temporal Classification), N-best hypothesis selection, Word error rate (WER), Spearman correlation, Blank token proliferation, MBR (Minimum Bayes Risk) decoding, RoBERTa, Pseudo-log-likelihood (PLL), Sequence-level fine-tuning, Rao-Blackwellized REINFORCE
Authors
Ivan Novosad
Abstract
We study the limits of CTC-internal scoring for N-best hypothesis selection and locate the information bottleneck separating acoustic confidence from linguistic plausibility. Eleven CTC-internal and acoustic-feature scoring strategies produce no statistically significant WER improvement over greedy decoding on LibriSpeech dev-other at G=16 (all p > 0.05). The exhaustion is systematic: CTC's Spearman $ρ$ between hypothesis score and per-utterance WER degrades from -0.574 at G=4 to -0.270 at G=128, a 53% loss driven by blank-path proliferation. This establishes that the discriminative capacity of CTC-internal representations is saturated: no recombination of acoustic signals can close the oracle gap. Confirming that the bottleneck is linguistic, not acoustic, external linguistic information introduced via MBR decoding breaks through it. MBR-CER decoding with a RoBERTa pseudo-log-likelihood (PLL) posterior ($τ$=10, G=128) achieves 5.42% WER on held-out LibriSpeech test-other (greedy 5.96%, $Δ$=-0.535 pp, p<0.0001, 9.0% relative). RoBERTa PLL $ρ$ degrades only 21% over the same range, retaining discriminating power where CTC loses it. Applied without retuning across two Zipformer architectures, three domains (LibriSpeech, TED-LIUM 3, VoxPopuli), and four MUSAN noise levels, the recipe gives significant gains in 11 of 13 conditions. On the training side, standard MWER training via the CTC forward-backward algorithm implements Rao-Blackwellized REINFORCE at the output projection (variance about 3x below Viterbi). Yet sequence-level fine-tuning fails at near-converged checkpoints: all four MWER configurations on CR-CTC collapse (+6.18 to +8.90 pp WER), as a training oracle gap of 0.007 pp provides no usable reward signal.
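
To make the MBR-CER selection step concrete, the following is a minimal Python sketch, not the author's implementation. It assumes the posterior over the N-best list is a softmax of per-hypothesis scores divided by the temperature $τ$, which the abstract does not state explicitly. The function names (`edit_distance`, `cer`, `posterior_from_scores`, `mbr_cer_select`) are hypothetical, and the scores stand in for RoBERTa PLL values that would be computed elsewhere.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hyp: str, ref: str) -> float:
    """Character error rate of hyp measured against ref."""
    return edit_distance(hyp, ref) / max(len(ref), 1)


def posterior_from_scores(scores, tau):
    """ASSUMPTION: posterior = softmax(score / tau) over the N-best list."""
    scaled = [s / tau for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def mbr_cer_select(hyps, scores, tau=10.0):
    """Return the hypothesis with the lowest expected CER, plus all risks.

    Each hypothesis is scored against every hypothesis in the list,
    treated as a pseudo-reference and weighted by its posterior.
    """
    post = posterior_from_scores(scores, tau)
    risks = [sum(p * cer(h, r) for r, p in zip(hyps, post)) for h in hyps]
    best = min(range(len(hyps)), key=risks.__getitem__)
    return hyps[best], risks


# Toy 3-best list with equal (placeholder) language-model scores.
hyps = ["the cat sat", "the cat sad", "the bat sat"]
best, risks = mbr_cer_select(hyps, [0.0, 0.0, 0.0], tau=10.0)
print(best)
```

With equal scores the sketch picks the hypothesis closest on average to the others, the consensus candidate. A larger $τ$ flattens the posterior toward that consensus behaviour, while a smaller $τ$ concentrates the weight on the highest-scoring pseudo-reference.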