Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications
2026-02-12 • Computation and Language
Computation and Language • Machine Learning • Sound
AI summary
The authors address the challenge of making speech recognition fast and accurate on small devices by introducing Moonshine v2, a model that uses a local attention method called sliding-window self-attention. This approach lets the model focus on nearby audio frames rather than the whole utterance, reducing delay while keeping accuracy high. The model matches the accuracy of much bigger models while running faster and taking up less space, making it better suited to live speech tasks like transcription or translation on edge devices.
Latency • Time-to-First-Token (TTFT) • Transformer encoder • Self-attention • Sliding-window attention • Automatic Speech Recognition (ASR) • Edge devices • Streaming inference • Word error rate • Local context
Authors
Manjunath Kudlur, Evan King, James Wang, Pete Warden
Abstract
Latency-critical speech applications (e.g., live transcription, voice commands, and real-time translation) demand low time-to-first-token (TTFT) and high transcription accuracy, particularly on resource-constrained edge devices. Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR) because every frame can directly attend to every other frame, which resolves otherwise locally ambiguous acoustics using distant lexical context. However, this global dependency incurs quadratic complexity in sequence length, inducing an inherent "encode-the-whole-utterance" latency profile. For streaming use cases, this causes TTFT to grow linearly with utterance length, as the encoder must process the entire prefix before any decoder token can be emitted. To better meet the needs of on-device, streaming ASR use cases, we introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference while preserving strong local context. Our models achieve state-of-the-art word error rates across standard benchmarks, attaining accuracy on par with models 6x their size while running significantly faster. These results demonstrate that carefully designed local attention is competitive in accuracy with full attention at a fraction of the size and latency cost, opening new possibilities for interactive speech interfaces on edge devices.
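As a rough, illustrative sketch of the local-attention idea described in the abstract (not the authors' architecture or code), the NumPy snippet below computes single-head scaled dot-product attention under a banded mask: each frame attends only to frames within a fixed window, so the work attributable to any one frame is bounded by the window size rather than by the utterance length. The function name, shapes, and the symmetric window are assumptions made purely for illustration.

```python
# Minimal sketch of sliding-window (banded) self-attention.
# Assumption: single head, symmetric window, no padding or batching.
import numpy as np

def sliding_window_attention(x, w_q, w_k, w_v, window):
    """Scaled dot-product attention restricted to a local window.

    x: (T, d) array of encoder frames.
    window: number of frames visible on each side of a given frame.
    """
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(d)                       # (T, T) similarity scores
    # Banded mask: frame t may only attend to frames within +/- window of t.
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window  # (T, T) boolean band
    scores = np.where(mask, scores, -np.inf)              # block out-of-window frames
    # Row-wise softmax; masked entries contribute zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                     # (T, d) attended output

# Toy usage: 12 frames, model dimension 8, window of 2 frames per side.
rng = np.random.default_rng(0)
T, d, window = 12, 8, 2
x = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = sliding_window_attention(x, w_q, w_k, w_v, window)
print(out.shape)  # (12, 8)
```

With full attention the mask is all ones, so per-frame cost grows with T; with the band it is capped at 2 * window + 1 keys per query. In a streaming setting the lookahead half of the window would additionally be capped or removed, which is what keeps time-to-first-token bounded; the symmetric band here is a simplification, not a claim about the paper's exact configuration.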