Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space
2026-06-01 • Sound
Sound • Artificial Intelligence
AI summary
The authors developed Echo, an audio system that uses a single ViT (Vision Transformer) encoder to serve several speech tasks with no per-task fine-tuning at deployment. The encoder is trained so that speaker identity, phonetic content, and source routing all live in one shared latent space. On synthetic VoxCeleb2 mixtures with an unknown number of speakers, it diarizes and separates speakers well. The aim is to show that these tasks can coexist on one small encoder, not to set a new state of the art on any single task. The paper also documents the dead-ends and the remaining limit on end-to-end speech recognition.
Vision Transformer (ViT) • JEPA objective • Speaker diarization • Source separation • Latent space • PIT (Permutation Invariant Training) • SI-SDR (Scale-Invariant Signal-to-Distortion Ratio) • ArcFace • VoxCeleb2 • k-NN probe
Authors
Louis Mouchon
Abstract
We present Echo, a proof-of-concept audio system built around a single 25 M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised in stages to carry speaker identity, phonetic content, and dynamic source routing in the same 512-dimensional latent space, with no per-task fine-tuning at deployment. Light heads handle diarization (ArcFace + VBx) and dynamic source separation (null-target K-set prediction). On synthetic VoxCeleb2 mixtures with unknown K, the canonical stack reaches 15.00% blind DER, 97.80% PIT separation accuracy with +9.52 dB latent SI-SDR, and a +53.50-point speaker/content factorisation gap on a held-out k-NN probe. The point of Echo is not a new SOTA on any single task but the joint coexistence of three tasks on one encoder at this footprint. We document the design stage by stage, report the dead-ends, and identify the structural wall on end-to-end ASR through the VQ bottleneck that still bounds the PoC.
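For readers unfamiliar with the separation metrics quoted in the abstract, the following is a minimal sketch of the textbook SI-SDR definition and of a permutation-invariant (PIT) assignment built on it. It is not the authors' evaluation code: the abstract does not say how the latent variant of SI-SDR or the PIT accuracy is computed, so the function names `si_sdr` and `pit_assign`, the use of plain Python lists as stand-ins for latent vectors, and the `eps` stabiliser are all assumptions made for illustration.

```python
import itertools
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length sequences.

    Standard definition: project the estimate onto the reference, then
    compare the energy of that projection with the energy of the residual.
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def pit_assign(estimates, references):
    """Brute-force PIT: try every estimate-to-reference permutation.

    Returns (best_perm, best_mean_si_sdr), where best_perm[i] is the index
    of the reference matched to estimate i. Assumes both lists have the
    same length; cost grows factorially, so this is only for small K.
    """
    best_perm, best_score = None, -math.inf
    for perm in itertools.permutations(range(len(references))):
        score = sum(
            si_sdr(estimates[i], references[j]) for i, j in enumerate(perm)
        ) / len(perm)
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


# Toy example: the two estimates come out in swapped order.
references = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
estimates = [[0.1, 2.0, 0.0, -2.1], [0.9, 0.1, 1.1, 0.0]]
perm, mean_score = pit_assign(estimates, references)
```

In this toy case `perm` recovers the swap, pairing estimate 0 with reference 1 and estimate 1 with reference 0. Under this reading, a PIT accuracy figure would count how often the best permutation matches the ground-truth assignment, though the paper may define it differently.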