MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion
2026-06-08 • Sound
AI summary
The authors improved a real-time voice conversion system called MeanVC by creating MeanVC 2. Their new method uses a technique called future-receptive chunking to better handle small audio chunks, making the voice conversion more stable and faster. They also designed a new timbre encoder that builds a speaker’s voice features more reliably, even when the reference audio is low quality. These changes significantly improved conversion quality and reduced the delay from 211 ms to 110 ms.
voice conversion, zero-shot, streaming, diffusion transformer, chunk-wise processing, timbre encoder, mel-spectrogram, latency, future-receptive chunking, cross-attention
Authors
Guobin Ma, Yuxuan Xia, Yuepeng Jiang, Dake Guo, Hanke Xie, Jingbin Hu, Yanbo Wang, Lei Xie, Pengcheng Zhu
Abstract
Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder relies directly on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations, we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.
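To make the idea of scheduling past and future receptive fields per layer more concrete, the following is a minimal sketch of a chunk-level attention mask with bounded future context. It is not the authors' implementation: the abstract does not specify the schedule, so the function names (`chunk_attention_mask`, `total_lookahead_chunks`), the frame and chunk sizes, and the per-layer `(past_chunks, future_chunks)` values are all hypothetical and chosen only for illustration.

```python
def chunk_attention_mask(num_frames, chunk_size, past_chunks, future_chunks):
    """Return a num_frames x num_frames list of booleans.

    mask[i][j] is True when query frame i may attend to key frame j, i.e.
    when j's chunk lies within `past_chunks` chunks before and
    `future_chunks` chunks after i's chunk.
    """
    chunk_id = [t // chunk_size for t in range(num_frames)]
    return [
        [-past_chunks <= chunk_id[j] - chunk_id[i] <= future_chunks
         for j in range(num_frames)]
        for i in range(num_frames)
    ]


def total_lookahead_chunks(schedule):
    """Upper bound on how many future chunks the whole stack can see.

    Each layer can extend the reach of the previous one, so per-layer
    lookahead accumulates across the stack.
    """
    return sum(future for _, future in schedule)


# Hypothetical schedule, one (past_chunks, future_chunks) pair per decoder
# layer: only the first two layers look ahead, so future context stays bounded.
schedule = [(4, 1), (4, 1), (4, 0), (4, 0)]

# Assumed toy sizes: 8 frames grouped into chunks of 2 frames each.
masks = [chunk_attention_mask(8, 2, past, future) for past, future in schedule]
lookahead = total_lookahead_chunks(schedule)
```

Under these assumed values, the stack as a whole can reach at most two chunks into the future, which illustrates why bounding the lookahead per layer also bounds the algorithmic delay. In an attention implementation, each mask would typically be passed as the boolean attention mask of the corresponding layer.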