ArtNet: A JEPA-Like Articulatory Predictive Framework for Robust Zero-Shot Phoneme Recognition

2026-06-15

Sound, Artificial Intelligence
AI summary

The authors address the challenge of recognizing phonemes in languages the system hasn't seen before, which is difficult because sounds can vary a lot between languages. They created ArtNet, a method that predicts universal speech features based on how speech sounds are formed in the mouth, making it less sensitive to language differences. By combining ArtNet with a new alignment strategy, their approach performed better than existing methods on seven unseen languages, reducing errors in recognizing phonemes and their features.

zero-shot learning, cross-lingual phoneme recognition, articulatory features, self-supervised learning (SSL), variational information bottleneck (VIB), phoneme error rate (PER), phoneme feature error rate (PFER), joint-embedding predictive architecture (JEPA), vector-space inventory alignment (VSIA)
Authors
Zeqian Hu, Fuliang Weng, Shu Shang, Yaqian Zhou
Abstract
Zero-shot cross-lingual phoneme recognition is often hindered by the fragility of direct acoustic-to-symbol mapping, which is susceptible to language-specific variations. Echoing joint-embedding predictive architecture (JEPA) work in vision, we propose ArtNet, a framework that explores a structured feature prediction task based on articulatory features to enhance acoustic robustness. Specifically, ArtNet integrates an articulatory predictor, designed to extract universal articulatory representations from self-supervised learning (SSL) features, with a variational information bottleneck (VIB) to suppress language-specific variations. Experiments on seven unseen languages demonstrate that ArtNet, particularly when combined with the proposed vector-space inventory alignment (VSIA) strategy, significantly outperforms competitive baselines, achieving a 20.56% relative reduction in phoneme error rate (PER) and 7.01% in phoneme feature error rate (PFER).
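
As a point of reference for the metrics quoted in the abstract, the sketch below shows one common way to compute PER (Levenshtein distance between phoneme sequences, divided by the reference length) and a relative error reduction. It is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names and the example phoneme sequences are hypothetical, and PFER, which scores errors at the level of articulatory features, is not covered here.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences (unit costs)."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # drop a reference phoneme
            insertion = curr[j - 1] + 1  # extra phoneme in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER = edit distance / number of reference phonemes."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction: (baseline - improved) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical example: one substitution in a three-phoneme reference.
    reference = ["k", "a", "t"]
    hypothesis = ["k", "a", "d"]
    print(f"PER: {phoneme_error_rate(reference, hypothesis):.3f}")
```

Under this definition, a "relative reduction" compares the new error rate to the baseline's error rate rather than subtracting percentage points, so the same absolute gain counts for more when the baseline error is already low.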