The Impact of VAE Design on Latent Pose Representations for Diffusion-based Sign Language Production
2026-06-22 • Artificial Intelligence
Artificial Intelligence, Computer Vision and Pattern Recognition
AI summary
The authors study how design choices in a variational autoencoder (VAE), a neural network that compresses data into a compact latent space, affect the way sign language poses are represented there. They find that the usual measure of autoencoder quality, how accurately it reconstructs the poses, does not fully explain how well it supports generating new signing from text. In some cases, the structure of the latent space explains the quality of the generated sign pose sequences better than reconstruction accuracy does. Their experiments on the Phoenix14T dataset link differences in generative results, measured with back-translation BLEU scores, to latent space properties rather than to reconstruction accuracy alone.
latent diffusion, sign language production, variational autoencoder, latent space, autoencoder, reconstruction quality, generative model, text-to-sign generation, Phoenix14T dataset, back-translation BLEU score
Authors
Guilhem Fauré, Mostafa Sadeghi, Sam Bigeard, Slim Ouni
Abstract
Latent diffusion approaches to sign language production (SLP) rely on an initial stage that learns an encoding of sign pose sequences, enabling generative modeling in the resulting latent space. The autoencoder used in this stage is typically evaluated in terms of reconstruction quality using geometric metrics common in SLP. While informative, these metrics do not fully capture latent space properties that may influence the training and performance of the downstream generative model. In this work, we investigate how architectural and training objective design choices in a variational autoencoder (VAE) for sign pose encoding affect latent space structure, and how these differences translate into the performance of a latent diffusion model for text-to-sign generation. Our experiments on the Phoenix14T dataset show that variations in generative performance, measured through back-translation BLEU scores, can sometimes be better explained by differences in latent space properties than by VAE reconstruction accuracy alone.
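For readers unfamiliar with the trade-off the abstract refers to, the following is a minimal sketch of a generic VAE training objective: a reconstruction term plus a weighted KL term that shapes the latent space. The abstract does not state the authors' actual loss, so the function name `vae_loss`, the mean-squared-error reconstruction term, and the weight `beta` are all assumptions made for illustration.

```python
import math

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Generic VAE objective for one sample (illustrative, not the paper's loss).

    x, x_hat    : flat lists of pose coordinates (input and reconstruction)
    mu, log_var : lists holding the mean and log-variance of the latent Gaussian
    beta        : hypothetical weight on the KL term
    """
    # Reconstruction term: mean squared error over pose coordinates.
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    # KL divergence between N(mu, exp(log_var)) and the standard normal prior,
    # summed over latent dimensions.
    kl = -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, log_var))
    return recon + beta * kl, recon, kl
```

Raising `beta` in this sketch pulls the latent distribution toward the prior, typically at some cost in reconstruction error. That is one way a training-objective choice could change latent structure without a matching change in reconstruction metrics.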