Conditional Collapse in Sign Language Production: A Diagnostic and a Scaling Argument
2026-06-01 • Computer Vision and Pattern Recognition
AI summary
The authors studied how well computers can create sign language animations from written text. They found that common ways to measure quality, like motion similarity and translation scores, don’t always show if the animation truly matches the signed meaning. Instead, they suggest checking three things separately: starting pose, variety of signs, and how closely the signs match the original meaning. Testing many models showed that the animations rarely matched the meaning well, especially in longer sentences, but did better on simpler isolated signs. This points to the need for larger datasets with full sentence examples to improve sign language generation.
Sign Language Production • Fréchet Inception Distance (FID) • Back-translation BLEU score • How2Sign dataset • Motion autoencoder • Neural Sign Actors (NSA) • Gloss dataset • Output diversity • Target faithfulness • Latent representation
Authors
Rui Hong, Jana Košecká
Abstract
Sign Language Production (SLP) is the task of generating avatar sign language motion from natural language text. The quality of the generated motion is typically evaluated by a motion-space Fréchet distance (FID) and a back-translation (BT) BLEU score on benchmarks such as How2Sign. Both metrics can improve substantially while the underlying generator fails to faithfully represent the sign language gestures. In this work we propose to evaluate the generated motion at three independent levels: (τ1) initial-pose conditioning, (τ2) output diversity, and (τ3) target faithfulness. We compute these as pairwise-distance ratios using latent representations of a frozen motion autoencoder (MoAE). We evaluate 14 SLP model checkpoints on the How2Sign dataset, including a re-implemented Neural Sign Actors (NSA), and show that τ3 faithfulness is never attained, while FID varies by nearly two orders of magnitude and is uncorrelated with faithfulness. On the isolated gloss dataset ASL3DWord, by contrast, favorable τ3 can be attained, which isolates the size of the sentence-level paired dataset as the bottleneck.
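The abstract does not give the exact definitions of the pairwise-distance ratios, so the following is only a minimal sketch of what such diagnostics might look like. It assumes each motion clip has already been encoded into a fixed-length latent vector (for example by a frozen autoencoder) and uses plain Euclidean distance. The function names `diversity_ratio` and `faithfulness_ratio`, the choice of distance, and the toy vectors are all hypothetical illustrations, not the authors' formulation.

```python
import math
from itertools import combinations


def dist(a, b):
    """Euclidean distance between two latent vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pairwise(vectors):
    """Mean distance over all unordered pairs in a set of latents."""
    pairs = list(combinations(vectors, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def diversity_ratio(generated, targets):
    """Hypothetical diversity check: spread of generated latents
    relative to the spread of ground-truth latents.
    Values near 0 suggest the generator ignores its text input."""
    return mean_pairwise(generated) / mean_pairwise(targets)


def faithfulness_ratio(generated, targets):
    """Hypothetical faithfulness check: mean distance from each
    generated latent to its own target, divided by its mean distance
    to the targets of other prompts. Values well below 1 suggest the
    output is specific to its prompt; values near 1 suggest it is not."""
    n = len(generated)
    matched = sum(dist(generated[i], targets[i]) for i in range(n)) / n
    mismatched = sum(
        dist(generated[i], targets[j])
        for i in range(n)
        for j in range(n)
        if i != j
    ) / (n * (n - 1))
    return matched / mismatched


# Toy 2-D latents, invented purely for illustration.
targets = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # same output for every prompt
faithful = [[0.1, 0.0], [4.0, 0.1], [0.0, 3.9]]   # each output near its target

for name, gen in [("collapsed", collapsed), ("faithful", faithful)]:
    print(name, diversity_ratio(gen, targets), faithfulness_ratio(gen, targets))
```

In this toy setup a generator that emits the same motion for every prompt has zero diversity and a faithfulness ratio of 1, since its output is no closer to the matching target than to any other. This kind of failure can go unnoticed by a distribution-level score alone, which is the motivation for checking the levels separately.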