Bootstrapping Sign Language Annotations with Sign Language Models

2026-04-08 · Computer Vision and Pattern Recognition

AI summary

The authors address a bottleneck in sign language AI: detailed labeled data is scarce because annotation is expensive. They build a pipeline that automatically proposes annotations for sign language videos by combining their own fingerspelling and isolated sign recognition models with a few-shot LLM approach. Both models reach state-of-the-art results on their benchmark tasks, and a professional interpreter annotated nearly 500 videos to serve as a gold standard. The authors are releasing the human-labeled data together with over 300 hours of automatically generated annotations to help the field.

sign language interpretation, annotation, fingerspelling recognition, isolated sign recognition, few-shot learning, glosses, ASL STEM Wiki, pseudo-annotation, machine learning dataset, top-1 accuracy
Authors
Colin Lea, Vasileios Baltatzis, Connor Gillis, Raja Kushalnagar, Lorna Quandt, Leah Findlater
Abstract
AI-driven sign language interpretation is limited by a lack of high-quality annotated data. New datasets including ASL STEM Wiki and FLEURS-ASL feature professional interpreters and hundreds of hours of data but remain only partially annotated and thus underutilized, in part due to the prohibitive cost of annotating at this scale. In this work, we develop a pseudo-annotation pipeline that takes signed video and English text as input and outputs a ranked set of likely annotations, including time intervals, for glosses, fingerspelled words, and sign classifiers. Our pipeline uses sparse predictions from our fingerspelling recognizer and isolated sign recognizer (ISR), along with a K-Shot LLM approach, to estimate these annotations. In service of this pipeline, we establish simple yet effective baseline fingerspelling and ISR models, achieving state-of-the-art results on the FSBoard (6.7% CER) and ASL Citizen (74% top-1 accuracy) datasets. To validate the pipeline and provide a gold-standard benchmark, a professional interpreter annotated nearly 500 videos from ASL STEM Wiki with sequence-level gloss labels containing glosses, classifiers, and fingerspelled signs. These human annotations and over 300 hours of pseudo-annotations are being released in supplemental material.
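The two headline metrics in the abstract, character error rate (CER) for fingerspelling and top-1 accuracy for isolated sign recognition, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names are hypothetical, and the CER here is assumed to be corpus-level (total edit distance divided by total reference characters), whereas a given benchmark may instead average per-sample rates or normalize text differently.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER (assumption): total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def top1_accuracy(labels: Sequence[str], predictions: Sequence[str]) -> float:
    """Fraction of items whose highest-ranked predicted label equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)
```

Under these definitions, a 6.7% CER corresponds to roughly 6.7 character edits per 100 reference characters, and 74% top-1 accuracy means the highest-ranked sign prediction is correct for 74 of every 100 test clips.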