Toward Signing Activity Projection in Sign Language Interaction

2026-06-08

Computation and Language
AI summary

The authors studied how social robots can predict turn-taking with people who use sign language, as some systems already do for spoken language. They adapted Voice Activity Projection (VAP), a model that predicts future voice activity, to recordings of two-person signed conversations from the Public DGS Corpus, using hand, eye-region, and mouth-region features derived from pose estimation. Their results show that predicting whether a signer will hold or yield the turn (SHIFT/HOLD) is promising, especially with hand cues, while predicting upcoming shifts (SHIFT-prediction) remains difficult. They conclude that methods from spoken interaction need sign-language-specific event definitions to work well with sign language.

Social robots, Turn-taking, Sign language, Voice Activity Projection (VAP), Dyadic interaction, SHIFT/HOLD prediction, Pose estimation, Lexical signs, Modalities, Public DGS Corpus
Authors
Takao Obi, Wang Yusong, Koji Inoue, Kotaro Funakoshi
Abstract
Social robots must interact robustly not only with users assumed by speech-centered systems but also with diverse users whose communication relies on different modalities, e.g., sign language. One important capability gap is predictive turn-taking with signing users. Although Voice Activity Projection (VAP) has been successfully used to model future voice activity in spoken interaction, it remains unclear whether the framework transfers to sign language interaction. This paper presents an initial transfer study of adapting a VAP architecture to dyadic sign language interaction. Using interaction recordings from the Public DGS Corpus, we derive binary signing activity streams from lexical sign annotations and formulate proxy tasks for turn-taking prediction. The model uses pose-derived hand, eye-region, and mouth-region features extracted for each signer. The results show that SHIFT/HOLD prediction is promising, especially with hand cues, while SHIFT-prediction remains difficult. These findings provide initial evidence for both the promise and the current limitations of transferring predictive turn-taking models from spoken interaction to sign language interaction. Predictive modeling of sign language interaction still requires sign-language-specific event definitions that go beyond speech-derived categories.
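As an illustration of the data preparation the abstract describes, the following minimal sketch turns lexical sign annotation intervals into per-frame binary signing activity streams and labels each gap of mutual inactivity as HOLD or SHIFT. It is not the authors' implementation. The function names `activity_stream` and `label_gaps`, the frame rate, the interval format, and the rule that a gap counts only when exactly one signer is active immediately before and after it are all assumptions, modeled on how VAP-style evaluations are commonly set up for speech.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec) of one lexical sign (assumed format)


def activity_stream(intervals: List[Interval], n_frames: int, fps: int) -> List[int]:
    """Mark every frame covered by an annotated sign as active (1), else 0."""
    stream = [0] * n_frames
    for start, end in intervals:
        first = max(0, int(round(start * fps)))
        last = min(n_frames, int(round(end * fps)))
        for i in range(first, last):
            stream[i] = 1
    return stream


def label_gaps(a: List[int], b: List[int]) -> List[Tuple[int, int, str]]:
    """Label each mutual-inactivity gap as HOLD (same signer resumes) or SHIFT.

    Assumption: a gap is kept only if exactly one signer is active in the
    frame just before it and in the frame just after it.
    """
    events = []
    n = len(a)
    t = 0
    while t < n:
        if a[t] == 0 and b[t] == 0:
            start = t
            while t < n and a[t] == 0 and b[t] == 0:
                t += 1
            end = t  # first frame after the gap
            if start == 0 or end == n:
                continue  # gap touches the recording boundary
            if a[start - 1] + b[start - 1] != 1 or a[end] + b[end] != 1:
                continue  # overlap at a gap edge is ambiguous
            prev_signer = "A" if a[start - 1] else "B"
            next_signer = "A" if a[end] else "B"
            label = "HOLD" if prev_signer == next_signer else "SHIFT"
            events.append((start, end, label))
        else:
            t += 1
    return events


if __name__ == "__main__":
    fps, n_frames = 10, 50  # hypothetical values for the example
    a = activity_stream([(0.0, 1.0), (1.5, 2.0)], n_frames, fps)
    b = activity_stream([(2.5, 4.0)], n_frames, fps)
    for event in label_gaps(a, b):
        print(event)
```

In this toy input, signer A pauses and resumes (a HOLD), then stops before signer B begins (a SHIFT). The abstract's closing remark suggests that speech-derived labels of this kind are only a proxy, so a real pipeline would likely need sign-language-specific event definitions.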