Real-time body pose non-verbal communication with a consistency-based reliability measure
2026-06-08 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics
AI summary
The authors study how to understand what someone wants to communicate using only their body movements, without relying on facial expressions or speech. They created a new dataset of body poses that represent ten different intentions, useful for robots working in settings such as rescue missions, where quick and simple communication is necessary. They tested various computer models to see how well they can recognize intentions in real time on limited robot hardware. Additionally, they found that if the model’s predictions stay consistent over time, the prediction is more likely to be correct, and they provide a mathematical explanation of this idea.
2D body pose, communicative intent, body movement recognition, skeleton action recognition, real-time processing, embedded GPU, autoregressive model, self-consistency, robot-human communication, dataset benchmarking
Authors
Alina Marcu, Dragos Costea, Cristina Lazar, Marius Leordeanu
Abstract
Body movement communicates intent at distances and in conditions where neither the face nor speech can be captured. We study the recognition of communicative intent from 2D body pose alone. We argue that body motion is a reliable signal, especially in scenarios that require real-time, low-cost, on-device person-to-robot communication in long-distance environments, such as rescue missions. However, existing resources do not isolate this signal: affective corpora combine body, face, voice and text, while skeleton action-recognition benchmarks label the action performed rather than the message conveyed. We release a dataset of real frames of full-body pose covering ten communicative intents, and we compare it against other real (IPC) and synthetic (MotionLCM, VEO3.1, Kimodo) datasets that span a range of difficulty. We target systems that can run on a robot's limited onboard hardware. We benchmark multiple models, from skeleton graph classifiers to joint motion-forecasting networks, and report performance metrics together with frame rate on an embedded GPU (NVIDIA Orin Nano), since speed matters as much as accuracy in our scenario. Finally, we show that a model's own autoregressive self-consistency works as an unsupervised reliability signal. We give a short proof that bounds the probability that a self-consistent prediction is correct, show that this probability grows with the number of consistent steps, and identify the conditions under which a confident prediction can still be false. We benchmark this reliability signal against industry-standard metrics.
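To make the self-consistency idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the model emits one intent label per autoregressive step and measures how many of the most recent steps agree. The function names and the toy Bayes model are hypothetical: the toy model assumes independent steps with made-up agreement probabilities, and it only illustrates why a longer consistent run can raise the probability of being correct. It is not the bound proved in the paper.

```python
from typing import Sequence


def consistency_streak(labels: Sequence[int]) -> int:
    """Length of the trailing run of identical predicted labels.

    `labels` holds one predicted intent id per autoregressive step
    (hypothetical interface, not taken from the paper).
    """
    if not labels:
        return 0
    streak = 1
    # Walk backwards from the second-to-last step until a label differs.
    for prev in reversed(labels[:-1]):
        if prev != labels[-1]:
            break
        streak += 1
    return streak


def toy_posterior_correct(
    k: int,
    prior: float = 0.5,
    p_agree_if_correct: float = 0.9,
    p_agree_if_wrong: float = 0.6,
) -> float:
    """Toy Bayes posterior that a prediction is correct after k consistent steps.

    ASSUMPTION: steps are independent, and a step repeats the current label
    with probability `p_agree_if_correct` when that label is right and
    `p_agree_if_wrong` when it is wrong. All three parameters are
    illustrative values, not numbers from the paper.
    """
    num = prior * p_agree_if_correct ** k
    den = num + (1.0 - prior) * p_agree_if_wrong ** k
    return num / den


if __name__ == "__main__":
    steps = [3, 3, 1, 1, 1]  # made-up label sequence
    k = consistency_streak(steps)
    print(k, toy_posterior_correct(k))
```

In this toy model the posterior rises with the streak length only while `p_agree_if_correct` exceeds `p_agree_if_wrong`. A model that repeats wrong labels as steadily as right ones gains nothing from consistency, which is one simple way a confident prediction could still be false.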