VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring

2026-06-02 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Machine Learning, Robotics

AI summary

The authors created a system called VLESA that watches what a person is doing through a wearable camera and predicts whether their next actions might be unsafe. The system accounts for the fact that the same action can be safe or dangerous depending on the person's goal. They built a dataset that pairs camera frames with goal-dependent safety labels, which lets VLESA judge safety based on what the person intends to do without being retrained for new goals. On the ASIMOV-2.0 benchmark, their approach is more accurate than the baselines at intervening on unsafe actions in real time. They also made their code publicly available.

egocentric video, intent-dependent safety, goal-conditioned safety, Q-filter, GRPO, embodied AI, real-time intervention, ASIMOV-2.0 benchmark, action prediction, constrained decoding

Authors

Hanjiang Hu, Yiyuan Pan, Jiaxing Li, Xusheng Luo, Alexander Robey, Na Li, Yebin Wang, Changliu Liu

Abstract

As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount: physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. VLESA addresses intent-dependent safety, where identical actions can be safe or dangerous depending on context. We introduce a dataset pairing egocentric frames with goal-conditioned safety annotations, which enables a goal-conditioned safety Q-filter trained via GRPO that evaluates actions with respect to inferred intent without retraining. In addition, we propose an intent-action prediction agent that jointly infers goals and predicts future actions from video. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy at the exact ground-truth frame than the baselines, while the GRPO-trained Q-filter improves action safety by over 41 percentage points through goal-conditioned constrained decoding. Code is available at https://github.com/HanjiangHu/VLESA.
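The abstract does not spell out how the goal-conditioned Q-filter interacts with action prediction, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a safety score in [0, 1] for each (goal, action) pair and a fixed threshold. The function `filtered_choice`, the lookup table `TOY_Q`, the threshold value, and all scores are hypothetical and chosen purely for illustration; in VLESA the score would come from the GRPO-trained model.

```python
from typing import Callable, Dict, Optional, Tuple


def filtered_choice(
    goal: str,
    candidate_logprobs: Dict[str, float],
    safety_q: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Tuple[Optional[str], bool]:
    """Return (safest likely action, whether to intervene) for a given goal."""
    # Most likely next action according to the predictor, before filtering.
    predicted = max(candidate_logprobs, key=candidate_logprobs.get)
    # Intervene when the predicted action scores below the threshold for this goal.
    intervene = safety_q(goal, predicted) < threshold
    # Constrained choice: keep only candidates judged safe for this goal.
    safe = {
        action: logprob
        for action, logprob in candidate_logprobs.items()
        if safety_q(goal, action) >= threshold
    }
    chosen = max(safe, key=safe.get) if safe else None
    return chosen, intervene


# Hypothetical lookup table standing in for a learned safety Q-filter.
TOY_Q = {
    ("cook dinner", "turn on stove"): 0.9,
    ("leave the house", "turn on stove"): 0.1,
    ("cook dinner", "pick up keys"): 0.8,
    ("leave the house", "pick up keys"): 0.95,
}


def toy_safety_q(goal: str, action: str) -> float:
    # Unknown (goal, action) pairs are treated as unsafe in this sketch.
    return TOY_Q.get((goal, action), 0.0)


# The same predicted action distribution, evaluated under two different goals.
candidates = {"turn on stove": -0.2, "pick up keys": -1.7}
cooking = filtered_choice("cook dinner", candidates, toy_safety_q)
leaving = filtered_choice("leave the house", candidates, toy_safety_q)
```

In this toy setup the same predicted action, turning on the stove, passes the filter when the goal is cooking and triggers an intervention when the goal is leaving the house. Changing the goal changes the outcome without changing the scoring function, which mirrors the intent-dependent behavior the abstract describes.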
