LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

2026-06-22 • Robotics

AI summary

The authors test the safety of Vision-Language-Action (VLA) models, which control robots from visual and language inputs. They built a parametric benchmark that procedurally generates varied safety-critical scenarios, and they used a keypose-driven pipeline instead of human teleoperation to collect 19,664 collision-free demonstrations. Evaluating eight VLA models and two embodied foundation models, they found that diverse training leads to safer trajectories, but task success is still limited by how well the models synthesize trajectories and align their actions with the instructions. The pipeline, dataset, and failure-mode analysis are intended to support the development of safer robot controllers.

Vision-Language-Action models, robot manipulation, safety benchmarks, parametric scenario generation, keypose-driven data, domain randomization, trajectory synthesis, semantic alignment, embodied foundation models

Authors

Rongxu Cui, Zongzheng Zhang, Jingrui Pang, Haohan Chi, Jinbang Guo, Saining Zhang, Shaoxuan Xie, Xin Jin, Yao Mu, Jiaolong Yang, Guocai Yao, Xianyuan Zhan, Ya-Qin Zhang, Hao Zhao

Abstract

Despite the impressive manipulation capabilities of Vision-Language-Action (VLA) models, their operational safety under strict constraints remains largely unverified. To address this, we introduce a parametric safety benchmark to procedurally generate safety-critical scenarios with comprehensive stochasticity. To overcome the scalability bottlenecks of human teleoperation, we develop a novel keypose-driven data generation pipeline. Leveraging this infrastructure, we curate a large-scale dataset of 19,664 strictly collision-free demonstrations with extensive domain randomization. We then conduct a systematic cross-paradigm evaluation of eight VLA and two embodied foundation models. Our analysis reveals a critical generalization-safety tension: although high-diversity training fosters safer trajectories, task success remains fundamentally bottlenecked by sub-optimal trajectory synthesis and semantic misalignment. By providing a scalable pipeline, a robust dataset, and profound failure-mode insights, LIBERO-Safety establishes a crucial foundation for developing safe and reliable VLA models.
