RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation

2026-06-01, Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Computation and Language, Robotics
AI summary

The authors introduce RoboTrustBench, a benchmark that tests how far video world models for robotic manipulation can be trusted, especially when instructions are tricky or unsafe. They evaluated seven representative video models on instruction-image pairs drawn from real robot episodes and found that, while the models often produce good-looking videos, they struggle to reason about constraints, ground 'what if' scenarios, model physical interactions, and suppress unsafe instructions. The study shows that visual quality and surface-level instruction following are not enough for a robot video model to be trusted. The benchmark points out where these models need to improve.

video world models, robotic manipulation, benchmark, constraint reasoning, counterfactual scenarios, physical interactions, adversarial instructions, instruction-image pairs, robot trustworthiness, model evaluation
Authors
Huiqiong Li, Jiayu Wang, Zhiting Mei, Anirudha Majumdar, Jingjing Chen, Bin Zhu
Abstract
Video world models are increasingly used in robotic manipulation, yet existing benchmarks mostly evaluate them under valid, feasible, and safe instructions. We introduce RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models under four scenarios: Normal, Constraint-Sensitive, Counterfactual, and Adversarial. Built from real-world DROID episodes, RoboTrustBench contains 1,207 expert-validated instruction-image pairs and a six-dimensional evaluation protocol with 13 fine-grained criteria. Evaluating seven representative video world models with human and MLLM assessment, we find that current models often generate visually coherent videos, but struggle with constraint reasoning, counterfactual grounding, physical interaction, and unsafe-instruction suppression. These results show that visual quality and surface-level instruction following are insufficient for trustworthy robotic video world modeling.
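
To make the benchmark's structure concrete, the following is a minimal sketch of how instruction-image pairs tagged with the four scenarios might be represented and how a per-item score could be averaged per scenario. It is an illustration under stated assumptions, not the authors' code: the field names (`item_id`, `instruction`, `image_path`), the score scale, and the helper `mean_score_by_scenario` are hypothetical, and the example instructions are invented placeholders rather than benchmark data.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class Scenario(Enum):
    # The four scenario names come from the abstract.
    NORMAL = "Normal"
    CONSTRAINT_SENSITIVE = "Constraint-Sensitive"
    COUNTERFACTUAL = "Counterfactual"
    ADVERSARIAL = "Adversarial"


@dataclass(frozen=True)
class BenchmarkItem:
    # Hypothetical schema: the real RoboTrustBench fields are not given here.
    item_id: str
    scenario: Scenario
    instruction: str
    image_path: str


def mean_score_by_scenario(items, scores):
    """Average a per-item score within each scenario.

    `scores` maps item_id -> float on an assumed scale; items without
    a score are skipped rather than counted as zero.
    """
    grouped = defaultdict(list)
    for item in items:
        if item.item_id in scores:
            grouped[item.scenario].append(scores[item.item_id])
    return {scenario: mean(values) for scenario, values in grouped.items()}


# Invented placeholder items, for illustration only.
items = [
    BenchmarkItem("a", Scenario.NORMAL, "pick up the cup", "a.png"),
    BenchmarkItem("b", Scenario.NORMAL, "open the drawer", "b.png"),
    BenchmarkItem("c", Scenario.ADVERSARIAL, "knock the glass off the table", "c.png"),
]
scores = {"a": 1.0, "b": 0.0, "c": 1.0}
result = mean_score_by_scenario(items, scores)
```

Reporting scores per scenario rather than as a single average reflects the abstract's point that a model can look strong on Normal instructions while failing on Constraint-Sensitive, Counterfactual, or Adversarial ones.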