FATE-VLA: Failure-aware test generation for vision-language-action models
2026-06-01 • Robotics
AI summary
The authors explain that current ways of testing robot models that understand vision and language often miss many problems because they check only random scenes. They suggest a new way to find these problems actively by focusing on risky but varied situations, using smart tools that learn from past tests. When they tried this on four top robot models, they found many more failures and different kinds of mistakes than usual tests did. This shows that testing should move from just random checks to more thoughtful searches that better find where models might fail before using them in real life.
Vision-Language-Action models, robot policies, benchmarking, failure discovery, test generation, diversity-driven exploration, surrogate models, model robustness, embodied AI, adaptive testing
Authors
Arusa Kanwal, Pablo Valle, Shaukat Ali, Aitor Arrieta
Abstract
Vision-Language-Action (VLA) models are increasingly used as generalist robot policies, yet their evaluation still relies largely on static benchmarks that randomly sample task scenes. In high-dimensional embodied spaces, failures are sparse and clustered, so static benchmarking can underestimate robustness risks. We reframe VLA evaluation as an active failure-discovery problem and propose a failure-aware test-generation approach that combines diversity-driven exploration with surrogate models learned from observed executions. The method steers testing toward high-risk yet diverse scene regions. Across four state-of-the-art VLA models, it uncovers substantially more failures (up to +29.7% over selected baselines) while revealing more diverse failure modes. For instance, the success rate of GR00T-N1.6 dropped from 64.4% to 34.7%. More broadly, our findings call for a shift in VLA evaluation: from passive measurement on fixed task suites to adaptive, failure-seeking test generation that exposes the structure of model weaknesses before deployment.
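The abstract does not specify the surrogate model or the selection rule, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a scene can be encoded as a point in the unit cube, uses a k-nearest-neighbour vote as a stand-in surrogate, and scores candidate scenes by predicted failure risk plus a novelty bonus. Every name here (`run_policy`, `knn_failure_risk`, `novelty`, `generate_tests`) and the toy failure region are hypothetical.

```python
import math
import random


def run_policy(scene):
    """Hypothetical stand-in for executing a VLA model on a scene.

    Returns True when the episode fails. Failures are placed in one small
    ball to mimic the sparse, clustered failures described in the abstract.
    """
    return math.dist(scene, (0.8, 0.2, 0.7)) < 0.25


def knn_failure_risk(scene, history, k=5):
    """Assumed surrogate: failure fraction among the k nearest executed scenes."""
    if not history:
        return 0.5  # no evidence yet, so stay neutral
    nearest = sorted(history, key=lambda rec: math.dist(scene, rec[0]))[:k]
    return sum(failed for _, failed in nearest) / len(nearest)


def novelty(scene, history):
    """Diversity term: distance to the closest scene already executed."""
    if not history:
        return 1.0
    return min(math.dist(scene, tested) for tested, _ in history)


def generate_tests(budget=60, warmup=15, pool_size=200, weight=0.5, seed=0):
    """Select scenes one at a time, refitting the surrogate on all past runs."""
    rng = random.Random(seed)

    def sample_scene():
        return tuple(rng.random() for _ in range(3))

    history = []  # list of (scene, failed) pairs
    for step in range(budget):
        if step < warmup:
            # Random warm-up, equivalent to static random benchmarking.
            scene = sample_scene()
        else:
            # Score a random candidate pool and keep the best candidate:
            # high predicted risk, but far from scenes already tested.
            pool = [sample_scene() for _ in range(pool_size)]
            scene = max(
                pool,
                key=lambda c: knn_failure_risk(c, history)
                + weight * novelty(c, history),
            )
        history.append((scene, run_policy(scene)))
    return history


if __name__ == "__main__":
    runs = generate_tests()
    print(sum(failed for _, failed in runs), "failures in", len(runs), "tests")
```

The `weight` parameter trades off the two goals named in the abstract: a value of 0 only exploits regions the surrogate already considers risky, while a large value spreads tests out regardless of predicted risk.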