Critical Interval MSE: Toward Reliable Offline Validation for Robot Manipulation Policies

2026-06-29 • Robotics

Robotics, Artificial Intelligence

AI summary

The authors address the problem that testing robot policies in the real world is slow and expensive, making it hard to quickly improve them. They propose a new way to check how well a robot policy might work without actually running the robot, called Critical Interval MSE (CI-MSE). This method focuses on important parts of the task and aligns actions better to match real robot behavior. Their tests show CI-MSE predicts real-world success much better than usual error measures, helping researchers improve robot policies more efficiently.

robot policies, real-world evaluation, validation loss, mean squared error (MSE), offline validation, policy iteration, simulation, Spearman's rank correlation, distribution shift

Authors

Haoxu Huang, Tongsam Zheng, Yifan Chen, Jiacheng You, Yang Gao

Abstract

Real-world evaluation is the gold standard for robot policies because it tests them against the physical conditions and deployment challenges they are ultimately designed to handle. However, real-world evaluation is also the bottleneck for iterating on robot policies: it is costly, difficult to reproduce, and often too sparse to reliably compare nearby model variants. A straightforward proxy for performance is validation loss on expert demonstrations, but this proxy is often poorly correlated with real-world performance. In this paper, we introduce Critical Interval MSE (CI-MSE), an intuitively simple yet effective offline validation metric. CI-MSE restricts error computation to task-critical segments and pairs it with simple action-alignment procedures that better match rollout-time behavior. Across simulation and real-world experiments, CI-MSE yields a stronger correlation between validation error and rollout performance than raw MSE. Across a wide range of policy checkpoints, CI-MSE achieves a Spearman's rank correlation of $-0.87$, much closer to the ideal value of $-1$ than raw MSE's $-0.61$, demonstrating a significant improvement. We show through sensitivity analysis that our metric is robust to a wide range of hyperparameters. We further study the effectiveness of CI-MSE under evaluation distribution shifts and suggest design boundaries when using this metric. In summary, this paper provides a simple and reliable offline validation tool for accelerating policy iteration. Project webpage: https://ci-mse.github.io/
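
The two quantities the abstract relies on, an MSE restricted to task-critical segments and a Spearman's rank correlation between validation error and rollout success, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the representation of critical intervals as half-open `(start, end)` timestep ranges, and the toy numbers are all assumptions made for illustration, and the action-alignment procedures mentioned in the abstract are not modeled here.

```python
import math
from typing import List, Sequence, Tuple

Trajectory = Sequence[Sequence[float]]  # T timesteps, each a D-dimensional action


def raw_mse(pred: Trajectory, expert: Trajectory) -> float:
    """Mean squared error over every timestep and action dimension."""
    errs = [(p - e) ** 2 for ps, es in zip(pred, expert) for p, e in zip(ps, es)]
    return sum(errs) / len(errs)


def critical_interval_mse(
    pred: Trajectory, expert: Trajectory, intervals: Sequence[Tuple[int, int]]
) -> float:
    """MSE restricted to timesteps inside the given intervals.

    `intervals` holds half-open (start, end) timestep ranges. This format is a
    hypothetical choice for the sketch, not something the abstract specifies.
    """
    keep = set()
    for start, end in intervals:
        keep.update(range(start, min(end, len(pred))))
    errs = [(p - e) ** 2 for t in sorted(keep) for p, e in zip(pred[t], expert[t])]
    if not errs:
        raise ValueError("no timesteps fall inside the critical intervals")
    return sum(errs) / len(errs)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy trajectory: the policy is wrong only at timestep 2 (say, a grasp).
pred = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
expert = [[0.0, 0.0], [0.0, 0.0], [2.0, 2.0], [0.0, 0.0]]
full_error = raw_mse(pred, expert)
critical_error = critical_interval_mse(pred, expert, [(2, 3)])

# Made-up per-checkpoint validation errors and rollout success rates.
val_errors = [0.1, 0.2, 0.3, 0.4]
success_rates = [0.9, 0.7, 0.4, 0.1]
rho = spearman_rho(val_errors, success_rates)
```

In the toy trajectory the error at the single critical timestep is diluted by the three error-free timesteps under raw MSE, whereas the interval-restricted version reports it undiluted. A lower error should go with a higher success rate, which is why the ideal correlation is $-1$ rather than $+1$.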
