AI summary
The authors investigate whether data from early robot tests (pre-deployment rollouts) can later be reused to pick the best robot policy, with the policies themselves kept frozen. Their method, RouterVLA, profiles each policy from recorded probe trials and then scores the selected policy on a separate trial that was not used for profiling. Across 34,752 LIBERO-Plus rollout records, a simple probe-success rule raised held-out success from 0.4686 to 0.6149, and learned scorers were statistically indistinguishable from that rule. Reusing the same trial to both build profiles and score the selection inflated the measured gain by 1.87×, so the evaluation data must be kept separate. Overall, scaling improves individual policies, while careful selection among them improves the system built from those policies.
Robot policy selection, Vision-language-action policies, Pre-deployment rollouts, RouterVLA, Cross-fitting, Probe-success rule, Held-out success, Model scaling, Commissioning-aware routing, Outcome separation
Authors
Xingyu Ren, Chugang Yi, Ge Ma, Youran Sun
Abstract
We study whether pre-deployment evaluation rollouts can be reused to supervise policy selection. Robot teams routinely smoke-test candidate vision-language-action (VLA) policies, then compress those trials into a single global winner. RouterVLA evaluates this idea with outcome-disjoint cross-fitting: recorded probes build a profile for each frozen expert, and a separate trial scores the selected expert without entering its profile. Across 34,752 LIBERO-Plus rollout records, a transparent probe-success rule raises held-out success from 0.4686 to 0.6149, a +14.64pp gain. Under the scalar-only profiles studied here, learned scorers are statistically indistinguishable from this rule, showing that commissioning carries the routing value while extra scalar scorer capacity does not create it. Reusing the scored trial inflates the measured gain by $1.87\times$, so credible ledger routing needs outcome separation. Model scaling improves individual policies, while commissioning-aware routing improves the system built from them.
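The outcome-disjoint idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout `(expert, task, trial, success)`, the per-task success-rate profile, the alphabetical tie-break, the function names `route_and_score` and `global_winner_score`, and the toy grid are all assumptions made for illustration. The sketch compares a best-single-expert baseline, a probe-success router whose profiles exclude the scored trial, and a leaky variant that reuses the scored trial.

```python
from collections import defaultdict

def global_winner_score(records):
    """Success rate of the single expert with the best overall record (assumed baseline)."""
    outcomes = defaultdict(list)
    for expert, _task, _trial, success in records:
        outcomes[expert].append(success)
    return max(sum(v) / len(v) for v in outcomes.values())

def route_and_score(records, leak=False):
    """Mean success of a probe-success router over every (task, trial) slot.

    leak=False: the scored trial is excluded from every profile (outcome-disjoint).
    leak=True:  the scored trial also enters the profiles (the inflated setting).
    Assumes every expert has an outcome for every (task, trial) slot.
    """
    outcome = {(e, t, k): s for e, t, k, s in records}
    experts = sorted({e for e, _, _, _ in records})
    slots = sorted({(t, k) for _, t, k, _ in records})

    scores = []
    for task, trial in slots:
        profile = {}
        for e in experts:
            # Probes: same expert, same task, optionally minus the scored trial.
            probes = [s for (e2, t2, k2), s in outcome.items()
                      if e2 == e and t2 == task and (leak or k2 != trial)]
            profile[e] = sum(probes) / len(probes) if probes else 0.0
        # Probe-success rule: highest probe success rate wins; ties go to the
        # alphabetically first expert (an arbitrary choice for this sketch).
        chosen = max(experts, key=lambda e: profile[e])
        scores.append(outcome[(chosen, task, trial)])
    return sum(scores) / len(scores)

# Synthetic toy grid (hypothetical data): two experts, three tasks, three trials each.
grid = {
    ("A", "t1"): [1, 1, 1], ("B", "t1"): [0, 0, 0],
    ("A", "t2"): [0, 0, 0], ("B", "t2"): [1, 1, 1],
    ("A", "t3"): [1, 0, 0], ("B", "t3"): [0, 1, 0],
}
records = [(e, t, k, s) for (e, t), row in grid.items() for k, s in enumerate(row)]

baseline = global_winner_score(records)          # 4/9
disjoint = route_and_score(records, leak=False)  # 6/9
leaky = route_and_score(records, leak=True)      # 7/9
```

On this toy grid the router beats the single-expert baseline because each expert is strong on a different task, and the leaky variant reports a higher score than the outcome-disjoint one only because the scored outcome helped choose the expert on the noisy task. The numbers are synthetic and bear no relation to the LIBERO-Plus results.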