Cloak: Zero-Shot Cross-Embodiment Manipulation by Masking the End-Effector from the VLA

2026-06-22

Robotics
AI summary

The authors developed Cloak, a training recipe that keeps a robot's policy from depending on the appearance of its own hand in the wrist-camera view. The hand is masked out of the image, and the mask is varied during training, so the model learns to act regardless of which end-effector is attached. They trained a model on data from a single parallel-jaw gripper and showed that it transfers to another gripper, another arm, and a five-fingered hand without collecting any new data. As a result, previously collected data can be reused even when the robot hardware changes.

Vision-Language-Action model, zero-shot transfer, end-effector, wrist camera, masking, robot embodiment, simulation, parallel-jaw gripper, cross-embodiment transfer
Authors
Michael Piseno, Guy Tevet, C. Karen Liu
Abstract
We present Cloak, a training recipe that endows a Vision-Language-Action (VLA) model with zero-shot cross-embodiment transfer by cloaking the end-effector from its own wrist camera. The end-effector occupies a large and consistent region of the wrist view, and masking it allows for embodiment-agnostic visual reasoning. Cloak renders a mask in simulation from the robot's known geometry, accurately and in real time, with no segmentation or generative models. During training, we augment the mask so the model generalizes to embodiments unseen at training time. We demonstrate the recipe with Cloak-VLA, a VLA trained with Cloak on a single parallel-jaw gripper dataset. No data from new embodiments is ever collected. Cloak-VLA transfers zero-shot to various unseen embodiments, including another gripper, another arm, and a five-fingered hand, while preserving the source embodiment's performance. By decoupling the wrist view from its own embodiment, Cloak allows data to outlive the hardware it was collected on.
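
The masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the mask is augmented, so the random dilation and pixel shift below are assumptions chosen only for illustration. The function names (`dilate`, `augment_mask`, `apply_mask`) and parameters (`max_dilate`, `max_shift`, `fill`) are hypothetical, and the boolean `mask` stands in for what Cloak renders in simulation from the robot's known geometry.

```python
import numpy as np


def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Grow a boolean HxW mask by `radius` pixels (square neighbourhood)."""
    mask = mask.astype(bool)
    if radius <= 0:
        return mask.copy()
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    # OR together every shifted copy of the mask inside the neighbourhood.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def shift(mask: np.ndarray, dy: int, dx: int) -> np.ndarray:
    """Translate a mask by (dy, dx) pixels without wrapping around the edges."""
    h, w = mask.shape
    out = np.zeros_like(mask)
    src_y = slice(max(0, -dy), h - max(0, dy))
    dst_y = slice(max(0, dy), h - max(0, -dy))
    src_x = slice(max(0, -dx), w - max(0, dx))
    dst_x = slice(max(0, dx), w - max(0, -dx))
    out[dst_y, dst_x] = mask[src_y, src_x]
    return out


def augment_mask(mask, rng, max_dilate=8, max_shift=6):
    """ASSUMED augmentation: random dilation followed by a random shift."""
    radius = int(rng.integers(0, max_dilate + 1))
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    return shift(dilate(mask, radius), dy, dx)


def apply_mask(image: np.ndarray, mask: np.ndarray, fill: int = 0) -> np.ndarray:
    """Return a copy of an HxWx3 image with masked pixels set to `fill`."""
    out = image.copy()
    out[mask] = fill
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wrist_image = np.full((64, 64, 3), 255, dtype=np.uint8)  # placeholder frame
    gripper_mask = np.zeros((64, 64), dtype=bool)
    gripper_mask[40:64, 16:48] = True  # placeholder for a rendered mask
    cloaked = apply_mask(wrist_image, augment_mask(gripper_mask, rng))
    print(cloaked.shape, int((cloaked == 0).all(axis=-1).sum()))
```

Varying the mask shape during training, rather than using the exact silhouette of the source gripper, is one plausible way to keep the policy from inferring the embodiment from the mask outline itself. Whether Cloak uses dilation, shifts, or something else entirely is not stated in the abstract.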