DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation
2026-05-28 • Robotics
Robotics, Machine Learning
AI summary
The authors present DynaFLIP, a new method to improve how robots understand motion by training visual perception systems with data that includes images, language, and 3D flow (motion). Instead of relying only on static images or language, their approach aligns these three types of information closely to help the robot better grasp how things move and change in the scene. This leads to better robot manipulation skills, especially in new situations, by focusing on parts of the scene important for action. Their method consistently outperforms existing techniques in both simulated and real-world tests.
robot manipulation, visual encoder, multimodal pre-training, 3D flow, contrastive learning, simplex volume minimization, representation learning, out-of-distribution generalization, vision-language alignment
Authors
Jusuk Lee, Seungjae Lee, Jonghun Shin, Hoseong Jung, Sungha Kim, Daesol Cho, H. Jin Kim, Jia-Bin Huang, Furong Huang
Abstract
Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space -- a smaller simplex volume indicating stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
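
The abstract does not give the exact loss, so the following is only a minimal sketch of how the three terms could fit together, under stated assumptions. It assumes the simplex volume is measured by the Gram determinant of the three unit-norm embeddings (image, language, 3D flow), and that the contrastive term is a standard InfoNCE loss. The function names, the weights `w_cos` and `w_con`, and the temperature are hypothetical and not taken from the paper.

```python
import math

def cos(u, v):
    # Dot product; equals cosine similarity because inputs are assumed unit-norm.
    return sum(a * b for a, b in zip(u, v))

def gram_volume(img, txt, flow):
    # Assumed volume measure: sqrt of the 3x3 Gram determinant of the embeddings.
    # For unit vectors, det(G) = 1 + 2abc - a^2 - b^2 - c^2 (a, b, c: pairwise cosines).
    a, b, c = cos(img, txt), cos(img, flow), cos(txt, flow)
    det = 1.0 + 2.0 * a * b * c - a * a - b * b - c * c
    return math.sqrt(max(det, 0.0))  # clamp tiny negative rounding errors

def cosine_penalty(img, txt, flow):
    # Hypothetical cosine regularizer: mean (1 - cosine) over the three pairs.
    pairs = [(img, txt), (img, flow), (txt, flow)]
    return sum(1.0 - cos(u, v) for u, v in pairs) / 3.0

def info_nce(anchors, positives, temperature=0.1):
    # Standard InfoNCE: positives[i] is the match for anchors[i],
    # every other entry in the batch acts as a negative.
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [cos(a, p) / temperature for p in positives]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(anchors)

def total_loss(imgs, txts, flows, w_cos=1.0, w_con=1.0):
    # Hypothetical combination of the three terms; the weights are assumptions.
    n = len(imgs)
    vol = sum(gram_volume(i, t, f) for i, t, f in zip(imgs, txts, flows)) / n
    reg = sum(cosine_penalty(i, t, f) for i, t, f in zip(imgs, txts, flows)) / n
    con = 0.5 * (info_nce(imgs, txts) + info_nce(imgs, flows))
    return vol + w_cos * reg + w_con * con

# Three unit vectors lying in one plane, 120 degrees apart: clearly not aligned.
s = math.sqrt(3.0) / 2.0
coplanar = ([1.0, 0.0, 0.0], [-0.5, s, 0.0], [-0.5, -s, 0.0])
# Three identical unit vectors: perfectly aligned.
aligned = ([0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0])

print("coplanar:", gram_volume(*coplanar), cosine_penalty(*coplanar))
print("aligned: ", gram_volume(*aligned), cosine_penalty(*aligned))
```

Under this assumed volume measure, the coplanar triplet and the aligned triplet both have (numerically) zero volume, which is one way to read the "geometric ambiguity" the abstract mentions: volume alone cannot tell them apart, while the cosine term can. The contrastive term additionally separates each triplet from the other samples in the batch, which guards against all embeddings collapsing to a single point.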