SkyJEPA: Learning Long-Horizon World Models for Zero-Shot Sim-to-Real Control of Quadrotors

2026-06-22 • Robotics

Robotics, Machine Learning

AI summary

The authors aim to improve how quadrotors predict their own future motion so that they can be controlled more reliably during agile flight under uncertainty. They use a Joint Embedding Predictive Architecture (JEPA), which models the drone's dynamics in a learned latent space instead of relying on conventional autoregressive prediction, whose errors compound over long horizons. The model adds a physics-inspired prober that maps the frozen latent representation back to interpretable drone states, which supports longer and more accurate predictions. Paired with a sampling-based controller that runs in real time on embedded hardware, the model was evaluated in open-loop prediction tests and outdoor closed-loop flights, where it transferred from simulation to the real world without additional training. The authors also built an automated pipeline for generating training data, which reduces the need for expensive and unsafe real-world data collection.

Neural network dynamics models, Joint Embedding Predictive Architectures (JEPA), Latent space, Autoregressive rollout, Quadrotor control, Sim-to-real transfer, Sampling-based optimal control, Dataset generation, Robotic navigation, Physics-inspired probing

Authors

Pratyaksh Rao, Wancong Zhang, Randall Balestriero, Yann LeCun, Giuseppe Loianno

Abstract

Accurate dynamics models are critical for informed decision-making in robotic systems, particularly for agile aerial vehicles operating under uncertainty. Neural network dynamics models are attractive for capturing complex nonlinear effects, but existing predictive approaches struggle with long-horizon forecasting because their autoregressive rollout mechanism amplifies errors over time. Joint Embedding Predictive Architectures (JEPAs) offer a compelling alternative by modeling dynamics in latent space, yet prior JEPA-style methods for robot navigation have been studied primarily for kinematic-level planning, with limited investigation in high-frequency control. In this work, we introduce SkyJEPA, a JEPA-style model for real-time quadrotor control. The proposed approach combines a latent dynamics model with a novel physics-inspired prober that maps frozen latents to interpretable states, enabling physically grounded long-horizon prediction. Additionally, we combine the learned model with a sampling-based optimal control solution to take advantage of its predictive capabilities for real-time control on embedded hardware. Finally, to reduce the dependence on expensive and unsafe real-world data collection, we develop a structured pipeline for automated dataset generation. Extensive open-loop and outdoor closed-loop experiments demonstrate accurate prediction, robust zero-shot sim-to-real transfer, and strong generalization across diverse operating conditions.
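
The interplay the abstract describes (encode the state, roll the dynamics forward in latent space, probe the latents back to interpretable states, and score sampled action sequences) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the system is a 1-D double integrator rather than a quadrotor, the "encoder", "latent predictor", and "prober" are fixed linear maps standing in for learned networks, and the planner is plain random shooting, a simple sampling-based scheme chosen here because the abstract does not name the solver. All function and constant names are hypothetical.

```python
import numpy as np

DT = 0.05
# Toy double integrator (position, velocity); NOT a quadrotor model.
A = np.array([[1.0, DT], [0.0, 1.0]])
B = np.array([0.5 * DT**2, DT])

# Hypothetical stand-ins for learned components (fixed linear maps).
E = np.array([[2.0, 1.0], [0.5, 1.5]])  # "encoder": state -> latent
E_INV = np.linalg.inv(E)                # "prober": latent -> state
A_Z = E @ A @ E_INV                     # latent-space transition
B_Z = E @ B                             # latent-space action effect


def encode(state):
    """Map a state of shape (2,) to a latent of shape (2,)."""
    return E @ state


def probe(latents):
    """Map latents of shape (K, 2) back to interpretable states (K, 2)."""
    return latents @ E_INV.T


def predict_latent(latents, actions):
    """One latent step for K parallel rollouts: latents (K, 2), actions (K,)."""
    return latents @ A_Z.T + actions[:, None] * B_Z


def rollout_cost(z0, action_seqs, target):
    """Roll K action sequences (K, H) forward in latent space and score them."""
    num_seqs, horizon = action_seqs.shape
    z = np.tile(z0, (num_seqs, 1))
    cost = np.zeros(num_seqs)
    for t in range(horizon):
        z = predict_latent(z, action_seqs[:, t])
        s = probe(z)  # the cost is defined on probed, interpretable states
        cost += (s[:, 0] - target) ** 2 + 0.1 * s[:, 1] ** 2
        cost += 1e-3 * action_seqs[:, t] ** 2
    return cost


def plan_actions(state, nominal, target, rng, num_samples=256, sigma=2.0):
    """Random shooting: perturb the nominal plan and keep the cheapest sample."""
    noise = rng.normal(0.0, sigma, size=(num_samples, nominal.size))
    candidates = nominal + noise
    candidates[0] = nominal  # keep the unperturbed plan as a fallback
    costs = rollout_cost(encode(state), candidates, target)
    best = int(np.argmin(costs))
    return candidates[best], float(costs[best])


rng = np.random.default_rng(0)
start = np.array([0.0, 0.0])
best_plan, best_cost = plan_actions(start, np.zeros(20), 1.0, rng)
```

Because the unperturbed nominal plan is always among the candidates, the selected plan is never worse than the plan the controller started from. In a receding-horizon loop, only the first action of `best_plan` would be applied before replanning from the newly measured state.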
