FalconTrack: Photorealistic Auto-Labeled Perception and Physics-Aware Vision-Based Aerial Tracking

2026-06-29 · Robotics

Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition
AI summary

The authors created FalconTrack, a system that lets drones track objects visually without GPS. They used a photorealistic, editable Gaussian Splatting simulator to generate labeled images automatically (about 10k images in under 20 minutes), which avoids the cost of manual annotation. FalconTrack combines a multi-head perception model with physics-aware tracking so that a model trained only on simulated data transfers to the real world without fine-tuning. In the reported tests, the perception model outperforms two baselines and stays consistent in unseen simulated and real scenes. The onboard system runs at about 25 Hz on real hardware and keeps tracking during fast maneuvers where the target leaves the camera view, a case in which a mask-centered vision baseline drops to 60% success.

aerial tracking, GPS-denied environments, sim-to-real transfer, Gaussian splatting, multi-head perception, 6-DoF pose, EKF (Extended Kalman Filter), photorealistic simulator, zero-shot learning, reprojection consistency
Authors
Yan Miao, Karteek Gandiboyina, Noah Giles, Hideki Okamoto, Bardh Hoxha, Georgios Fainekos, Sayan Mitra
Abstract
Vision-based aerial tracking is critical in GPS-denied environments. Reliable perception for tracking depends on large-scale labeled data, yet most photorealistic datasets rely on heavy manual annotation and are time-consuming to produce. We present FalconTrack, a unified perception-and-tracking framework that (i) leverages a photorealistic editable simulator for automated label generation and (ii) combines multi-head perception with physics-aware tracking for zero-shot sim-to-real transfer. FalconTrack provides an automated labeling pipeline in a Gaussian Splatting simulator that isolates target Gaussians from short object videos and composites them with randomized backgrounds to generate RGB, mask, class, and 6-DoF pose labels, producing about 10k labeled images in under 20 minutes. Using this dataset, we train a multi-head perception module with staged learning and reprojection consistency, and fuse its outputs with class-conditioned dynamics priors in an EKF for tracking. Our perception model outperforms two baselines and reaches 96-100% class accuracy in zero-shot sim-to-real transfer on three geometrically diverse objects and two environments, while maintaining consistent performance in unseen simulated and real scenes. In closed-loop visual tracking on real hardware, the onboard system runs at about 25 Hz and achieves 100% success in sim-to-real F1-tenth and gate tracking on five trajectories across two environments, while a mask-centered vision baseline drops to 60% success on F1-tenth during fast out-of-view scenarios.
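To make the idea of fusing perception outputs with class-conditioned dynamics priors more concrete, here is a minimal sketch of a single-axis constant-velocity filter whose process noise is selected by the predicted class. It is an illustration under stated assumptions, not the authors' implementation: the names `ACCEL_STD`, `AxisFilter`, and `track`, the acceleration values, and the measurement variance are all hypothetical, and the only number taken from the abstract is the 25 Hz rate (`dt = 0.04`). Because both the motion and measurement models here are linear, the EKF equations reduce to an ordinary Kalman filter; a nonlinear camera model would replace the fixed measurement row with a Jacobian.

```python
from dataclasses import dataclass

# Hypothetical per-class acceleration std-devs (m/s^2); NOT values from the paper.
# A fast ground vehicle gets a loose prior, a static gate a tight one.
ACCEL_STD = {"f1tenth": 3.0, "gate": 0.05}


@dataclass
class AxisFilter:
    """Constant-velocity Kalman filter for one position axis."""
    p: float = 0.0  # position estimate (m)
    v: float = 0.0  # velocity estimate (m/s)
    a: float = 1.0  # var(p)
    b: float = 0.0  # cov(p, v)
    c: float = 1.0  # var(v)

    def predict(self, dt, accel_std):
        # State: p <- p + v*dt, v unchanged.
        self.p += self.v * dt
        # Covariance: P <- F P F^T + Q, with white-noise-acceleration Q.
        q = accel_std ** 2
        a, b, c = self.a, self.b, self.c
        self.a = a + 2 * dt * b + dt * dt * c + q * dt ** 4 / 4
        self.b = b + dt * c + q * dt ** 3 / 2
        self.c = c + q * dt ** 2

    def update(self, z, meas_var):
        # Position-only measurement, H = [1, 0].
        s = self.a + meas_var              # innovation variance
        k0, k1 = self.a / s, self.b / s    # Kalman gain
        y = z - self.p                     # innovation
        self.p += k0 * y
        self.v += k1 * y
        a, b, c = self.a, self.b, self.c
        self.a = (1 - k0) * a              # P <- (I - K H) P
        self.b = (1 - k0) * b
        self.c = c - k1 * b


def track(measurements, label, dt=0.04, meas_var=0.01):
    """Filter position measurements for one axis; None means target out of view."""
    f = AxisFilter()
    estimates = []
    for z in measurements:
        f.predict(dt, ACCEL_STD[label])    # class-conditioned dynamics prior
        if z is not None:                  # skip the update when perception has no detection
            f.update(z, meas_var)
        estimates.append(f.p)
    return estimates
```

Passing `None` for a frame makes the filter coast on its motion model alone, which is one plausible reading of why a physics-aware tracker can bridge short out-of-view gaps that a purely mask-centered controller cannot. A full tracker would run one such filter per axis (or a joint 3D state) and feed it the position from the 6-DoF pose head.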