MVOFormer: Flow-Semantic Transformer for Robust Monocular Visual Odometry

2026-06-15 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Robotics

AI summary

The authors developed MVOFormer, a method that lets a robot estimate its own motion from a single camera. The system combines two types of information: dense motion cues and knowledge about objects in the scene, which helps it ignore moving distractions. It also refines the pose estimate step by step, from coarse to fine, while focusing on reliable parts of the image. Tests show the approach works well across different environments without fine-tuning on the target data.

Monocular Visual Odometry, Transformer, Flow-Semantic Encoding, Pose Estimation, Zero-Shot Generalization, Geometric Motion, Semantic Priors, Iterative Decoding, Robotic Localization, Dynamic Distractors

Authors

Jituo Li, Shunwang Sun, Jialu Zhang, Xinqi Liu, Jinyao Hu, Zhicheng Lu, Sajad Saeedi, Guodong Lu

Abstract

Monocular visual odometry (MVO) is foundational to autonomous navigation and robotic localization. However, existing learning-based MVO approaches often struggle with either a lack of interpretable, complementary features or overly complex multi-stage architectures. These limitations inherently restrict their robustness and cross-domain generalization. In this work, we propose MVOFormer, a novel transformer framework for robust monocular visual odometry. Our architecture features a Flow-Semantic Dual Branch Encoder that synergizes dense geometric motion cues with object-centric semantic priors, explicitly distinguishing static structures from dynamic distractors. These representations are then fused by an Iterative Multimodal Decoder, enabling coarse-to-fine pose refinement while dynamically suppressing attention on unreliable regions. Extensive evaluations demonstrate that, without any target-domain fine-tuning, MVOFormer achieves superior zero-shot generalization and robustness, significantly outperforming prior learning-based frame-to-frame methods across diverse benchmarks including TartanAir, KITTI, TUM-RGBD, and ETH3D-SLAM.
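
The abstract gives no implementation details, so the following is only a toy illustration of the general idea of iterative refinement that down-weights unreliable regions. It is not the authors' architecture. As loudly stated assumptions, the "pose" is reduced to a 2D image translation, the flow branch is replaced by synthetic per-token flow vectors, the semantic branch by a per-token probability of being static, and the transformer decoder by a softmax-weighted residual update with a shrinking temperature. The names `refine_translation`, `make_toy_scene`, `static_prior`, and `tau0` are hypothetical.

```python
import numpy as np


def refine_translation(flow, static_prior, n_iters=4, tau0=10.0):
    """Toy coarse-to-fine refinement of a 2D motion estimate.

    flow:         (N, 2) per-token flow vectors (stand-in for geometric cues)
    static_prior: (N,) values in [0, 1], probability that a token is static
                  (stand-in for semantic priors)
    Returns the estimated translation and the final attention weights.
    """
    t = np.zeros(2)  # initial motion estimate
    tau = tau0       # temperature; a large value gives a coarse, tolerant first pass
    for _ in range(n_iters):
        residual = flow - t  # disagreement of each token with the current estimate
        # Tokens that look dynamic (low prior) or disagree strongly get low logits.
        logits = np.log(static_prior + 1e-8) - np.sum(residual**2, axis=1) / tau
        logits -= logits.max()  # subtract the max for numerical stability
        weights = np.exp(logits)
        weights /= weights.sum()
        t = t + weights @ residual  # attention-weighted update of the estimate
        tau *= 0.5                  # sharpen attention at each step
    return t, weights


def make_toy_scene(seed=0):
    """Synthetic scene: 80 static tokens moving by (2, 1), 20 dynamic distractors."""
    rng = np.random.default_rng(seed)
    static_flow = np.array([2.0, 1.0]) + 0.05 * rng.standard_normal((80, 2))
    dynamic_flow = np.array([-5.0, 4.0]) + 0.5 * rng.standard_normal((20, 2))
    flow = np.vstack([static_flow, dynamic_flow])
    static_prior = np.concatenate([np.full(80, 0.9), np.full(20, 0.1)])
    return flow, static_prior


if __name__ == "__main__":
    flow, static_prior = make_toy_scene()
    t_est, weights = refine_translation(flow, static_prior)
    print("estimated translation:", t_est)
    print("total weight on dynamic tokens:", weights[80:].sum())
```

In this toy setting the estimate settles near the motion of the static tokens, and the weight assigned to the distractor tokens shrinks as the temperature decreases. A real system would regress a full 6-DoF relative pose and learn both the features and the attention from data.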
