GMOS: Grounding Moving Object Segmentation in 3D Space and Time
2026-05-28 • Computer Vision and Pattern Recognition
AI summary
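The authors focus on improving Moving Object Segmentation (MOS), the task of discovering, segmenting, and tracking objects that move independently of the camera. They point out that earlier methods rely on pre-computed 2D cues such as optical flow or point trajectories, which carry no 3D geometric information, and that they label motion once per sequence rather than tracking whether each object is moving at a given moment. To address this, the authors propose GMOS, a method that works directly on RGB video to segment multiple moving objects in a 3D-aware, temporally fine-grained way, along with a faster foreground-background variant, GMOS-S. They also built a new dataset (GMOS-2K, 2,210 real-world videos) and an evaluation protocol (MOS-I) with three complementary metrics. GMOS achieves state-of-the-art results on MOS, MOS-I, and Unsupervised VOS benchmarks, runs significantly faster than prior multi-object MOS methods, and supports online inference for streaming deployment.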
Moving Object Segmentation, 3D geometric information, Optical flow, Video Object Segmentation, Temporal motion, RGB video, Online inference, Multi-object tracking, Benchmark dataset
Authors
Junyu Xie, Tengda Han, Weidi Xie, Andrew Zisserman
Abstract
Moving Object Segmentation (MOS) aims to discover, segment, and track objects that move independently of the camera. Current MOS methods, however, exhibit two fundamental limitations: they rely on pre-computed 2D auxiliary modalities such as optical flow or point trajectories that lack 3D geometric information, and they treat motion as a sequence-level attribute, overlooking the instantaneous motion state of each object. We address both by grounding MOS in 3D space and time, and propose GMOS, a framework that operates directly on RGB video to produce 3D-aware, temporally fine-grained segmentation of multiple moving objects, alongside a foreground-background variant GMOS-S for faster deployment. To support training and evaluation in this regime, we curate GMOS-2K, a dataset of 2,210 real-world videos with per-object temporal motion annotations drawn from five established Video Object Segmentation (VOS) benchmarks, and formalise MOS-I ("I" for instantaneous), a temporally fine-grained evaluation protocol with three complementary metrics. GMOS achieves state-of-the-art results across MOS, MOS-I, and Unsupervised VOS benchmarks, while running significantly faster than prior multi-object MOS methods and supporting online inference for streaming deployment.
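The distinction the abstract draws between motion as a sequence-level attribute and the instantaneous motion state of each object can be illustrated with a small sketch. Everything here is an assumption made for illustration: the per-frame boolean flags, the object names, and the helper functions `sequence_level_labels` and `instantaneous_moving_sets` are hypothetical and are not the GMOS-2K annotation format or any of the three MOS-I metrics, which the abstract does not specify.

```python
from typing import Dict, List

# Hypothetical annotation format: for each object, one flag per frame
# (True = the object is moving in that frame). Not the GMOS-2K format.
MotionFlags = Dict[str, List[bool]]


def sequence_level_labels(flags: MotionFlags) -> Dict[str, bool]:
    """Sequence-level view: an object is 'moving' if it moves in any frame."""
    return {obj: any(per_frame) for obj, per_frame in flags.items()}


def instantaneous_moving_sets(flags: MotionFlags) -> List[List[str]]:
    """Instantaneous view: the sorted names of objects moving in each frame."""
    num_frames = len(next(iter(flags.values()))) if flags else 0
    return [
        sorted(obj for obj, per_frame in flags.items() if per_frame[t])
        for t in range(num_frames)
    ]


# Toy four-frame clip: the car starts moving late, the person stops early,
# and the bench never moves.
flags: MotionFlags = {
    "car": [False, False, True, True],
    "person": [True, True, True, False],
    "bench": [False, False, False, False],
}

print(sequence_level_labels(flags))
print(instantaneous_moving_sets(flags))
```

In this toy clip the sequence-level view marks both the car and the person as moving for the whole video, while the per-frame view records that only the person moves in the first two frames and only the car moves in the last one. That per-frame information is what a temporally fine-grained protocol would need to evaluate.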