SUMO: Segment and Track Any Motion with Nonlinear State Space Models

2026-06-29 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors present SUMO, a zero-shot, training-free method for tracking and segmenting moving objects in videos. It combines a nonlinear motion model inspired by robotics with vision-based segmentation to better handle complex, nonlinear object motion. A Selective Unscented Filter scores and fuses predictions from multiple sources to estimate the most plausible object state over time, and a memory selection mechanism evaluates how reliable the stored memory frames are. The authors report state-of-the-art performance on both tracking and segmentation tasks.

Visual Object Tracking, Moving Object Segmentation, State Space Model, Nonlinear Dynamics, Unscented Filter, Zero-shot learning, Object tracking, Segmentation, Temporal object dynamics, Memory selection mechanism
Authors
Kexin Tian, Sixu Li, Keshu Wu, Yang Zhou, Zhengzhong Tu
Abstract
Visual Object Tracking (VOT) and Moving Object Segmentation (MOS) are two fundamental tasks in computer vision that involve both spatial and temporal object dynamics. Existing methods rely predominantly on visual cues and thus often falter in real-world scenarios where object motions are inherently complex and nonlinear. To address this limitation, we propose SUMO, a zero-shot, training-free, unified framework integrating nonlinear dynamics with vision-based segmentation for accurate and consistent VOT and MOS. Specifically, we develop a nonlinear State Space Model (SSM) inspired by robotics principles to capture the complex object dynamics. Building on this model, we propose a Selective Unscented Filter (SUF) for accurate state estimation, which features a joint scoring mechanism and dynamically fuses multi-source predictions to identify the most plausible object state over time. Furthermore, we apply a memory selection mechanism to evaluate the reliability of memory frames. Our extensive experimental results show that SUMO achieves state-of-the-art performance on both VOT and MOS tasks.
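
To illustrate the kind of estimation the abstract refers to, here is a minimal sketch of the prediction step of a generic unscented filter applied to a nonlinear motion model. It is not the authors' SSM or their Selective Unscented Filter: the unicycle-style model, the state layout `[x, y, heading, speed]`, the `kappa` parameter, and all function names are assumptions chosen for illustration, and the joint scoring and multi-source fusion described in the abstract are not shown.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a symmetric positive-definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def unicycle_step(state, dt=1.0):
    """Hypothetical nonlinear motion model: move along the current heading."""
    x, y, theta, v = state
    return [x + v * math.cos(theta) * dt, y + v * math.sin(theta) * dt, theta, v]


def unscented_predict(mean, cov, f, kappa=1.0):
    """Propagate a Gaussian (mean, cov) through f using 2n + 1 sigma points."""
    n = len(mean)
    scale = n + kappa
    # Matrix square root of the scaled covariance; its columns offset the mean.
    root = cholesky([[scale * c for c in row] for row in cov])
    sigma = [list(mean)]
    for j in range(n):
        col = [root[i][j] for i in range(n)]
        sigma.append([m + c for m, c in zip(mean, col)])
        sigma.append([m - c for m, c in zip(mean, col)])
    weights = [kappa / scale] + [0.5 / scale] * (2 * n)

    moved = [f(s) for s in sigma]
    # Weighted sample mean and covariance of the propagated points.
    # Note: the heading is averaged naively, which is only reasonable
    # when its spread is small and far from the +/- pi wrap-around.
    new_mean = [sum(w * p[i] for w, p in zip(weights, moved)) for i in range(n)]
    new_cov = [
        [
            sum(w * (p[i] - new_mean[i]) * (p[j] - new_mean[j])
                for w, p in zip(weights, moved))
            for j in range(n)
        ]
        for i in range(n)
    ]
    return new_mean, new_cov


if __name__ == "__main__":
    mean = [0.0, 0.0, 0.0, 1.0]  # at the origin, heading along +x, unit speed
    cov = [[0.01, 0, 0, 0], [0, 0.01, 0, 0], [0, 0, 0.04, 0], [0, 0, 0, 0.01]]
    pred_mean, pred_cov = unscented_predict(mean, cov, unicycle_step)
    print(pred_mean)
    print(pred_cov[0][0], pred_cov[1][1])
```

In this example the predicted x position comes out slightly below 1.0 even though the nominal state moves exactly one unit: uncertainty in the heading shortens the expected displacement along x, an effect that a purely linear propagation of the mean would miss.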