Rethinking Object-Centric Representations for Video Dynamics Modeling

2026-06-22 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition • Artificial Intelligence • Machine Learning

AI summary

The authors study how to track objects in videos without any labels by representing each object with one of a fixed set of latent variables called slots. They identify a failure mode: when a slot mixes an object's appearance with its position, forcing slots to stay consistent over time conflicts with object motion, so slots tend to settle on static regions such as the background while moving objects are split across several slots or swap identities. Their method, STAITUS, separates each slot into appearance and geometric pose (position and scale), enforces spatial separation within each frame, and applies temporal alignment only to appearance, which helps keep object identities stable under motion, occlusion, and object entry or exit. They also introduce an adaptive gating mechanism that adjusts the number of active slots to match the scene's complexity, reducing over-segmentation. Experiments on synthetic and real-world benchmarks show that the method outperforms prior baselines in both segmentation quality and tracking stability.

unsupervised video object tracking • slot-based representations • temporal consistency • appearance-pose disentanglement • spatial separation • temporal alignment • mask segmentation • adaptive gating • object identity • over-segmentation

Authors

Amaury Wei, Ismail Nejjar, Olga Fink

Abstract

Unsupervised video object tracking aims to decompose dynamic scenes into persistent, object-centric entities without manual annotations. Many recent approaches rely on slot-based representations, where a fixed set of latent variables ("slots") represent individual objects across frames. To preserve object identity, these models enforce temporal consistency on slot embeddings. However, when appearance and pose are entangled, this consistency objective conflicts with object motion and viewpoint changes. As a result, slots tend to lock onto static regions (e.g., background) to satisfy the consistency objective, while foreground objects become fragmented across multiple slots or frequently swap identities. To address these limitations, we propose STAITUS, a unified framework that explicitly disentangles each slot into appearance and geometric pose (position/scale). Leveraging this disentanglement, STAITUS enforces within-frame spatial separation and applies temporal alignment only in appearance space, yielding sharper masks and more persistent identities under motion, occlusion, and object entry/exit. Furthermore, to mitigate over-segmentation, we introduce an adaptive gating mechanism that dynamically adjusts the number of active slots to match scene complexity. Extensive experiments on synthetic and real-world benchmarks demonstrate that STAITUS substantially outperforms state-of-the-art baselines in segmentation quality and tracking stability.
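
The ideas named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the slot layout (pose stored as the last three entries, read as x, y, scale), the cosine-based alignment loss, the hinge-style separation penalty, the sigmoid gate, and every function name and default value below are assumptions chosen only to make the concepts concrete.

```python
import numpy as np

def split_slots(slots, pose_dim=3):
    """Split slot vectors into (appearance, pose).
    Assumption: the last pose_dim entries hold (x, y, scale)."""
    return slots[..., :-pose_dim], slots[..., -pose_dim:]

def appearance_alignment_loss(app_prev, app_next):
    """Mean (1 - cosine similarity) between each slot's appearance
    in two consecutive frames. Pose is deliberately not included."""
    a = app_prev / (np.linalg.norm(app_prev, axis=-1, keepdims=True) + 1e-8)
    b = app_next / (np.linalg.norm(app_next, axis=-1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))

def spatial_separation_penalty(pose, margin=0.2):
    """Hinge penalty on slot pairs within one frame whose centers
    are closer than margin (hypothetical form of the separation term)."""
    xy = pose[:, :2]
    rows, cols = np.triu_indices(len(xy), k=1)
    if rows.size == 0:  # fewer than two slots: nothing to separate
        return 0.0
    dist = np.linalg.norm(xy[rows] - xy[cols], axis=-1)
    return float(np.mean(np.maximum(0.0, margin - dist)))

def active_slots(gate_logits, threshold=0.5):
    """Boolean mask of slots whose sigmoid gate exceeds threshold
    (hypothetical stand-in for an adaptive gating mechanism)."""
    gates = 1.0 / (1.0 + np.exp(-gate_logits))
    return gates > threshold

# Toy example: 4 slots with 8 appearance dims and 3 pose dims.
rng = np.random.default_rng(0)
slots_t = rng.normal(size=(4, 11))
slots_t1 = slots_t.copy()
slots_t1[:, -3:-1] += 0.1  # every object moves; appearance is unchanged

app_t, pose_t = split_slots(slots_t)
app_t1, pose_t1 = split_slots(slots_t1)
loss = appearance_alignment_loss(app_t, app_t1)
```

In this toy example the objects move between frames, yet the alignment loss stays near zero because it only compares appearance. A consistency term computed on the full slot vector would instead penalize the motion itself, which is the conflict the abstract describes.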
