Polycepta: Object-Centric Appearance Estimation for Multi-Object Tracking

2026-06-22 • Computer Vision and Pattern Recognition

Computer Vision and Pattern RecognitionArtificial Intelligence

AI summaryⓘ

The authors propose Polycepta, a new way to track multiple objects by remembering how each object looks over time instead of trying to identify it independently in each video frame. Instead of using static images, Polycepta updates an appearance model for each object as more visual data comes in, improving the recognition of that object. This approach helps the system work better, even for objects it hasn't seen before, and reduces mistakes where the identity of objects gets mixed up. They tested Polycepta on popular datasets and found it runs fast and improves tracking accuracy compared to traditional methods.

multi-object tracking (MOT)tracking-by-detectionappearance descriptorsrecursive estimationidentity switchesmotion predictionobject representationKITTI datasetRobMOTMOTA metric

Authors

Mohamed Nagy, Naoufel Werghi, Jorge Dias, Majid Khonji

Abstract

The tracking-by-detection paradigm in multi-object tracking (MOT) typically relies on static appearance descriptors to complement motion estimation. However, these descriptors are frame-independent, limiting their robustness as visual cues. Since such descriptors are often obtained from computationally intensive pretrained backbones, real-time MOT systems frequently abandon appearance cues altogether and rely solely on motion prediction and geometric association. In this work, we introduce Polycepta, an object-centric appearance state estimation framework that reformulates appearance modeling as a recursive estimation problem rather than a frame-wise matching task. Polycepta constructs and continuously updates an independent appearance state for each tracked object, enabling future appearance representations to be estimated from accumulated observations. Polycepta is encouraged to learn the appearance-state construction of object-specific representations rather than memorize them through a proposed learning strategy, enabling appearance estimation for unseen classes. A key property of Polycepta is that the quality of appearance estimation improves as object states evolve during inference. While conventional appearance descriptors remain static or degrade over time, Polycepta progressively refines appearance estimates as additional observations are accumulated. Extensive experiments on KITTI, the Waymo Open Dataset, and MOT17 demonstrate consistent reductions in identity switches and improvements in tracking performance when integrated into the tracking-by-detection pipelines. Polycepta operates at 90.57 Hz and delivers state-of-the-art performance on the KITTI benchmark when integrated into the RobMOT framework, achieving a MOTA of 92.27\%.

View PDFOpen arXiv