Clip-level Uncertainty and Temporal-aware Active Learning for End-to-End Multi-Object Tracking
2026-05-11 • Computer Vision and Pattern Recognition
AI summary
The authors focus on tracking multiple moving objects in videos, which requires following objects over time to keep their identities consistent. They note that training modern tracking models demands large amounts of labeled data, which is expensive to obtain. To reduce labeling cost, they propose a method called CUTAL that selects short video clips to label instead of individual frames, scoring clips by prediction uncertainty and enforcing temporal diversity among the selections. CUTAL outperforms previous approaches at equal label budgets and, for MeMOTR, matches fully supervised results on both tested datasets using only half of the labeled training data.
Multi-Object Tracking · Transformer · Active Learning · Temporal Reasoning · Uncertainty · Clip-level Labeling · Bounding-box Annotation · Identity Annotation · Temporal Diversity · MeMOTR
Authors
Riku Inoue, Shogo Sato, Kazuhiko Murasaki, Tomoyasu Shimada, Toshihiko Nishimura, Ryuichi Tanida
Abstract
Multi-Object Tracking (MOT) in dynamic environments relies on robust temporal reasoning to maintain consistent object identities over time. Transformer-based end-to-end MOT models achieve strong performance by explicitly modeling temporal dependencies, yet training them requires extensive bounding-box and identity annotations. Given the high labeling cost and strong redundancy in videos, Active Learning (AL) is an effective approach to improving annotation efficiency. However, existing AL methods for MOT primarily operate at the frame level, which is structurally misaligned with modern end-to-end trackers whose inference and training rely on multi-frame clips. To bridge this gap, we formulate clip-level active learning and propose Clip-level Uncertainty and Temporal-aware Active Learning (CUTAL). In contrast to frame-based approaches, CUTAL scores each clip using uncertainty metrics derived from multi-frame predictions to capture inter-frame correspondence ambiguities, while enforcing temporal diversity to select an informative and non-redundant subset. Experiments show that CUTAL achieves stronger overall performance than baselines at the same label budgets across MeMOTR and SambaMOTR. Notably, CUTAL achieves performance comparable to full supervision for MeMOTR on both datasets using only 50% of the labeled training data.
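The abstract describes a two-part selection rule: score candidate clips by uncertainty aggregated over multi-frame predictions, then pick a high-scoring subset while keeping the chosen clips temporally spread out. The sketch below is a minimal, hypothetical illustration of that kind of greedy budgeted selection, not the paper's implementation; the `Clip` fields, the `min_gap` diversity threshold, and the assumption that per-clip uncertainty scores are precomputed from the tracker's outputs are all assumptions for illustration.

```python
# Hypothetical sketch of clip-level active learning selection in the spirit of
# CUTAL. Names and thresholds are illustrative, not the authors' code.
from dataclasses import dataclass


@dataclass
class Clip:
    video_id: str       # which video the clip comes from
    start_frame: int    # index of the clip's first frame
    length: int         # number of frames in the clip
    uncertainty: float  # assumed precomputed clip-level uncertainty score


def temporally_diverse(selected: list, candidate: Clip, min_gap: int) -> bool:
    """Reject a candidate that starts too close in time to an
    already-selected clip from the same video (temporal diversity)."""
    for clip in selected:
        if (clip.video_id == candidate.video_id
                and abs(clip.start_frame - candidate.start_frame) < min_gap):
            return False
    return True


def select_clips(pool: list, budget: int, min_gap: int = 100) -> list:
    """Greedily take the most uncertain clips, skipping any that violate
    the temporal-diversity constraint, until the label budget is spent."""
    selected = []
    for clip in sorted(pool, key=lambda c: c.uncertainty, reverse=True):
        if len(selected) >= budget:
            break
        if temporally_diverse(selected, clip, min_gap):
            selected.append(clip)
    return selected
```

In this reading, the budget counts clips rather than frames, which is the structural point the abstract makes: the unit of annotation matches the multi-frame unit that end-to-end trackers train and infer on.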