TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
2026-05-11 • Computer Vision and Pattern Recognition
AI summary
The authors point out that current video language models are good at understanding videos generally but struggle to keep track of the same object over time, especially when it moves, disappears, or changes. To test this better, they created TOC-Bench, a special test set focused on whether models can consistently follow objects through different video frames. They carefully filtered and verified questions to make sure answers rely on understanding the sequence of visual events. Their tests show that existing models still have big problems with tracking objects consistently and understanding events in the right order.
Video large language models • Temporal object consistency • Object tracking • Video understanding benchmarks • Temporal reasoning • Event counting • Event ordering • QA pairs • Object-centric coherence
Authors
Junzhe Chen, Siyuan Meng, Yuxi Chen, Man Zhao, Xiaojie Guo
Abstract
Video large language models (Video-LLMs) have achieved remarkable progress in general video understanding, yet their ability to maintain temporal object consistency remains insufficiently explored. Existing benchmarks primarily focus on event recognition, action understanding, or coarse temporal reasoning, but rarely evaluate whether a model can consistently preserve the identity, state, and temporal continuity of the same object across occlusion, disappearance, reappearance, state transitions, and cross-object interactions. As a result, current evaluations may overestimate temporal reasoning ability while overlooking failures in object-centric temporal coherence. To address this issue, we introduce TOC-Bench, a diagnostic benchmark specifically designed to evaluate temporal object consistency in Video-LLMs. TOC-Bench is explicitly object-track grounded: each queried subject is associated with a per-frame object trajectory and a structured temporal event timeline. To ensure that benchmark items depend on temporally ordered visual evidence rather than language priors, single-frame shortcuts, or unordered frame cues, we propose a three-layer temporal-necessity filtering protocol that removes 60.7% of candidate QA pairs and retains 17,900 temporally dependent items spanning 10 diagnostic dimensions. From this filtered pool, we further construct a human-verified benchmark containing 2,323 high-quality QA pairs over 1,951 videos. Experiments on representative Video-LLMs show that temporal object consistency remains a major unsolved challenge: current models exhibit substantial weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, despite strong performance on general video understanding benchmarks.
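To make the filtering idea concrete, the sketch below outlines how a three-layer temporal-necessity filter of the kind described in the abstract could be structured: an item is kept only if it cannot be answered from language priors alone, from a single frame, or from frames with their order destroyed. All function names (answerable_from_text, answerable_from_single_frame, answerable_from_shuffled_frames) are hypothetical placeholders; the paper's actual criteria and implementation may differ.

```python
# Minimal sketch of a three-layer temporal-necessity filter, assuming three
# hypothetical checker functions supplied by the caller. This is illustrative
# only and not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAItem:
    video_id: str
    question: str
    answer: str


def temporal_necessity_filter(
    candidates: List[QAItem],
    answerable_from_text: Callable[[QAItem], bool],             # language-prior shortcut
    answerable_from_single_frame: Callable[[QAItem], bool],     # single-frame shortcut
    answerable_from_shuffled_frames: Callable[[QAItem], bool],  # unordered-frame shortcut
) -> List[QAItem]:
    """Keep only items that require temporally ordered visual evidence."""
    kept = []
    for item in candidates:
        if answerable_from_text(item):
            continue  # layer 1: solvable from the question text / language priors alone
        if answerable_from_single_frame(item):
            continue  # layer 2: solvable from some single frame
        if answerable_from_shuffled_frames(item):
            continue  # layer 3: solvable even when frame order is destroyed
        kept.append(item)
    return kept
```

Under this reading, the 60.7% removal rate reported in the abstract would correspond to candidate items caught by at least one of the three layers.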