Chronological Blindness: Benchmarking Temporal Reasoning in Vision-Language Models with CHRONOSIGHT
2026-06-15 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition
AI summaryⓘ
The authors created CHRONOSIGHT, a test to see how well large vision-language models understand the timing and order of events in pictures, like whether fruit is ripening or rotting. They tested various models and found humans understand these timelines much better than the models do, showing a big gap called 'chronological blindness.' With some additional fine-tuning, one model improved on time estimation tasks, indicating that the challenge might be more about following instructions than recognizing images. The benchmark covers many types of changes happening over different timescales, from minutes to thousands of years.
vision-language modelstemporal reasoningchronological orderingfine-tuningtime elapsed estimationvisual perceptionbenchmarkpromptingzero-shot learning
Authors
Parthaw Goswami, Jaynto Goswami Deep
Abstract
Human perception of visual scenes is inherently temporal. We instinctively recognise whether a fruit is ripening or rotting, whether construction is progressing or being demolished, and approximately how much time separates two photographs of the same subject. Whether large vision-language models (VLMs) share this competence remains an open and practically important question. We introduce CHRONOSIGHT, a rigorously controlled benchmark evaluating five dimensions of visual temporal reasoning: CHRONORANK (chronological ordering of image sequences), CHRONOLOCATE (ordinal stage localisation from a single image), CHRONODELTA (estimation of time elapsed between two images on a logarithmic scale), CHRONOREVERSE (detection of temporally reversed sequences), and CHRONOODD (identification of a temporal outlier within a set). The benchmark comprises 1{,}000 items across eight process families (biological growth, food transformation, physical weathering, construction, environmental change, human ageing, astronomical phenomena, and urban dynamics) spanning timescales from minutes to millennia. We evaluate eight open-source VLMs (500 M to 19 B parameters) under two prompting regimes and collect human performance baselines. Human performance averages 0.89 across tasks; the best open model (Qwen2.5-VL-7B) reaches 0.40 under direct prompting, a gap we term chronological blindness. Lightweight LoRA fine-tuning on 151 examples raises CHRONODELTA accuracy from near-zero to 0.43, transferring zero-shot to related tasks (CHRONOODD: 0.37; CHRONOREVERSE: 0.64)suggesting the bottleneck is partly instruction following rather than visual perception. Benchmark, code, and predictions will be released upon acceptance.