TimeVista: Exploring and Exploiting Vision-Language Models as Judges for Time Series Forecasting
2026-06-15 • Artificial Intelligence
AI summary
The authors address the problem that common ways to check how good time series forecasts are don’t match how humans naturally judge them. They use vision-language models (VLMs), which understand images and text together, to act like judges that look at time series plots and decide how good the forecasts are based on detailed rules. They created a benchmark called TimeVista with thousands of examples and showed that VLMs agree with human opinions better than usual metrics. They then tested advanced time series models using their new approach, showing that VLMs provide a more human-aligned and reliable evaluation.
time series forecasting, point-wise metrics, vision-language models, TimeVista benchmark, evaluation rubrics, time series foundation models, meta-evaluation, human-aligned evaluation, LLM-as-a-Judge, contextual judgment
Authors
Zhi Chen, Yuxuan Wang, Jialong Wu, Yong Liu, Haoran Zhang, Xingjian Su, Jianmin Wang, Mingsheng Long
Abstract
High-quality time series forecasting is pivotal for real-world decision-making. However, traditional point-wise metrics often fail to reveal complex temporal patterns and align poorly with human intuitive preferences. While the "LLM-as-a-Judge" paradigm has revolutionized text evaluation by providing flexible, human-aligned judgment, its application to time series remains largely unexplored. In this paper, we leverage Vision-Language Models (VLMs) as judges for time series forecasting, harnessing their ability to comprehend time series plots grounded in textual information. Specifically, we propose a novel framework integrating micro- and macro-level judgments informed by contextual information to evaluate time series forecasting. To this end, we introduce TimeVista, a comprehensive VLM-as-a-Judge benchmark comprising 5563 time series samples paired with detailed evaluation rubrics. Extensive meta-evaluations demonstrate that VLMs are highly reliable judges, achieving significantly higher consistency with human preferences than conventional metrics. Building upon our benchmark, we comprehensively assess recent Time Series Foundation Models (TSFMs) under the VLM-as-a-Judge paradigm. Our results demonstrate that VLMs serve as robust and interpretable judges, providing a comprehensive, human-aligned standard for evaluating time series models.