Why Do Time Series Models Need Long Context Windows?
2026-06-01 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors explain that forecasting multiple time series has two main goals: understanding what process created the data (generative process identification) and predicting future values based on that data (conditional forecasting). They argue that using longer input windows helps because it reduces uncertainty about which process is generating the series, not just because it captures long-range patterns. They prove that to get the best predictions, you need input windows longer than the actual memory of the process. Their experiments show that separating the two goals can make forecasting more efficient without losing accuracy.
time series forecasting, generative process identification, conditional forecasting, input window size, long-range dependencies, memory length, deep learning models, computational scalability, forecasting accuracy
Authors
Luca Butera, Giovanni De Felice, Andrea Cini, Cesare Alippi
Abstract
Modern deep learning models for forecasting groups of time series rely on increasingly long observation windows. However, the benefit of increasing the window size is often attributed simply to capturing long-range dependencies, and broader discussion of how global forecasting models leverage input observations has been limited. In this paper, we show that forecasting groups of time series involves two objectives: (i) generative process identification (GPI), i.e., inferring the specific process generating the input sequence, and (ii) conditional forecasting (CF), i.e., predicting future values given input observations. From this perspective, optimal predictions can be interpreted as an average over plausible data-generating processes, weighted by their likelihood given the input window. This suggests another explanation for the benefits of long context windows: they reduce the uncertainty about which specific process is generating the input time series during operation. We prove that even for processes with memory length $P$, an input window size strictly larger than $P$ is necessary to achieve the minimum attainable error. Finally, we show how decoupling GPI and CF can improve computational scalability without compromising accuracy. Experiments on synthetic and real-world data validate our insights and their relevance for designing forecasting architectures.
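The weighted-average view of optimal prediction can be illustrated with a small toy example. The sketch below is not the authors' experimental setup; it is a minimal illustration under assumptions chosen here: two hypothetical AR(1) processes (coefficients in `PHIS`, so memory length $P = 1$), unit-variance Gaussian noise, and a uniform prior over processes. For a window of size $W$, the forecast is $\sum_k p(k \mid x_{t-W+1:t})\, \phi_k\, x_t$, where the posterior weights handle process identification and $\phi_k x_t$ is the per-process conditional forecast. All function names are illustrative.

```python
import numpy as np

# Hypothetical AR(1) coefficients for two candidate generating processes.
PHIS = [0.9, -0.5]


def simulate_ar1(phi, length, rng, sigma=1.0):
    """Draw a stationary AR(1) series x[t] = phi * x[t-1] + noise."""
    x = np.empty(length)
    x[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - phi**2))
    for t in range(1, length):
        x[t] = phi * x[t - 1] + rng.normal(0.0, sigma)
    return x


def process_posterior(window, phis, sigma=1.0):
    """Posterior over candidate processes given the window (uniform prior)."""
    window = np.asarray(window, dtype=float)
    phis = np.asarray(phis, dtype=float)
    # Log-likelihood of the first observation under each stationary marginal.
    var0 = sigma**2 / (1.0 - phis**2)
    loglik = -0.5 * (np.log(2 * np.pi * var0) + window[0] ** 2 / var0)
    # Log-likelihood of each one-step transition inside the window.
    resid = window[1:, None] - phis[None, :] * window[:-1, None]
    loglik = loglik - 0.5 * np.sum(
        np.log(2 * np.pi * sigma**2) + resid**2 / sigma**2, axis=0
    )
    loglik = loglik - loglik.max()  # stabilise before exponentiating
    post = np.exp(loglik)
    return post / post.sum()


def mixture_forecast(window, phis, sigma=1.0):
    """One-step forecast: posterior-weighted average of per-process forecasts."""
    post = process_posterior(window, phis, sigma)
    return float(np.sum(post * np.asarray(phis, dtype=float) * window[-1]))


def mean_squared_error(window_size, phis, n_series=2000, seed=0):
    """Monte Carlo estimate of the one-step MSE for a given window size."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_series):
        phi = rng.choice(phis)  # the true process is unknown to the forecaster
        x = simulate_ar1(phi, window_size + 1, rng)
        errors.append((x[-1] - mixture_forecast(x[:-1], phis)) ** 2)
    return float(np.mean(errors))


if __name__ == "__main__":
    for w in (1, 5, 50):
        print(f"window={w:3d}  MSE={mean_squared_error(w, PHIS):.3f}")
```

In this toy setting, each process only needs the last observation to forecast, yet a window of size 1 leaves the forecaster unsure which process it is facing. Longer windows sharpen the posterior, so the estimated error should move toward the noise variance even though no long-range dependency is present.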