Can LLM Coding Agents Reason About Time Series?
2026-06-15 • Computation and Language
AI summary
The authors studied how well large language models (LLMs) can analyze time series data, which is common in fields such as finance and healthcare but hard to process automatically. They compared three setups: giving the model the raw numerical data, letting it write and run Python code to query the data, or combining both. On two time series understanding benchmarks, agents with code access outperformed models reading raw numbers by up to 10%, yet even the best agent still answered about 22-34% of the questions incorrectly. An analysis with an LLM judge found that coding agents select appropriate statistical tests but often miss important nuances, while models given raw data can reach correct conclusions with back-of-the-envelope calculations.
Large Language Models, Time Series Data, Automated Decision-Making, Python Coding Agent, Statistical Tests, Benchmarking, Data Analysis, Model Evaluation, Reasoning Gaps, Environmental Monitoring
Authors
Filip Rechtorík, Ondřej Dušek, Zdeněk Kasner
Abstract
Large language models (LLMs) are increasingly being used for automated decision-making systems in finance, healthcare, or environmental monitoring. Time series data are ubiquitous in these fields, yet hard to process automatically. Can time series be analyzed by LLM agents? We examine three approaches: providing the agent with raw numerical data, using the LLM as a coding agent, or a combination of both. In the coding agent setup, the model iteratively queries the data using Python code. Using two time series understanding benchmarks, we show that agents with code access can outperform models processing raw data by up to 10%. However, even the best performing agent still answers about 22-34% of the questions incorrectly. To get insights into models' strategies and reasoning gaps, we analyze the model outputs with a strong LLM judge. Our analysis reveals that coding agents can select appropriate statistical tests, but often miss important nuances. Meanwhile, models with access to raw data can reach the right conclusions using back-of-the-envelope calculations.
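To make the contrast between the two strategies concrete, the following is a minimal illustrative sketch, not the authors' code or benchmark data. It assumes a synthetic series with a known upward trend and answers the question "is there a trend?" in two ways: a back-of-the-envelope comparison of half means, of the kind a model reading raw numbers might do, and a least-squares slope with a t-statistic, of the kind a coding agent might compute. The function names `envelope_trend` and `ols_slope_t` and all parameters are hypothetical.

```python
import math
import random
import statistics


def envelope_trend(series):
    """Back-of-the-envelope check: mean of the second half minus mean of the first."""
    half = len(series) // 2
    return statistics.fmean(series[half:]) - statistics.fmean(series[:half])


def ols_slope_t(series):
    """Least-squares slope of value against time index, plus its t-statistic."""
    n = len(series)
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(series)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, series))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, series))
    stderr = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, slope / stderr


# Synthetic data (an assumption for illustration): linear trend plus Gaussian noise.
rng = random.Random(0)
series = [0.05 * t + rng.gauss(0, 1.0) for t in range(200)]

diff = envelope_trend(series)
slope, t_stat = ols_slope_t(series)
print(f"half-mean difference: {diff:.2f}")
print(f"OLS slope: {slope:.3f}, t-statistic: {t_stat:.1f}")
```

Both approaches agree on this easy case. The t-statistic assumes independent residuals, which real time series frequently violate through autocorrelation; overlooking such an assumption is one example of the kind of nuance a test can hide even when the test itself is a reasonable choice.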