Generating Statistical Charts with Validation-Driven LLM Workflows

2026-05-01

Machine Learning
AI summary

The authors developed a step-by-step workflow that helps large language models (LLMs) generate accurate, readable charts from tabular data. The workflow breaks chart creation into stages such as selecting suitable data, writing the chart code, checking the rendered result, and composing questions about the chart. Applied to many datasets, it produced thousands of charts with matching code and questions, which the authors use to test how well multimodal LLMs understand charts. The results show that some question types are nearly solved by these models, while those requiring deeper reasoning remain difficult.

Large Language Models, Data Visualization, Chart Generation, Multimodal Learning, Dataset Screening, Code Synthesis, Rendered Output Validation, Question-Answer Pairs, UCI Datasets, Chart Semantics
Authors
Pavlin G. Poličar, Andraž Pevcin, Blaž Zupan
Abstract
Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely provide fully aligned artifacts, such as executable code, dataset context, and question-answer pairs. We present a structured LLM-based workflow that decomposes chart generation into dataset screening, plot proposal, code synthesis, rendering, validation-driven refinement, description generation, and question-answer generation. By incorporating rendered-output validation, the workflow addresses visualization-specific failure modes such as readability and semantic mismatch. It treats chart generation as an inspectable process rather than a one-shot prompt-to-code task, retaining each chart with its code, dataset context, description, and question-answer pairs. Applied to UCI datasets, the workflow produces 1,500 charts from 74 datasets, spanning 24 chart families and paired with 30,003 question-answer pairs. We evaluate 16 multimodal LLMs (MLLMs) on these chart-question pairs. The results show that chart-syntax questions are nearly saturated, while value extraction, comparison, and reasoning remain more challenging, illustrating the workflow's utility for diagnostic studies of chart-grounded multimodal reasoning.
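The abstract describes a staged, validation-driven pipeline. The sketch below illustrates how such a loop could be wired together; it is a minimal illustration, not the authors' implementation. The callables `llm`, `render`, and `validate`, the prompts, and the `ChartArtifact` container are all assumptions introduced for this example, and dataset screening is assumed to have happened upstream.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical container bundling the aligned artifacts the workflow retains
# for each chart: dataset context, code, rendered image, description, QA pairs.
@dataclass
class ChartArtifact:
    dataset_context: str
    plot_proposal: str = ""
    code: str = ""
    image_path: str = ""
    description: str = ""
    qa_pairs: list[str] = field(default_factory=list)

def generate_chart(dataset_context: str,
                   llm: Callable[[str], str],
                   render: Callable[[str], str],
                   validate: Callable[[str, str], tuple[bool, str]],
                   max_refinements: int = 3) -> ChartArtifact:
    """Validation-driven chart generation for one screened dataset.

    `llm` maps a prompt to text, `render` executes plotting code and returns
    an image path, and `validate` judges the rendered image against the
    proposal, returning (ok, feedback). All of these are assumed interfaces.
    """
    art = ChartArtifact(dataset_context=dataset_context)

    # 1. Plot proposal: decide which chart to draw for this dataset.
    art.plot_proposal = llm(f"Propose a statistical chart for:\n{dataset_context}")

    # 2. Code synthesis: turn the proposal into executable plotting code.
    art.code = llm(f"Write plotting code for this proposal:\n{art.plot_proposal}")

    # 3-4. Render, then validate the *rendered output*, refining on failure.
    for _ in range(max_refinements):
        art.image_path = render(art.code)
        ok, feedback = validate(art.image_path, art.plot_proposal)
        if ok:
            break
        art.code = llm(
            f"Revise this plotting code. Validator feedback: {feedback}\n{art.code}"
        )

    # 5. Description and question-answer generation over the accepted chart.
    art.description = llm(f"Describe the chart produced by:\n{art.code}")
    art.qa_pairs = llm(
        f"Write question-answer pairs about this chart:\n{art.description}"
    ).splitlines()

    return art
```

The key design point reflected here is that validation operates on the rendered image rather than on the code or data alone, which is how readability and semantic-mismatch failures become visible and drive refinement.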