SciIR: A Large-scale Training Dataset and Benchmark for Scientific Image Reasoning Generation
2026-06-29 • Computer Vision and Pattern Recognition
AI summary
The authors created a large new dataset called SciIR-82k with over 80,000 scientific images and their descriptions to help train models that make scientific pictures. They broke down scientific reasoning into three parts: the things shown (Entity Structure), the steps or processes (Scientific Process), and the underlying rules (Scientific Law). They also designed tests, called SciIR-Bench, to check if AI models understand these parts well. Their experiments showed that current AI models struggle with scientific reasoning, but by training on their new dataset, their improved model did better at making accurate scientific images.
Text-to-Image models, Scientific image generation, Peirce's Semiotic Triad, Entity Structure, Scientific Process, Scientific Law, Dataset creation, Scientific Reasoning, Chain-of-Thought, Model evaluation
Authors
Zhiyuan Ma, Zhengfeng Shi, Yuning An, Peize Li, Jiabao Wei, Ruijie Li, Junhao Xiao, Jianjun Li, Bowen Zhou
Abstract
While Text-to-Image (T2I) models have shown remarkable success in generating photorealistic visual content, they still struggle with the rigorous semantic alignment and logical reasoning required for scientific imagery. Inspired by Peirce's Semiotic Triad, we introduce Scientific Image Reasoning (SciIR), a comprehensive resource for training and evaluating scientific image generation. We formalize scientific reasoning into three core dimensions: Entity Structure (Icon), Scientific Process (Index), and Scientific Law (Symbol). To overcome the scarcity of training data for scientific image generation, we construct SciIR-82k, a large-scale dataset containing over 80,000 high-quality scientific image-text pairs drawn from cutting-edge publications. The dataset is hierarchically organized according to the three semiotic dimensions and incorporates a Scientific Reasoning Chain-of-Thought (Sci-RCoT) to explicitly model the underlying visual logic. For evaluation, we propose SciIR-Bench, which aligns with the same three semiotic levels and employs an Atomic Checklist to convert outcome-oriented scientific accuracy into process-oriented, verifiable, fine-grained questions. Our extensive experiments reveal significant deficiencies in current models' scientific reasoning capabilities. Furthermore, by fine-tuning on SciIR-82k, we develop the Qwen-Image-SciIR model, which achieves a substantial improvement on SciIR-Bench, raising the final score from 35% to 43% and laying a solid foundation for future advances in scientific image generation.
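To make the Atomic Checklist idea concrete, the following is a minimal sketch of how fine-grained yes/no questions, grouped by the three semiotic dimensions, could be aggregated into a single score. The abstract does not specify the actual scoring formula, so everything here is an assumption for illustration: the `ChecklistItem` type, the `score_checklist` function, the unweighted averaging across dimensions, and the example questions are all hypothetical and are not taken from SciIR-Bench.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three dimensions named in the abstract.
DIMENSIONS = ("Entity Structure", "Scientific Process", "Scientific Law")


@dataclass(frozen=True)
class ChecklistItem:
    """One atomic, verifiable question about a generated image (hypothetical type)."""
    dimension: str  # must be one of DIMENSIONS
    question: str   # yes/no question a judge can answer from the image
    passed: bool    # the judge's verdict (human or model); assumed to be given


def score_checklist(items):
    """Return (per-dimension pass rates, unweighted mean over scored dimensions).

    Assumption: each dimension contributes equally to the final score.
    The benchmark itself may weight or aggregate differently.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for item in items:
        if item.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {item.dimension!r}")
        totals[item.dimension] += 1
        passes[item.dimension] += int(item.passed)

    # Only dimensions with at least one question receive a pass rate.
    per_dimension = {d: passes[d] / totals[d] for d in DIMENSIONS if totals[d]}
    overall = sum(per_dimension.values()) / len(per_dimension) if per_dimension else 0.0
    return per_dimension, overall


# Illustrative checklist for an imagined plant-cell diagram (made-up questions).
example_items = [
    ChecklistItem("Entity Structure", "Is a nucleus drawn inside the cell?", True),
    ChecklistItem("Entity Structure", "Is the cell wall outside the membrane?", False),
    ChecklistItem("Scientific Process", "Do arrows show water entering the vacuole?", True),
    ChecklistItem("Scientific Law", "Does flow go from low to high solute concentration?", False),
]

per_dimension, overall = score_checklist(example_items)
print(per_dimension, overall)
```

Splitting one holistic "is this image scientifically correct?" judgment into per-item verdicts is what makes the evaluation process-oriented: a low score can be traced back to the specific questions, and the specific dimension, that failed.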