Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature
2026-06-29 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary
The authors created MatMMExtract, a tool that splits compound scientific figures into individual panels and pairs each panel with a detailed, structured description. They applied this tool to nearly 15,000 materials science articles, producing a large dataset called MatSciFig with hundreds of thousands of image-text pairs. To locate figure panels precisely, they developed MaterialScope, a specialized dataset for training detection models. They also compared several language models for generating accurate annotations and showed that a retrieval baseline built on their dataset substantially outperforms zero-shot CLIP at image-text retrieval. All their tools and data are publicly shared to help other researchers.
materials science, compound figures, image-text pairing, large language models, panel detection, visualisation taxonomy, YOLO detector, image-text retrieval, dataset annotation, vision-language learning
Authors
Subham Ghosh, Shubham Tiwari, Mohammad Ibrahim, Abhishek Tewari
Abstract
The materials science literature encodes decades of experimental knowledge in figures, yet this visual record remains inaccessible to AI at scale. The core difficulty is structural: most scientific figures are compound, with a single caption describing multiple sub-panels simultaneously, making direct image-text pairing unreliable. We present MatMMExtract, an end-to-end open-source pipeline that resolves this by decomposing compound figures into individual sub-panels and generating structured, grounded annotations using a large language model guided by a curated materials science taxonomy. Applied to 14,810 open-access articles, MatMMExtract produces MatSciFig: 391,606 panel-level image-text pairs from 180,571 figures, each annotated with a sub-caption, a two-level visualisation category spanning 19 classes and over 100 subtypes, and a scientific summary. To enable accurate panel localisation, we introduce MaterialScope, a domain-specific detection dataset of 2,811 manually annotated materials science figures, on which a fine-tuned YOLO12-m detector achieves an mAP_50 of 0.9227. Among six benchmarked language models, Gemini 3.1 Flash Lite delivers the best cost-quality trade-off for annotation generation, with 82% of outputs rated good and a hallucination rate of 4.8%. A dual-encoder retrieval baseline on MatSciFig achieves a 4.4-fold improvement in R@1 over zero-shot CLIP, demonstrating the dataset's immediate utility for vision-language learning. All resources are released openly to the community.
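For readers unfamiliar with the retrieval metric, R@1 (Recall@1) is the fraction of queries whose correct match is ranked first. The following is a minimal sketch of how Recall@K can be computed for a dual-encoder setup, not the authors' evaluation code. It assumes text-to-image retrieval with cosine similarity and that the i-th text is paired with the i-th image; the function names and the toy embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def recall_at_k(image_embs, text_embs, k=1):
    """Fraction of text queries whose paired image (same index) is in the top k.

    Assumption: image_embs[i] and text_embs[i] form the ground-truth pair.
    """
    if not text_embs or len(image_embs) != len(text_embs):
        raise ValueError("need equally sized, non-empty embedding lists")
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    hits = 0
    for i, query in enumerate(texts):
        # Cosine similarity between this text query and every image.
        scores = [sum(a * b for a, b in zip(query, img)) for img in images]
        # Image indices sorted from most to least similar.
        ranked = sorted(range(len(images)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(texts)


# Toy embeddings (made up for illustration, not taken from MatSciFig).
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]

r_at_1 = recall_at_k(image_embs, text_embs, k=1)
r_at_2 = recall_at_k(image_embs, text_embs, k=2)
```

In this toy case the third text is closest to the first image, so it counts as a miss at K=1 but a hit at K=2. A relative improvement such as the one quoted in the abstract would then be the ratio of two such R@1 values, one per model, measured on the same test pairs.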