LitXBench: A Benchmark for Extracting Experiments from Scientific Literature

2026-04-08 | Information Retrieval

AI summary

The authors created LitXBench, a framework for testing how well automated methods extract detailed experiment information from scientific papers on materials. They also built a test set, LitXAlloy, containing 1426 measurements from 19 papers on alloys (mixtures of metals). Instead of text formats such as CSV or JSON, the data is stored as Python objects, which makes it easier to audit and to validate with code. They found that frontier language models extract this information better than existing multi-turn extraction pipelines, by up to 0.37 F1. The gap appears to arise because those pipelines link measurements to a material's composition rather than to the processing steps it went through.

Materials Science, Property Prediction, Experimental Data Extraction, Benchmarks, Language Models, Alloys, Data Validation, Python Objects, F1 Score, Information Extraction
Authors
Curtis Chong, Jorge Colindres
Abstract
Aggregating experimental data from papers enables materials scientists to build better property prediction models and to facilitate scientific discovery. Recently, interest has grown in extracting not only single material properties but also entire experimental measurements. To support this shift, we introduce LitXBench, a framework for benchmarking methods that extract experiments from literature. We also present LitXAlloy, a dense benchmark comprising 1426 total measurements from 19 alloy papers. By storing the benchmark's entries as Python objects, rather than text-based formats such as CSV or JSON, we improve auditability and enable programmatic data validation. We find that frontier language models, such as Gemini 3.1 Pro Preview, outperform existing multi-turn extraction pipelines by up to 0.37 F1. Our results suggest that this performance gap arises because extraction pipelines associate measurements with compositions rather than with the processing steps that define a material.
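
To make the abstract's two technical ideas concrete, the following is a minimal sketch of how a benchmark entry stored as a Python object can validate itself, and how an F1 score penalizes a measurement attached to the wrong processing route. It is not the LitXBench schema or scoring code: the `Measurement` class, its fields, the `ALLOWED_UNITS` table, the exact-match rule in `f1_score`, and all numeric values are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical unit whitelist; the real benchmark may validate differently.
ALLOWED_UNITS = {"yield_strength": ("MPa", "GPa"), "hardness": ("HV",)}


@dataclass(frozen=True)
class Measurement:
    """One extracted measurement (illustrative schema, not the LitXBench one)."""
    composition: str             # e.g. "CoCrFeNi"
    processing: Tuple[str, ...]  # ordered processing steps applied to the sample
    prop: str                    # property name, e.g. "yield_strength"
    value: float
    unit: str

    def __post_init__(self):
        # Checks run at construction time, which a plain CSV or JSON row
        # would not enforce on its own.
        if not self.composition:
            raise ValueError("composition must be non-empty")
        if self.unit not in ALLOWED_UNITS.get(self.prop, ()):
            raise ValueError(f"unit {self.unit!r} not allowed for {self.prop!r}")


def f1_score(predicted, gold):
    """F1 over exact matches; a match requires every field to be equal."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Made-up gold data: same composition, two processing routes, two values.
gold = [
    Measurement("CoCrFeNi", ("as-cast",), "yield_strength", 300.0, "MPa"),
    Measurement("CoCrFeNi", ("as-cast", "annealed"), "yield_strength", 250.0, "MPa"),
]

# A prediction that finds both values but attaches the second one to the
# wrong processing route, as a composition-keyed pipeline might.
predicted = [
    gold[0],
    Measurement("CoCrFeNi", ("as-cast",), "yield_strength", 250.0, "MPa"),
]

print(f1_score(gold, gold))       # 1.0
print(f1_score(predicted, gold))  # 0.5
```

In this sketch both values were extracted correctly, yet the score halves because one of them is tied to the wrong processing steps. Exact equality on floating-point values is a simplification; a real scorer would likely need tolerances and unit normalization.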