Matter to Mechanism: A Benchmark for AI Co-Scientists in Materials and Battery Research

2026-06-01
Computational Engineering, Finance, and Science

AI summary

The authors created a benchmark called Matter to Mechanism to test how well AI systems can move from a concrete scientific problem to a plausible solution hypothesis in materials science, especially battery research. Each benchmark instance pairs a problem statement with a candidate solution, an explicit reasoning trace, and domain annotations. They also introduce metrics that assess the quality of an AI system's reasoning and ideas rather than only text similarity. Evaluating several AI systems, they found differences that standard text-similarity metrics only partially capture, and showed that their combined score is more stable under superficial gaming attacks than its individual metric dimensions.

AI co-scientist, materials science, battery research, mechanism hypothesis, failure mode, intervention, reasoning trace, benchmark, evaluation metrics, problem decomposition
Authors
Shashwat Sourav, Tanjin He, Maria K. Y. Chan, Anubhav Jain, Tirthankar Ghosal
Abstract
AI co-scientists are increasingly used for scientific discovery, but current evaluations still do not test them on a key task: moving from a concrete scientific or technological problem to a plausible, mechanism-grounded solution hypothesis. This gap is especially important in materials science and, in particular, battery research, where a useful proposal must identify the relevant failure mode, propose a credible intervention, and explain why that intervention should improve the target property. We introduce Matter to Mechanism, a benchmark for evaluating AI co-scientists on problem-to-hypothesis reasoning in materials science, with a focus on battery materials research. The benchmark contains 2,645 instances derived from scientific publications. Each instance includes a structured problem statement, a candidate solution hypothesis, an explicit reasoning trace, and domain-grounded annotations such as material system, component, failure mode, intervention, mechanism, target property, and claimed outcome. We also introduce a metric suite that measures reasoning fidelity, problem alignment, mechanistic specificity, novelty, plausibility, and problem decomposition quality, and combine them into a composite score. Using this framework, we evaluate several AI co-scientist systems and show that Matter to Mechanism reveals interpretable system differences that are only partially recovered by standard text-similarity metrics. We further show through adversarial stress tests that the aggregate score is more stable than individual metric dimensions under superficial gaming attacks.
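The instance structure and composite score described in the abstract can be pictured with a minimal sketch. The field names below are snake_case renderings of the items the abstract lists. The weighted-mean aggregation, the assumption that each metric lies in [0, 1], and the names `Instance`, `Annotations`, and `composite_score` are illustrative assumptions, since the abstract does not say how the metrics are combined.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Metric dimensions named in the abstract; the key spellings are illustrative.
METRIC_DIMENSIONS = (
    "reasoning_fidelity",
    "problem_alignment",
    "mechanistic_specificity",
    "novelty",
    "plausibility",
    "problem_decomposition_quality",
)


@dataclass
class Annotations:
    """Domain-grounded annotations listed in the abstract."""
    material_system: str
    component: str
    failure_mode: str
    intervention: str
    mechanism: str
    target_property: str
    claimed_outcome: str


@dataclass
class Instance:
    """One benchmark instance (hypothetical schema, not the released format)."""
    problem_statement: str
    solution_hypothesis: str
    reasoning_trace: str
    annotations: Annotations


def composite_score(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-dimension scores into a single number.

    ASSUMPTION: a weighted mean over scores in [0, 1]. The abstract only
    states that the metrics are combined, not how.
    """
    missing = [d for d in METRIC_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing metric dimensions: {missing}")
    if weights is None:
        # Equal weighting is a placeholder choice, not the paper's.
        weights = {d: 1.0 for d in METRIC_DIMENSIONS}
    total = sum(weights[d] for d in METRIC_DIMENSIONS)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[d] * scores[d] for d in METRIC_DIMENSIONS) / total
```

Under this assumed aggregation, inflating a single dimension moves the composite by at most that dimension's share of the total weight, which gives one intuition for why an aggregate can be more stable than its individual dimensions under superficial gaming.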