DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

2026-05-25

Machine Learning
AI summary

The authors created DiscoverPhysics, a benchmark in which AI models explore simulated worlds whose made-up physics rules differ from real physics. The AI runs experiments, observes the results, and tries to explain the physics in both words and code. The benchmark checks whether AIs can reason over many steps and revise their hypotheses based on experimental results. The authors found that even the strongest models pass only half of the worlds and fail where hidden structure must be uncovered, and that predicting well is not the same as understanding the physics. Commercial models also outperformed open-source ones at both designing experiments and drawing conclusions from the data.

Law of motion, Large language models (LLMs), N-body simulation, Experimental design, Physics benchmarks, Trajectory mean squared error (MSE), Hypothesis refinement, Dark matter, Time-varying interactions, Conceptual understanding
Authors
Matt L. Wiemann, Lindsay M. Smith, Peter Melchior, Siddharth Mishra-Sharma, Andrew Gordon Wilson, Pavel Izmailov, Carolina Cuesta-Lázaro
Abstract
Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks an LLM agent to discover the laws of motion of a simulated world whose physics deliberately deviates from our own. We construct 22 worlds governed by, among others, screened and fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, non-coordinate-free physics, and time-varying interactions. Each world is generated on demand by an N-body simulator: the agent proposes several rounds of experiments, observes raw trajectory data, and ultimately submits both a natural-language explanation of the world's physics and a Python implementation of the inferred law. Because solving a world requires the agent to design informative experiments and revise its hypotheses, the benchmark probes long-horizon reasoning over an experimental history. We evaluate submissions along two complementary axes: trajectory MSE on held-out particles and an LLM-judged explanation score that follows an expert-written rubric assessing conceptual understanding of each world. Across eleven frontier models, we find that the strongest agents pass only half of the worlds and consistently fail on those where latent structure must be uncovered. Open-source models lag substantially behind commercial models, both in their ability to design informative experiments and in extracting conclusions from the data. We further find that good predictive accuracy does not guarantee high explanation quality and that conceptual understanding depends on hypothesis refinement through well-chosen experiments.
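
To make the trajectory-MSE axis concrete, the following is a minimal sketch of how a submitted law of motion might be scored against a reference world. It is not the authors' simulator or scoring code. The function names (`accelerations`, `simulate`, `trajectory_mse`), the `power`, `softening`, and `held_out` parameters, the leapfrog integrator, and the toy two-body setup are all assumptions made for illustration. The abstract does not specify how particles are held out or how the error is normalized.

```python
import math


def accelerations(positions, masses, power=2.0, G=1.0, softening=1e-3):
    """Pairwise attraction with force magnitude G * m_i * m_j / r**power.

    power=2.0 is Newtonian gravity; other values give a hypothetical
    fractional-power law of the kind the abstract mentions.
    """
    n, dim = len(positions), len(positions[0])
    acc = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            delta = [positions[j][k] - positions[i][k] for k in range(dim)]
            r = math.sqrt(sum(d * d for d in delta) + softening ** 2)
            # Acceleration of i toward j: (G * m_j / r**power) * unit vector.
            scale = G * masses[j] / r ** (power + 1)
            for k in range(dim):
                acc[i][k] += scale * delta[k]
    return acc


def simulate(positions, velocities, masses, law, dt=1e-2, steps=200):
    """Leapfrog (kick-drift-kick) integration; returns position snapshots."""
    pos = [list(p) for p in positions]
    vel = [list(v) for v in velocities]
    acc = law(pos, masses)
    trajectory = [[list(p) for p in pos]]
    for _ in range(steps):
        for i in range(len(pos)):
            for k in range(len(pos[i])):
                vel[i][k] += 0.5 * dt * acc[i][k]  # half kick
                pos[i][k] += dt * vel[i][k]        # drift
        acc = law(pos, masses)
        for i in range(len(pos)):
            for k in range(len(pos[i])):
                vel[i][k] += 0.5 * dt * acc[i][k]  # second half kick
        trajectory.append([list(p) for p in pos])
    return trajectory


def trajectory_mse(predicted, reference, held_out=None):
    """Mean squared coordinate error, optionally over selected particles."""
    total, count = 0.0, 0
    for snap_p, snap_r in zip(predicted, reference):
        indices = range(len(snap_r)) if held_out is None else held_out
        for i in indices:
            for a, b in zip(snap_p[i], snap_r[i]):
                total += (a - b) ** 2
                count += 1
    return total / count


# Toy world: two equal masses, true law uses power=1.5 (an assumed example).
pos0 = [[-1.0, 0.0], [1.0, 0.0]]
vel0 = [[0.0, -0.3], [0.0, 0.3]]
masses = [1.0, 1.0]

reference = simulate(pos0, vel0, masses, lambda p, m: accelerations(p, m, power=1.5))
newtonian = simulate(pos0, vel0, masses, lambda p, m: accelerations(p, m, power=2.0))
matching = simulate(pos0, vel0, masses, lambda p, m: accelerations(p, m, power=1.5))

print("MSE, Newtonian guess:", trajectory_mse(newtonian, reference))
print("MSE, matching law:   ", trajectory_mse(matching, reference))
```

In this toy setup a Newtonian guess accumulates nonzero error against the fractional-power reference, while the matching law reproduces it exactly. A low MSE alone says nothing about whether the accompanying explanation is correct, which is why the benchmark scores explanations separately.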