Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

2026-05-28 • Artificial Intelligence

Artificial Intelligence • Human-Computer Interaction • Software Engineering

AI summary

The author studied how an AI coding assistant helped a physicist develop a complex physics module over 12 work days. The AI could fix many coding issues on its own but struggled with deeper problems, because it kept tweaking parameters instead of changing the underlying code structure or concepts. The physicist had to supervise carefully, using tests and rules to catch errors the AI missed. The study highlights that trust in AI output depended more on how the human guided the AI than on the AI's raw ability, and that closing the gap would require AI that can rethink code design and better understand scientific meaning.

AI coding agent, differentiable programming, one-loop perturbation theory, JAX, oracle tests, anisotropic BAO damping, physics supervision, model architecture, explanatory correctness, CLAX-PT

Authors

Nhat-Minh Nguyen

Abstract

Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests. Two more were resolved using the physicist's domain knowledge. The three it could not resolve -- all of which evaded oracle detection -- share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory and would predict wrong values at any other cosmology. The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and that distinguish predictive adequacy from explanatory correctness -- capabilities not exhibited here and not obviously addressed by scaling alone. [Abridged.]
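The first supervision practice, testing at parameter points beyond the fiducial calibration, can be illustrated with a toy sketch. Everything here is hypothetical and is not taken from CLAX-PT: the function names, the Gaussian damping form, the parameter `sigma`, and the tolerance are all assumptions chosen for illustration. The sketch shows how a constant fudge factor calibrated at one point passes a fiducial-only check but fails once the check is repeated at other parameter values, while a model that contains the missing term passes both.

```python
import math

# Hypothetical toy setup, not CLAX-PT code: the "oracle" is a reference
# amplitude with Gaussian damping controlled by a parameter sigma.
def oracle(k, sigma):
    return math.exp(-0.5 * (k * sigma) ** 2)

# A model whose structure has no damping term at all.
def undamped_model(k, sigma):
    return 1.0

FIDUCIAL_SIGMA = 5.0  # assumed fiducial parameter value
K_TEST = 0.1          # assumed wavenumber at which the comparison is made
TOL = 1e-3            # assumed relative tolerance of the oracle test

# Calibrated patch: a constant tuned to match the oracle at the fiducial
# point only. It corresponds to no quantity in the toy "theory".
FUDGE = oracle(K_TEST, FIDUCIAL_SIGMA) / undamped_model(K_TEST, FIDUCIAL_SIGMA)

def patched_model(k, sigma):
    return FUDGE * undamped_model(k, sigma)

# Structural fix: the model carries the damping term itself.
def physical_model(k, sigma):
    return math.exp(-0.5 * (k * sigma) ** 2)

def max_rel_error(model, sigmas, k=K_TEST):
    """Largest relative deviation from the oracle over the given parameter points."""
    return max(abs(model(k, s) / oracle(k, s) - 1.0) for s in sigmas)

fiducial_only = [FIDUCIAL_SIGMA]
diverse = [2.0, 5.0, 8.0]

patched_passes_fiducial = max_rel_error(patched_model, fiducial_only) < TOL
patched_passes_diverse = max_rel_error(patched_model, diverse) < TOL
physical_passes_diverse = max_rel_error(physical_model, diverse) < TOL

print("patched, fiducial only:", patched_passes_fiducial)
print("patched, diverse points:", patched_passes_diverse)
print("physical, diverse points:", physical_passes_diverse)
```

The point of the design is that the test set, not the model, changes between the first two checks: the same patched model is accepted or rejected depending only on whether the supervisor samples parameters away from the calibration point.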
