AI summary
The authors explore how automated machine learning can improve molecular property predictions not just by fitting models to fixed data but by actively changing features and models and adding external information. They test these changes on many prediction tasks and find that improvements seen on validation data during the search often do not fully carry over to new, unseen data. They also show that carefully filtered external data can help in some cases, but that avoiding overlap with test data is crucial for judging true gains. Their approach outperforms a standard automated machine learning control and competes well with a large pretrained model. The study highlights the importance of validating automated research methods on separate, untouched data to confirm real improvements.
Automated machine learning, Molecular property prediction, Validation vs test performance, External evidence, Feature engineering, Model selection, Data contamination, Benchmark datasets, Closed-loop system, Pretrained models
Authors
Jingjie Ning, Xiaochuan Li, Ji Zeng, Chenyan Xiong, Guolin Ke
Abstract
Closed-loop Auto Research extends automated machine learning from fitting a fixed dataset to changing the research workflow itself, with language-model agents editing representations and model code and acquiring external evidence. Molecular property prediction spans many small endpoints. We ask whether this action space yields improvements that generalize beyond the validation signal used to select them. We isolate three Auto Research axes (features, models, and external evidence) under a file-level ablation lock that attributes each gain to a single axis over a strong baseline. Across 36 endpoints in three benchmark suites, we score each selected configuration once on a held-out test set whose labels the search never read. A routed pipeline that takes each endpoint's best validation axis reaches positive held-out gains of 0.013, 0.011, and 0.042, and the transferable axis differs by suite: data on TDC, model on Polaris, and feature and model on MoleculeNet. The largest model-search gain falls from 0.041 on validation to 0.003 on test, while curated data reaches 0.022 on validation but -0.019 on test, two signatures of non-transfer. Curated external data raises held-out CYP2C9-substrate performance by 0.17 and half-life performance by 0.08. These data were admitted through a contamination filter that rejects same-source files overlapping 64 to 89 percent of test structures; the filter is necessary but not sufficient for transfer. A matched-trial automated machine learning control did not reproduce the agent's code-level model intervention, reaching 0.006 against 0.042, and the pipeline stays competitive with an 84M-parameter pretrained 3D model on the shared training split. The experiments stay within molecular property prediction, but separating discovery from held-out certification is a domain-agnostic lesson for any closed-loop system optimising a proxy for a held-out quantity.
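The routing and contamination-filter ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `AxisResult`, `route`, `mean_heldout_gain`, `overlap_fraction`, and `admit` are hypothetical, the `max_overlap` threshold is an assumption (the abstract reports observed overlaps of rejected files, not the cutoff used), structures are assumed to be compared as canonical identifier strings, and the numbers in the usage example are toy values rather than results from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AxisResult:
    """Gains over the baseline for one axis on one endpoint (hypothetical container)."""
    val_gain: float   # validation gain: the only signal the search may read
    test_gain: float  # held-out gain: scored once, after selection is frozen


def route(results: dict[str, dict[str, AxisResult]]) -> dict[str, str]:
    """For each endpoint, pick the axis with the highest validation gain."""
    return {
        endpoint: max(per_axis, key=lambda axis: per_axis[axis].val_gain)
        for endpoint, per_axis in results.items()
    }


def mean_heldout_gain(results: dict[str, dict[str, AxisResult]],
                      chosen: dict[str, str]) -> float:
    """Average held-out gain of the routed choices; test labels are read only here."""
    gains = [results[endpoint][axis].test_gain for endpoint, axis in chosen.items()]
    return sum(gains) / len(gains)


def overlap_fraction(candidate_ids: list[str], test_ids: list[str]) -> float:
    """Fraction of test structures that also appear in a candidate external file.

    Assumes both lists hold comparable canonical identifiers (an assumption).
    """
    test = set(test_ids)
    if not test:
        return 0.0
    return len(test & set(candidate_ids)) / len(test)


def admit(candidate_ids: list[str], test_ids: list[str],
          max_overlap: float = 0.0) -> bool:
    """Admit an external file only if its overlap with the test set is small enough.

    max_overlap = 0.0 is an assumed, strict default, not a value from the paper.
    """
    return overlap_fraction(candidate_ids, test_ids) <= max_overlap


# Toy values for illustration only; these are not results from the paper.
toy_results = {
    "endpoint_a": {
        "feature": AxisResult(val_gain=0.010, test_gain=0.008),
        "model": AxisResult(val_gain=0.040, test_gain=0.002),
        "data": AxisResult(val_gain=0.020, test_gain=-0.020),
    },
    "endpoint_b": {
        "feature": AxisResult(val_gain=0.000, test_gain=0.000),
        "model": AxisResult(val_gain=0.005, test_gain=0.004),
        "data": AxisResult(val_gain=0.030, test_gain=0.025),
    },
}

toy_chosen = route(toy_results)
toy_gain = mean_heldout_gain(toy_results, toy_chosen)
```

In this toy setup the router picks the model axis for `endpoint_a` because its validation gain is largest, even though most of that gain does not survive on the held-out score. That gap between the selecting signal and the certified result is the kind of non-transfer the abstract describes.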