Learning Implicit Bias in Generative Spaces for Accelerating Protein Dynamics Emulation

2026-06-01 · Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors propose a method to improve generative models that simulate protein movements by encouraging them to explore new structures instead of repeating familiar ones. They do this by adding a memory-based bias to the frozen pretrained model, which pushes sampling away from previously generated protein shapes. To keep these new structures physically plausible, a refinement step uses the same pretrained model to correct any drift from realistic protein forms. Their tests report a 35% increase in diversity on DynamicPDB-80 and, on 12 zero-shot Fast-Folding proteins, reaching the original emulator's coverage up to about 37 times faster when the bias is paired with refinement.

protein dynamics, generative emulator, molecular dynamics, enhanced sampling, score-based modeling, reverse-time sampling, data manifold, zero-shot learning, protein folding, low-energy states
Authors
Kaihui Cheng, Zhiqiang Cai, Wenkai Xiang, Zhihang Hu, Siyu Zhu, Tzuhsiung Yang, Yuan Qi
Abstract
Generative emulators of protein dynamics produce plausible trajectories at a fraction of the cost of molecular dynamics, but they inherit their training distribution and, under long-horizon extrapolation, tend to revisit known states rather than reach rare ones. Inspired by classical enhanced sampling, we introduce an implicit, history-dependent bias in the generative space of a pretrained emulator. Specifically, a history-aware score estimator augments the frozen emulator with a distance-weighted bias that steers reverse-time sampling away from previously generated structures, regularized by an environment-support term. To preserve structural validity at long horizons, a score-based refinement step re-projects drifted samples onto the data manifold using the frozen emulator. Our experiments demonstrate that (i) the method raises diversity by $35\%$ on DynamicPDB-80; and (ii) on $12$ zero-shot Fast-Folding proteins, the learned bias alone reaches the unbiased emulator's coverage up to ${\sim}15\times$ faster, while pairing it with refinement reaches that coverage up to ${\sim}37\times$ faster and covers ${\sim}3\times$ as many low-energy states. Code will be released soon.
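To make the idea of a history-dependent bias on a frozen score concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a 2D state, a standard-Gaussian stand-in for the pretrained emulator's score, and a hand-written metadynamics-style Gaussian-hill repulsion in place of the paper's learned history-aware estimator. The names `frozen_score`, `history_bias`, `biased_score`, and `refine`, and the parameters `weight`, `sigma`, `n_steps`, and `step`, are all hypothetical; the environment-support regularizer mentioned in the abstract is omitted.

```python
import numpy as np


def frozen_score(x):
    """Stand-in for the frozen emulator's score (assumption: standard Gaussian)."""
    return -x


def history_bias(x, history, weight=1.0, sigma=0.5):
    """Distance-weighted repulsion from previously generated samples.

    This is minus the gradient of a sum of Gaussian hills centred on the
    history points, so it points away from nearby past samples and decays
    with distance. It is a hand-written illustration, not a learned bias.
    """
    if len(history) == 0:
        return np.zeros_like(x)
    diff = x - np.asarray(history)                  # shape (n, d)
    sq_dist = np.sum(diff ** 2, axis=1)             # shape (n,)
    kernel = np.exp(-sq_dist / (2.0 * sigma ** 2))  # closer points weigh more
    return weight * np.sum(kernel[:, None] * diff, axis=0) / sigma ** 2


def biased_score(x, history, **bias_kwargs):
    """Frozen score plus the history-dependent bias."""
    return frozen_score(x) + history_bias(x, history, **bias_kwargs)


def refine(x, n_steps=20, step=0.1):
    """Pull a drifted sample back using only the frozen score (noise-free ascent)."""
    for _ in range(n_steps):
        x = x + step * frozen_score(x)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    history = []
    x = rng.normal(size=2)
    # Langevin-style updates driven by the biased score; each new sample
    # is appended to the history so later steps are pushed away from it.
    for _ in range(50):
        noise = rng.normal(size=2)
        x = x + 0.05 * biased_score(x, history) + np.sqrt(2 * 0.05) * noise
        history.append(x.copy())
    print("last biased sample:", x)
    print("after refinement:  ", refine(x))
```

In this sketch the bias only modifies the sampling dynamics, while the stand-in emulator is never updated, mirroring the frozen-emulator setup described above. The `refine` step reuses the same frozen score without the bias, which is one simple way to illustrate re-projection toward high-density regions.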