How Post-Training Shapes Biological Reasoning Models

2026-06-15 · Machine Learning

AI summary

The authors studied how different post-training stages affect models that reason about biological data such as DNA, RNA, and proteins. They found that each stage (continued pre-training, supervised fine-tuning, and reinforcement learning) changes how well a model handles in-domain versus out-of-domain data in a different way. Continued pre-training improves downstream performance by aligning the model with biological language. Supervised fine-tuning steadily improves in-domain performance, but out-of-domain performance peaks early and then declines as the model fits the training distribution. Reinforcement learning, when applied to strong fine-tuned checkpoints with aligned rewards, partially recovers that lost generalization. Overall, the authors show that results depend on how these stages are combined, not on simply adding more supervision or compute.

language models, continued pre-training (CPT), supervised fine-tuning (SFT), reinforcement learning (RL), biological reasoning, genomics, transcriptomics, out-of-domain generalization, model overfitting, foundation models
Authors
Lukas Fesser, Hanlin Zhang, Michelle M. Li, Eric Wang, Bryan Perozzi, Shekoofeh Azizi, Sham M. Kakade, Marinka Zitnik
Abstract
Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models under controlled variation in backbone, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), measuring both in-domain (ID) and out-of-domain (OOD) performance. We find that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains. CPT improves downstream performance by aligning models with biological language. SFT consistently increases ID performance but causes OOD performance to peak early and decline as models fit the training distribution. RL, when applied to strong SFT checkpoints with aligned rewards, improves OOD performance and partially recovers generalization. These results show that biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages.
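
The observation that OOD performance peaks early during SFT suggests a simple checkpoint-selection rule: track ID and OOD scores per checkpoint and hand off to RL from the checkpoint where OOD is highest. The sketch below is a hedged illustration of that idea only, not the authors' procedure. The function names `ood_peak_step` and `is_overspecializing` are hypothetical, and the scores in the usage example are synthetic placeholders, not results from the paper.

```python
from typing import Sequence


def ood_peak_step(steps: Sequence[int], ood_scores: Sequence[float]) -> int:
    """Return the training step with the highest OOD score (earliest on ties)."""
    if not steps or len(steps) != len(ood_scores):
        raise ValueError("steps and ood_scores must be non-empty and equal length")
    # Sort key: higher score wins; on equal scores the smaller index wins.
    best = max(range(len(steps)), key=lambda i: (ood_scores[i], -i))
    return steps[best]


def is_overspecializing(id_scores: Sequence[float],
                        ood_scores: Sequence[float]) -> bool:
    """True if, after the OOD peak, ID kept rising while OOD fell."""
    if not id_scores or len(id_scores) != len(ood_scores):
        raise ValueError("score lists must be non-empty and equal length")
    peak = max(range(len(ood_scores)), key=lambda i: (ood_scores[i], -i))
    id_improved = id_scores[-1] > id_scores[peak]
    ood_declined = ood_scores[-1] < ood_scores[peak]
    return id_improved and ood_declined


if __name__ == "__main__":
    # Synthetic placeholder values for illustration only (not from the paper).
    sft_steps = [100, 200, 400, 800, 1600]
    id_acc = [0.50, 0.60, 0.70, 0.78, 0.83]
    ood_acc = [0.40, 0.48, 0.46, 0.41, 0.35]

    print("Candidate SFT checkpoint for RL:", ood_peak_step(sft_steps, ood_acc))
    print("Over-specialization detected:", is_overspecializing(id_acc, ood_acc))
```

Selecting on a single OOD metric is an assumption made for brevity; in practice one would likely average over several held-out tasks and seeds before picking a checkpoint.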