q0: Primitives for Hyper-Epoch Pretraining

2026-06-02

Machine Learning, Artificial Intelligence
AI summary

The authors point out that training a single model for many epochs on the same data is inefficient: the model stops improving after a few passes, even when more compute is available. They propose hyper-epoch pretraining (q0), which trains a population of diverse models instead of just one and combines their predictions to reach a lower validation loss. The approach rests on three ideas: cycling the learning rate and weight decay in opposition to each other, distilling each model from its predecessor, and using a prior fit on a held-out set to select and weight the models. In the reported experiments, q0 matches a 256-epoch ensemble baseline with roughly 56 epochs and keeps improving beyond it, and the best allocation changes with the epoch budget. The gains also transfer to downstream benchmarks.

multi-epoch training, model ensemble, learning rate schedule, weight decay, chain distillation, validation loss, fine-tuning, data efficiency, transfer learning, generalization
Authors
Bishwas Mandal, Shmuel Berman, Akshay Vegesna, Samip Dahal
Abstract
Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model toward exploring a population of models and aggregating their predictions. We introduce hyper-epoch pretraining (q0), which turns a multi-epoch budget into a population of diverse models whose combined predictions reach a lower validation loss than a single refined model. q0 reduces to three core primitives. A cyclic schedule with anti-correlated learning rate and weight decay collects diverse models from a few parallel trajectories. Chain distillation trains each model against its predecessor so that model quality compounds across the population. A learned prior, fit on a held-out set, selects and weights members for any inference budget. On a 1.8B-parameter model trained on 100M FineWeb tokens, q0 matches a strong 256-epoch ensemble baseline using only ${\sim}56$ epochs (${\sim}4.6\times$ fewer), or ${\sim}67$ epochs (${\sim}3.8\times$ fewer) when matched to the baseline's ensemble size, and continues to improve beyond it. These gains reach cumulative ${\sim}12.9\times$ data efficiency under the Slowrun setting and transfer to downstream benchmarks. Crucially, the optimal allocation shifts with the budget, so we give prescriptive recipes for how to spend a given epoch budget to maximize generalization, from a single epoch up to the largest budgets.
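The three primitives named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give functional forms, so everything below is an assumption made for illustration: the cosine shape of the cycle, the `alpha` mixing weight in the distillation loss, top-k selection by weight, and all function names and default values are hypothetical and are not taken from the paper.

```python
import math

def cyclic_lr_wd(step, cycle_len, lr_min=1e-4, lr_max=1e-3, wd_min=0.01, wd_max=0.1):
    """Assumed cosine cycle in which weight decay rises as the learning rate falls."""
    phase = (step % cycle_len) / cycle_len         # position in [0, 1) within the cycle
    hi = 0.5 * (1.0 + math.cos(math.pi * phase))   # 1 at cycle start, toward 0 at cycle end
    lr = lr_min + (lr_max - lr_min) * hi
    wd = wd_min + (wd_max - wd_min) * (1.0 - hi)   # anti-correlated with lr
    return lr, wd

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def chain_distill_loss(student_logits, teacher_logits, target, alpha=0.5):
    """Assumed form: (1 - alpha) * cross-entropy + alpha * KL(teacher || student).

    The teacher is the previous model in the chain; alpha is a hypothetical knob.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    ce = -math.log(p_s[target])
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0.0)
    return (1.0 - alpha) * ce + alpha * kl

def select_members(weights, budget):
    """Keep the `budget` highest-weight members (indices, best first)."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:budget]

def ensemble_probs(member_probs, weights):
    """Weighted average of member distributions; weights are normalized to sum to 1."""
    total = sum(weights)
    n_classes = len(member_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, member_probs)) / total
        for c in range(n_classes)
    ]

if __name__ == "__main__":
    print(cyclic_lr_wd(0, 100))    # high learning rate, low weight decay
    print(cyclic_lr_wd(99, 100))   # low learning rate, high weight decay
    keep = select_members([0.1, 0.6, 0.3], budget=2)
    print(keep)
```

In a real training loop the schedule values would be written into the optimizer at every step, a snapshot would be saved at the end of each cycle, and the member weights would be fit on held-out data rather than supplied by hand.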