Data Augmentations for Data-Constrained Language Model Pretraining

2026-06-15, Machine Learning

Machine Learning, Artificial Intelligence, Computation and Language
AI summary

The authors study how to pretrain language models when new text data is scarce but compute is plentiful. They find that standard autoregressive pretraining reaches its best validation loss early and then degrades as the same data is repeated over many epochs. To counter this overfitting, they apply three kinds of data augmentation: token-level noise (masking or randomly replacing tokens), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and predicting tokens further ahead than the next one. Each augmentation delays overfitting and lowers validation loss relative to the baseline, and combining categories lowers the minimum validation loss further.

language model pretraining, autoregressive model, overfitting, data augmentation, token-level noise, sequence permutation, validation loss, multi-epoch training, compute capacity, data-efficient learning
Authors
Michael K. Chen, Xikun Zhang, Zhen Wang
Abstract
As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction ($x_{t+i}$ for $i > 1$). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution to the data-constrained regime. All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining
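To make the three augmentation categories named in the abstract concrete, here is a minimal sketch on lists of integer token IDs. It is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The function names, the noise probability `p`, the sentinel IDs (`mask_id`, `pre_id`, `suf_id`, `mid_id`), the prefix-suffix-middle layout for Fill-in-the-Middle, and the choice to apply noise independently per token are all hypothetical.

```python
import random

# Illustrative only: names, sentinel IDs and sampling choices are assumptions,
# not taken from the paper or its repository.

def mask_tokens(tokens, mask_id, p, rng):
    """Token-level noise: swap each token for mask_id with probability p."""
    return [mask_id if rng.random() < p else t for t in tokens]

def replace_tokens(tokens, vocab_size, p, rng):
    """Token-level noise: swap each token for a uniformly random ID with probability p."""
    return [rng.randrange(vocab_size) if rng.random() < p else t for t in tokens]

def right_to_left(tokens):
    """Sequence permutation: reverse the sequence so the model predicts right to left."""
    return tokens[::-1]

def fill_in_the_middle(tokens, rng, pre_id, suf_id, mid_id):
    """Sequence permutation: cut at two random points and move the middle span to the end.

    Assumed layout: [PRE] prefix [SUF] suffix [MID] middle. Needs at least one token.
    """
    i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [pre_id] + prefix + [suf_id] + suffix + [mid_id] + middle

def offset_pairs(tokens, i):
    """Target offset prediction: pair each input x_t with the target x_{t+i}.

    i = 1 is ordinary next-token prediction; the abstract considers i > 1.
    """
    if not 1 <= i < len(tokens):
        raise ValueError("offset i must satisfy 1 <= i < len(tokens)")
    return tokens[:-i], tokens[i:]

if __name__ == "__main__":
    rng = random.Random(0)
    toks = [5, 6, 7, 8, 9]
    print(right_to_left(toks))       # [9, 8, 7, 6, 5]
    print(offset_pairs(toks, 2))     # ([5, 6, 7], [7, 8, 9])
    print(replace_tokens(toks, vocab_size=100, p=0.3, rng=rng))
    print(fill_in_the_middle(toks, rng, pre_id=100, suf_id=101, mid_id=102))
```

Because each function maps one token list to another, the categories compose freely in this sketch, for example applying `replace_tokens` to a sequence before `offset_pairs`. Whether noise is applied to inputs only or also to targets, and how augmentations are scheduled across epochs, are design choices the abstract does not specify.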