Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data
2026-05-11 • Computation and Language
AI summary
The authors study how to make large language models more robust when pre-training on messy web data, which typically contains substantial noise. They introduce a short extra training stage before standard pre-training that uses synthetic data with a learnable time-ordered pattern. This stage helps models downweight the noisy parts of the data during regular training, with larger gains at higher noise levels. It also lets models reach the same final loss with far fewer natural-text tokens. The authors find that the early stage reshapes the model's learning trajectory over time rather than blocking noise immediately.
large language models · pre-training · synthetic data · data noise · attention mechanism · robustness · optimization trajectory · token corruption · model training
Authors
Xu Guo, Runyu Peng, Jian Tong, Yunhua Zhou, Haijun Lv, Zhihui Lu, Qipeng Guo
Abstract
Large language models (LLMs) rely on web-scale corpora for pre-training. The noise inherent in these datasets tends to obscure meaningful patterns and ultimately degrade model performance. Data curation mitigates but cannot eliminate such noise, so pre-training corpora remain noisy in practice. We therefore study whether a lightweight pre-pre-training (PPT) stage based on synthetic data with learnable temporal structure helps resist noisy data during the pre-training (PT) stage. Across various corruption settings, our method consistently improves robustness to noise during PT, with larger relative gains at higher noise levels. For a 1B-parameter model, a synthetic PPT stage with only 65M tokens achieves the same final loss as the baseline while using up to 49% fewer natural-text PT tokens across different noise levels. Mechanistic analyses suggest PPT does not immediately suppress attention to noisy tokens. Rather, PPT-initialized models gradually downweight attention between corrupted tokens during noisy PT. This indicates that synthetic PPT inhibits noise self-modeling and shapes the subsequent optimization trajectory. Code is available at https://github.com/guox18/formal-language-prepretraining.
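The abstract describes three moving parts: a synthetic PPT corpus with learnable temporal structure, token-level corruption of the natural-text PT data, and an attention analysis over corrupted positions. Below is a minimal Python sketch of the first two, assuming a simple order-k deterministic dependency as the "temporal structure" and uniform token replacement as the corruption model; the function names and parameters are illustrative, and the paper's actual constructions are in the linked repository.

```python
import random

def synthetic_ppt_sequence(length, vocab_size=256, order=2, seed=None):
    """Emit a token sequence in which each token is a fixed function of
    the previous `order` tokens, giving the kind of learnable temporal
    structure the PPT stage trains on (an assumed construction)."""
    rng = random.Random(seed)
    seq = [rng.randrange(vocab_size) for _ in range(order)]
    while len(seq) < length:
        context = tuple(seq[-order:])
        # Deterministic next-token rule: fully predictable once the
        # order-`order` dependency has been learned.
        seq.append(hash(context) % vocab_size)
    return seq

def corrupt_tokens(tokens, rate, vocab_size=256, seed=None):
    """Replace a fraction `rate` of tokens with uniform random ones,
    returning the noisy stream plus a mask of corrupted positions."""
    rng = random.Random(seed)
    noisy, mask = [], []
    for t in tokens:
        hit = rng.random() < rate
        noisy.append(rng.randrange(vocab_size) if hit else t)
        mask.append(hit)
    return noisy, mask

# Toy usage: a 65M-token PPT corpus would repeat the first call at scale;
# the second call simulates one noise level for the PT stage.
ppt_seq = synthetic_ppt_sequence(length=64, seed=0)
clean = list(range(32))  # stand-in for natural-text token ids
noisy, mask = corrupt_tokens(clean, rate=0.3, seed=0)
```

The mechanistic claim, that PPT-initialized models gradually downweight attention between corrupted tokens, suggests a simple probe: average the attention mass flowing from corrupted query positions to earlier corrupted key positions and track it over PT steps. The metric below is an assumed formalization, not necessarily the paper's exact measurement.

```python
import numpy as np

def corrupted_to_corrupted_attention(A, mask):
    """Mean attention weight from corrupted query positions to earlier
    corrupted key positions (causal, self-attention excluded).
    A: (n, n) row-stochastic attention matrix; mask: length-n booleans."""
    total, count = 0.0, 0
    for q in range(A.shape[0]):
        if not mask[q]:
            continue
        for k in range(q):  # causal: keys strictly before the query
            if mask[k]:
                total += A[q, k]
                count += 1
    return total / count if count else 0.0

# Toy usage: uniform causal attention over 6 positions, 3 of them corrupted.
n = 6
A = np.tril(np.ones((n, n)))
A /= A.sum(axis=1, keepdims=True)
mask = [False, True, False, True, True, False]
print(corrupted_to_corrupted_attention(A, mask))  # ~0.217
```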