How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws
2026-05-25 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors study how best to use scarce high-quality data when training large language models, and in particular how to schedule it jointly with batch size. They identify two roles for high-quality data: as a signal amplifier, where lowering the batch size converts cleaner data into more signal, and as a noise suppressor, where placing it late in training reduces terminal noise. Existing curriculum-style pipelines mostly exploit the second role and miss the first. Based on this, the authors propose a midtraining schedule called Drop-Stable-Rampup, which drops the batch size at the quality transition, holds it stable, and then ramps it up. It improves average accuracy over Warmup-Stable-Decay and Cosine-decay baselines, with the largest gains on mathematical reasoning benchmarks.
large language models, training dynamics, batch size, data quality, functional scaling laws, curriculum learning, signal-to-noise ratio, Drop-Stable-Rampup, Mixture-of-Experts, mathematical reasoning benchmarks
Authors
Zhitao Zhu, Xili Wang, Shizhe Wu, Jiawei Fu, Xiaoqing Liu
Abstract
High-quality data is scarce in large language model (LLM) training, yet how to schedule its use jointly with training dynamics lacks theoretical guidance. We extend functional scaling laws by incorporating a data-quality dimension, and solve the joint data-quality and batch-size scheduling problem in asymptotic closed form. The solution reveals two regimes and a dual role of high-quality data. In the noise-limited regime, high-quality data should be used as a signal amplifier: lowering the batch size converts cleaner data into more signal without amplifying noise. In the signal-limited regime, it should be used as a noise suppressor: late placement reduces terminal noise without sacrificing signal accumulation. Existing curriculum-style pipelines primarily exploit the second role by placing cleaner data late, but miss the first role because conventional decay schedules reduce update intensity exactly when high-quality data becomes available. Guided by this, we propose Drop-Stable-Rampup for LLM midtraining: upon the quality transition, drop the batch size, hold it stable to accumulate signal, then ramp up to suppress terminal noise. On a 15B Mixture-of-Experts model midtrained on 108B tokens, Drop-Stable-Rampup improves average accuracy over Warmup-Stable-Decay (WSD) by +1.70 and over Cosine-decay by +2.98, with particularly large gains on mathematical reasoning benchmarks such as GSM8K (+4.23) and MATH (+2.80).
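The three phases of Drop-Stable-Rampup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the parameters `stable_steps`, `ramp_steps`, `low_batch`, and `peak_batch`, and the linear shape of the ramp are all assumptions made for illustration, since the abstract does not give the phase lengths, batch sizes, or ramp shape.

```python
def drop_stable_rampup(step, stable_steps, ramp_steps, low_batch, peak_batch):
    """Hypothetical batch-size schedule, with `step` counted from the quality transition.

    Assumptions (not from the paper): a linear ramp, and explicit phase lengths.
    The "drop" is implicit: `low_batch` is chosen smaller than the batch size
    used before the switch to high-quality data.
    """
    total_steps = stable_steps + ramp_steps
    if ramp_steps <= 0 or not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, stable_steps + ramp_steps)")

    # Stable phase: hold the small batch to accumulate signal.
    if step < stable_steps:
        return low_batch

    # Rampup phase: grow linearly so the final step uses peak_batch,
    # which is meant to suppress terminal noise.
    progress = (step - stable_steps + 1) / ramp_steps  # in (0, 1]
    return round(low_batch + progress * (peak_batch - low_batch))


if __name__ == "__main__":
    # Illustrative values only: 6 stable steps, 4 ramp steps.
    schedule = [drop_stable_rampup(s, 6, 4, 256, 2048) for s in range(10)]
    print(schedule)
```

In an actual training loop the returned value would set the number of sequences per optimizer step. Whether a linear ramp matches the schedule derived in the paper is not something the abstract settles.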