Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training

2026-05-25 • Machine Learning

Machine Learning • Computation and Language

AI summary

The authors studied how learning-rate schedules interact with bit-width when small (sub-100M) language models are trained from initialisation with quantization-aware training (QAT), which trains the model at reduced numerical precision to save resources. For FP16, INT8, and INT6, the best schedule did not change with bit-width: a 33% warmdown was optimal at every tested precision and model size. For INT4, the result depends on model size: a 33% warmdown is decisively optimal at 50M and 100M, while below 50M no schedule preference is statistically distinguishable from seed-level noise. A weight-to-grid-distance probe also ruled out the simplest explanation for the null result at the higher precisions, namely that weights rapidly snap to the quantization grid. Their practical advice is to tune the schedule once at FP16 and reuse it unchanged for INT8 and INT6 QAT, to use a 33% warmdown for INT4 at 50M and above, and to treat the schedule choice for INT4 below 50M as being within noise.

quantization-aware training, learning-rate schedule, bit-width, language models, INT4, INT6, INT8, FP16, warmdown, model size

Authors

Christian Brandt Thomassen

Abstract

We test whether the optimal learning-rate schedule depends on bit-width during from-initialisation quantisation-aware training (QAT) for sub-100M decoder language models. A 720-run factorial grid (Phase 2) over bit-width x warmdown fraction x LR magnitude x model size x seed (FP16/INT8/INT6, 15M-100M, 5 seeds) finds the optimal warmdown is 33% at every (bit-width, size) cell. The primary hypothesis -- that INT6 QAT requires a different schedule than higher-precision training -- is falsified at FP16/INT8/INT6. A 625-run follow-up (Phase 5) probes the null along five axes: optimiser (AdamW), schedule shape (cosine), training length (up to 9x more iterations), an extended size sweep (5M-350M), and an INT4 sweep from 3M to 100M. The null is robust under all three setup changes (optimiser, schedule shape, training length). The INT6 penalty follows a log-linear scaling law whose fit on Phase 2 predicts the five held-out Phase 5 sizes (5M, 8M, 175M, 250M, 350M) within their 95% prediction intervals (5/5). For INT4 the picture is sharper than at the higher precisions: at 50M and 100M, wd33 (33% warmdown) is decisively optimal (paired z ~ 12-15, 10/10 seeds); below 50M, across the six tested sizes from 3M to 30M, no individual size shows a statistically significant schedule preference and the per-size mean penalty oscillates within seed-level noise. The boundary is therefore a transition between a noise-dominated regime below 50M and a decisive wd33 regime at and above 50M, not a clean wd10 region. A weight-to-grid-distance probe falsifies the simplest mechanism for the FP16/INT8/INT6 null result (rapid grid-snapping): pre-warmdown, INT6-QAT weights sit at essentially the same distance from the INT6 grid as FP16 weights (ratio ~ 1.04). Practical recommendation: at sub-100M scale, tune the LR schedule once at FP16 and apply it unchanged to INT8/INT6 QAT; for INT4 at 50M+ use wd33; for INT4 below 50M the schedule choice is in the noise.
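
To make the "warmdown fraction" concrete, the following is a minimal sketch of one common reading of it: the learning rate is held at its base value and then decays linearly to zero over the final fraction of training. The abstract does not spell out the exact schedule shape, so the linear decay, the absence of a warmup phase, and the function name `warmdown_lr` with its parameters are all illustrative assumptions, not the author's implementation.

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float, warmdown_frac: float) -> float:
    """Learning rate at `step` for a constant-then-linear-warmdown schedule.

    Assumption (not from the paper): the LR stays at `base_lr` and then decays
    linearly to zero over the last `warmdown_frac` of `total_steps`.
    """
    if not 0.0 <= warmdown_frac <= 1.0:
        raise ValueError("warmdown_frac must lie in [0, 1]")
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")

    warmdown_steps = int(round(total_steps * warmdown_frac))
    warmdown_start = total_steps - warmdown_steps

    # Constant phase (also covers warmdown_frac == 0, i.e. no decay at all).
    if warmdown_steps == 0 or step < warmdown_start:
        return base_lr

    # Linear decay: base_lr at warmdown_start, zero at total_steps.
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps


if __name__ == "__main__":
    # Hypothetical run length and base LR, chosen only for illustration.
    for s in (0, 150, 201, 250, 300):
        print(s, warmdown_lr(s, total_steps=300, base_lr=1e-3, warmdown_frac=0.33))
```

Under this reading, a configuration labelled wd33 would correspond to `warmdown_frac=0.33`, and the recommendation to tune once at FP16 amounts to reusing the same `base_lr` and `warmdown_frac` unchanged for INT8 and INT6 QAT runs.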
