Scalable Hyperparameter-Divergent Ensemble Training with Automatic Learning Rate Exploration for Large Models

2026-04-27

Machine Learning, Artificial Intelligence
AI summary

The authors introduce Hyperparameter-Divergent Ensemble Training (HDET), a way to use multiple GPUs not just to speed up training but also to try different learning rates at the same time. The method alternates between letting each GPU train independently with its own rate and periodically averaging the results across GPUs. The authors also build an automatic controller that adjusts the shared learning rate based on how well each GPU performs, improving training without extra training budget or changes to the model. The same approach can explore other settings, such as dropout or weight decay, making training more efficient and adaptive.

Keywords
data-parallel stochastic gradient descent, GPU replicas, learning rate, ensemble training, AllReduce, learning rate scheduler, hyperparameter tuning, dropout, weight decay, meta-update
Authors
Hailing Cheng, Tao Huang, Chen Zhu, Antonio Alonso
Abstract
Training large neural networks with data-parallel stochastic gradient descent allocates N GPU replicas to compute effectively identical updates -- a practice that leaves the rich space of learning rate configurations entirely unexplored during training. We propose Hyperparameter-Divergent Ensemble Training (HDET), a method that repurposes these replicas for simultaneous learning rate exploration with negligible communication overhead. HDET operates in alternating phases: a fan-out stage in which replicas train independently under a structured, symmetric spread of learning rates, and a converge stage in which parameters are averaged across all replicas via AllReduce every T steps. Building on this ensemble substrate, we further propose an automatic learning rate (auto-LR) controller that treats the relative training loss across replicas as a performance signal, updating the shared base schedule toward higher-performing configurations via a momentum-based gradient-free meta-update. The combined method produces a self-adapting learning rate schedule that improves both optimization quality and generalization without additional hyperparameter sweeps or training budget. Crucially, the framework generalizes beyond the learning rate: any scalar hyperparameter that does not alter the model architecture -- such as dropout rate, attention scale temperature, or weight-decay coefficient -- can be explored across replicas using the same fan-out/converge protocol, with inter-replica loss differences serving as zero-order hypergradients that guide the search direction. HDET is implemented as a drop-in replacement for PyTorch's OneCycleLR scheduler, requiring no changes to model architecture, optimizer, or data pipeline.
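
The abstract describes the mechanics at a high level; the sketch below illustrates what one fan-out/converge round and a loss-guided meta-update could look like on top of torch.distributed. It is a minimal, hypothetical reconstruction: the function hdet_round, the geometric spread factor, the T, spread, and meta_step values, and the log-space momentum rule are assumptions rather than the authors' implementation, and optimizer-state handling is omitted.

```python
# Illustrative sketch only: names, spread factors, and the meta-update rule
# below are assumptions, not the authors' released implementation.
import torch
import torch.distributed as dist

def hdet_round(model, optimizer, data_iter, loss_fn,
               base_lr, lr_momentum, T=100, spread=1.3, meta_step=0.5):
    """One hypothetical fan-out/converge round of HDET.

    `model` is assumed to be a plain nn.Module replicated on each rank (not
    wrapped in DDP), so gradients are NOT synchronized during the fan-out
    stage. Optimizer state (e.g. Adam moments) is left per-replica here.
    """
    rank, world = dist.get_rank(), dist.get_world_size()

    # Fan-out: symmetric geometric spread of learning rates around base_lr,
    # e.g. for 4 replicas the exponents are [-1.5, -0.5, 0.5, 1.5].
    exponent = rank - (world - 1) / 2.0
    local_lr = base_lr * (spread ** exponent)
    for group in optimizer.param_groups:
        group["lr"] = local_lr

    # Independent training for T steps under the replica-specific rate.
    local_loss = 0.0
    for _ in range(T):
        x, y = next(data_iter)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        local_loss += loss.item()
    local_loss /= T

    # Converge: average parameters across replicas with AllReduce.
    with torch.no_grad():
        for p in model.parameters():
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
            p.data /= world

    # Share each replica's mean loss so every rank sees the same ranking.
    losses = torch.zeros(world, device=next(model.parameters()).device)
    losses[rank] = local_loss
    dist.all_reduce(losses, op=dist.ReduceOp.SUM)

    # Gradient-free meta-update: nudge base_lr (in log space) toward the
    # rate of the lowest-loss replica, smoothed with momentum.
    best = int(torch.argmin(losses))
    best_exponent = best - (world - 1) / 2.0
    lr_momentum = 0.9 * lr_momentum + 0.1 * best_exponent
    base_lr = base_lr * (spread ** (meta_step * lr_momentum))
    return base_lr, lr_momentum
```

In a full run one would call hdet_round repeatedly, feeding the returned base_lr and lr_momentum back in, with the initial base_lr taken from whatever schedule (e.g. OneCycleLR) the training job already uses; the same loop structure would carry over to other scalar hyperparameters such as dropout or weight decay.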