Online Data Selection for Instruction Tuning via Gaussian Processes

2026-06-29 • Machine Learning

Machine Learning • Artificial Intelligence

AI summary

The authors look at how to pick the best training data when teaching large language models, focusing on quality instead of just quantity. Existing methods select data in small groups, which limits their ability to find the best overall data. They introduce GAIA, a new method that uses Gaussian Process regression to estimate the value of all available data globally and adaptively choose the most useful samples. Their approach is designed to stay effective even when data quality changes during training. Tests on multiple datasets show GAIA works better than current leading methods for efficient instruction tuning.
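To make the idea of estimating value across a whole pool concrete, here is a minimal sketch of generic Gaussian Process regression in plain Python. It is not the authors' implementation: the RBF kernel, the noise level, the 2-D "embeddings", and the utility scores are all hypothetical choices made for illustration, and the page does not say which kernel or features GAIA uses.

```python
import math

def rbf(x, y, length_scale=1.0):
    # Squared-exponential kernel (an assumed choice, not taken from the paper).
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq / length_scale ** 2)

def cholesky(a):
    # Lower-triangular factor L with L L^T = a, for a symmetric positive-definite a.
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def solve_cholesky(low, b):
    # Solve (L L^T) x = b by forward then backward substitution.
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

def gp_posterior_mean(train_x, train_y, query_x, noise=1e-2):
    # Zero-mean GP posterior mean: k(q, X) (K + noise * I)^{-1} y.
    n = len(train_x)
    gram = [[rbf(train_x[i], train_x[j]) + (noise if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    alpha = solve_cholesky(cholesky(gram), train_y)
    return [sum(rbf(q, x) * a for x, a in zip(train_x, alpha)) for q in query_x]

# Hypothetical data: samples already scored, and an unscored pool to rank.
scored_x = [(0.0, 0.0), (1.0, 0.0), (3.0, 3.0)]
scored_y = [0.9, 0.8, 0.1]
pool = [(0.5, 0.0), (3.0, 2.5), (0.2, 0.1)]

means = gp_posterior_mean(scored_x, scored_y, pool)
ranked = sorted(range(len(pool)), key=lambda i: means[i], reverse=True)
```

Here `ranked` orders the pool by estimated utility, so pool points near the high-scoring samples come first and the point near the low-scoring sample comes last. A full selection rule would presumably also use the posterior variance, which this sketch omits.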

Large Language Models, Data Selection, Gaussian Process Regression, Instruction Tuning, Adaptive Sampling, Non-stationary Data, Regret Analysis, Hedge Algorithm, Semantic Space

Authors

Jun Wang, Quoc Phong Nguyen, Julien Monteil, Vu Nguyen

Abstract

With Large Language Model (LLM) pre-training and fine-tuning shifting its focus from data volume to data quality, quality data selection has emerged as a critical research topic. Existing online data selection methods for LLM training are typically "batch-constrained", limiting optimization to local utility within random batches. To overcome this, we propose GAIA (Global Adaptive Instruction tuning via GAussian processes), a framework that formulates data valuation as a global estimation process. GAIA employs Gaussian Process regression to model continuous utility manifolds across the semantic space, utilizing an adaptive strategy fusion mechanism to dynamically prioritize high-utility samples. By casting the strategy-posterior update as an instance of the classical fixed-share Hedge framework for tracking the best expert, we inherit a dynamic-regret guarantee that characterizes GAIA's robustness under non-stationary quality scores during training. Empirical evaluations on three datasets demonstrate that GAIA significantly outperforms state-of-the-art baselines like GREATS, establishing our method as a scalable and robust solution for efficient instruction tuning.
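The fixed-share Hedge update mentioned in the abstract can be sketched as follows. This is the textbook uniform-mixing variant, not GAIA's code: the function name `fixed_share_update`, the learning rate `eta`, the share rate `alpha`, the three "strategies", and the synthetic losses are all assumptions for illustration.

```python
import math

def fixed_share_update(weights, losses, eta=1.0, alpha=0.05):
    """One fixed-share Hedge step over a list of strategy weights."""
    # Hedge step: exponentially down-weight strategies that incurred high loss.
    scaled = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(scaled)
    posterior = [s / total for s in scaled]
    # Share step: mix with the uniform distribution so that no strategy's
    # weight collapses to zero, which lets the mixture recover after a switch.
    n = len(weights)
    return [(1.0 - alpha) * p + alpha / n for p in posterior]

# Hypothetical non-stationary run: strategy 0 is best for the first 20 steps,
# then strategy 2 becomes best.
weights = [1.0 / 3, 1.0 / 3, 1.0 / 3]
for step in range(40):
    best = 0 if step < 20 else 2
    losses = [0.0 if i == best else 1.0 for i in range(3)]
    weights = fixed_share_update(weights, losses)
```

After the switch, the weight on strategy 2 recovers quickly because the share step keeps it bounded away from zero. That recovery is what makes the tracking-the-best-expert setting suited to quality scores that drift during training.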
