LiFT: Local Search via Linear Programming for Overfitting-Controlled Transformers

2026-06-15, Machine Learning

Machine Learning, Computation and Language
AI summary

The authors present LiFT, a method for fine-tuning pretrained transformer models such as GPT-2 on specific tasks while controlling overfitting (when a model fits its training data too closely and generalizes poorly to new data). The approach solves a linear program, built from validation gradients and training Hessian information gathered during early warm-up steps, to find a local update direction for both the model parameters and the regularization hyperparameters. This replaces trial-and-error or grid-based hyperparameter tuning and avoids repeated full retraining. In experiments with GPT-2 Small on WikiText-2, LiFT consistently improved test perplexity, with the largest gains where overfitting was most likely. The authors also connect the approach to bilevel optimization, local search, and regularization theory.

Transformer models, Fine-tuning, Overfitting, Linear programming, Bilevel optimization, Regularization, Gradient, Hessian, Local search, GPT-2
Authors
Abhishek Shukla, Anikeit Khanna, Ankur Sinha, Faiz Hamid
Abstract
This paper proposes a Linear Programming (LP)-based local search framework for fine-tuning pretrained transformer models with explicit control against overfitting. The approach formulates transformer fine-tuning as a bilevel optimization-based regularization problem, in which model parameters and regularization hyperparameters are jointly updated. Information collected during initial warm-up iterations, including validation gradients and training Hessian information, is used to construct a local descent direction by solving an LP that minimizes a scaled directional derivative while preserving training optimality. This validation-aware descent direction enables focused local updates of both parameters and regularization hyperparameters, reducing overfitting without requiring repeated full retraining cycles. The resulting method, termed Linear Programming-based Fine-Tuning (LiFT) for transformers, differs from conventional fine-tuning by systematically identifying task-specific updates rather than relying on heuristic or grid-based hyperparameter selection. Experiments on GPT-2 Small fine-tuned on WikiText-2 demonstrate that LiFT enables effective adaptation through selective tuning of transformer blocks and regularization parameters, yielding consistent improvements in test perplexity across multiple layer configurations and regularization settings, with particularly pronounced gains in overfitting-prone scenarios. Beyond empirical performance, LiFT establishes a principled connection between transformer fine-tuning, bilevel optimization, local search, and regularization theory.
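To make the idea of an LP-derived descent direction concrete, the following is a minimal sketch on a toy ridge-regression problem, not the authors' implementation. It assumes one plausible reading of the abstract: the objective is the directional derivative of the validation loss along a joint direction (d_w, d_lam), "preserving training optimality" is modeled as the linearized stationarity condition H d_w + B d_lam = 0, and a box bound on the direction stands in for the scaling. The function name `lp_descent_direction`, the symbols `H`, `B`, `g_w`, `g_lam`, and the `radius` parameter are all hypothetical choices made for illustration.

```python
import numpy as np
from scipy.optimize import linprog


def lp_descent_direction(g_w, g_lam, H, B, radius=1.0):
    """Hypothetical helper: solve the LP

        min   g_w . d_w + g_lam . d_lam       (validation directional derivative)
        s.t.  H d_w + B d_lam = 0             (linearized training stationarity)
              -radius <= d <= radius          (box bound keeps the LP bounded)

    H is the training Hessian w.r.t. the parameters, and B is the derivative
    of the training gradient w.r.t. the regularization hyperparameters.
    """
    n, m = B.shape
    c = np.concatenate([g_w, g_lam])
    A_eq = np.hstack([H, B])
    b_eq = np.zeros(n)
    bounds = [(-radius, radius)] * (n + m)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n], res.x[n:], res.fun


# Toy stand-in for fine-tuning (assumption): ridge regression with
#   training loss    f(w, lam) = 0.5*||X_tr w - y_tr||^2 + 0.5*lam*||w||^2
#   validation loss  F(w)      = 0.5*||X_va w - y_va||^2
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X_tr = rng.normal(size=(20, 5))
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=20)
X_va = rng.normal(size=(20, 5))
y_va = X_va @ w_true + 0.5 * rng.normal(size=20)

lam = 1.0
H = X_tr.T @ X_tr + lam * np.eye(5)      # Hessian of f w.r.t. w
w = np.linalg.solve(H, X_tr.T @ y_tr)    # training optimum for this lam
B = w.reshape(-1, 1)                     # d/dlam of grad_w f equals w
g_w = X_va.T @ (X_va @ w - y_va)         # validation gradient w.r.t. w
g_lam = np.zeros(1)                      # F does not depend on lam directly

d_w, d_lam, slope = lp_descent_direction(g_w, g_lam, H, B)
print("d_lam:", d_lam, "directional derivative:", slope)
```

Because the zero direction is always feasible, the optimal directional derivative is never positive; a strictly negative value indicates a joint move in parameters and regularization strength that lowers validation loss to first order while keeping the training gradient at zero to first order. For a real transformer the Hessian would not be formed explicitly, so this dense formulation should be read only as an illustration of the constraint structure.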