A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL

2026-06-01 · Machine Learning

Machine Learning · Computation and Language
AI summary

The authors study why using reinforcement learning (RL) to improve a large language model (LLM) on one task can hurt its performance on other tasks. They find that RL on each domain makes small, sparse changes that touch largely different neurons, yet the domains still share key computation pathways inside the model, and conflicts arise on those shared pathways. Their theory shows that the damage concentrates in a small shared subspace tied to these conflicts, and that briefly re-training on an earlier task can repair it with little harm to the others. Experiments support this: after training on Code, Math, QA, and creative writing in sequence, a short Math "refresh" raises Math from 57.66 to 66.04 while largely preserving the other domains. They also confirm that the damage is localized by selectively undoing the changes on a small set of conflicted parameters, without any retraining, which partially restores Math.
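The claim that each domain's edits touch largely different neurons is typically quantified as the overlap between the sets of top-changed neurons. The minimal sketch below assumes that a "neuron" is one row of a weight-update matrix, scores it by the L2 norm of that row, and compares the top-k sets of two domains with Jaccard overlap. The row convention, the norm, the value of k, the toy numbers, and the function names are all illustrative assumptions, not the authors' measurement protocol.

```python
import math

def neuron_update_norms(delta_rows):
    """L2 norm of each row of a weight-update matrix (one row per neuron)."""
    return [math.sqrt(sum(x * x for x in row)) for row in delta_rows]

def top_k_neurons(norms, k):
    """Indices of the k neurons whose updates have the largest norm."""
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return set(ranked[:k])

def jaccard(a, b):
    """Jaccard overlap |A & B| / |A | B| of two index sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy per-neuron update rows for two domains (hypothetical numbers, not from the paper).
math_delta = [[0.6, 0.8], [0.1, 0.0], [0.0, 0.8], [0.0, 0.0], [0.05, 0.0], [0.7, 0.0]]
code_delta = [[0.0, 0.0], [0.85, 0.0], [0.0, 0.75], [0.9, 0.0], [0.1, 0.0], [0.02, 0.0]]

math_top = top_k_neurons(neuron_update_norms(math_delta), k=3)
code_top = top_k_neurons(neuron_update_norms(code_delta), k=3)
print(sorted(math_top), sorted(code_top), jaccard(math_top, code_top))  # → [0, 2, 5] [1, 2, 3] 0.2
```

A low Jaccard score only says that the largest edits land on different neurons; it does not rule out interference on neurons that both domains actively use, which is the distinction the paper draws between edit overlap and shared computation routes.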

reinforcement learning, large language models, catastrophic forgetting, gradient conflict, parameter updates, neurons, computation routes, multi-domain training, low-dimensional subspace, retraining refresh
Authors
Lei Yang, Siyu Ding, Deyi Xiong
Abstract
Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering, and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons, while different domains still share substantial active computation routes on which update directions determine whether they act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after Code $\rightarrow$ Math $\rightarrow$ QA $\rightarrow$ CW recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.
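Why interference can survive near-orthogonal gradients can be illustrated with the generic second-order Taylor expansion of an earlier domain A's loss around its post-training parameters $\theta_A$, under a later-domain edit $\Delta$:

$$L_A(\theta_A + \Delta) - L_A(\theta_A) \approx g_A^\top \Delta + \tfrac{1}{2}\,\Delta^\top H_A \Delta,$$

where $g_A$ and $H_A$ are the gradient and Hessian of $L_A$ at $\theta_A$. When $g_A^\top \Delta \approx 0$, the first-order term vanishes but the quadratic term need not; if $H_A$ has large curvature only on a few coordinates that $\Delta$ also moves, the damage lives on that small shared set. This is a textbook expansion offered for intuition, not the paper's theorem or its exact assumptions. The pure-Python toy below additionally assumes a diagonal Hessian and an exactly quadratic loss with $g_A = 0$ at $\theta_A$. It computes the damage term, runs a few gradient steps on domain A as a stand-in for a refresh, and zeroes the edit on the top-k damage coordinates as a stand-in for a training-free rollback. All function names and numbers are hypothetical.

```python
def damage(delta, h_diag):
    """Second-order damage 0.5 * delta^T H delta, assuming a diagonal Hessian H."""
    return 0.5 * sum(h * d * d for h, d in zip(h_diag, delta))

def refresh(delta, h_diag, lr, steps):
    """Gradient descent on the quadratic domain-A loss, starting from the drifted point.

    The gradient of 0.5 * d^T H d is H d, so each step shrinks coordinate i by a
    factor (1 - lr * h_i); coordinates with h_i == 0 are left untouched.
    Returns the remaining deviation from domain A's optimum.
    """
    d = list(delta)
    for _ in range(steps):
        d = [di - lr * h * di for di, h in zip(d, h_diag)]
    return d

def conflict_coordinates(delta, h_diag, k):
    """Top-k coordinates ranked by their damage contribution 0.5 * h_i * d_i^2."""
    scores = [0.5 * h * d * d for h, d in zip(h_diag, delta)]
    return set(sorted(range(len(delta)), key=lambda i: scores[i], reverse=True)[:k])

def rollback(delta, coords):
    """Training-free rollback: undo the edit only on the chosen coordinates."""
    return [0.0 if i in coords else d for i, d in enumerate(delta)]

# Domain A has curvature only on coordinates 0 and 1 (the toy shared conflict subspace).
h_diag = [4.0, 2.0, 0.0, 0.0, 0.0, 0.0]
# Later-domain edit; coordinates 2-5 carry its gains and do not affect domain A's loss.
delta = [0.5, -0.5, 1.0, 1.0, -1.0, 0.5]

print(damage(delta, h_diag))  # → 0.75
refreshed = refresh(delta, h_diag, lr=0.1, steps=10)
print(damage(refreshed, h_diag), refreshed[2:])  # damage ~0.0029; coordinates 2-5 unchanged
coords = conflict_coordinates(delta, h_diag, k=2)
print(sorted(coords), damage(rollback(delta, coords), h_diag))  # → [0, 1] 0.0
```

In this toy, the refresh contracts the deviation only where domain A has curvature, so the later domain's edits on coordinates 2-5 survive intact. In a real model the Hessian is neither diagonal nor known exactly, so any conflict coordinate set is an estimate rather than the exact subspace.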