Building Multi-Task Agentic LLMs via Two-Phase Distillation

2026-06-29 • Machine Learning

AI summary

The authors explore how to create AI models that can handle many tasks by first training expert models on each task separately. They find that combining these experts into one model using off-policy distillation leads to problems because the model tries to cover too many different behaviors at once. Using on-policy distillation helps but needs a good starting point to work well. To solve this, the authors suggest a two-step method: first use off-policy distillation and then refine the model with on-policy tuning. Their tests show this combined approach works better than using either method alone and matches the performance of individual expert models.

Keywords: artificial general intelligence, reinforcement learning, multi-task learning, off-policy distillation, on-policy distillation, forward KL divergence, behavioral modes, model distillation, conversational agents, text-based games

Authors

Huaijie Wang, Shusheng Xu, Yi Wu, Kaifeng Lyu

Abstract

A key step toward artificial general intelligence is to train models that can perform multiple tasks. In this paper, we study how to build such models by first training separate RL experts for individual tasks and then consolidating them via distillation, as an alternative to directly training a single model on mixed tasks. We show that off-policy distillation degrades in multi-task settings due to the mode-covering nature of forward KL: aggregating data from multiple tasks introduces a large number of behavioral modes that can exceed the student's capacity, forcing it to average across behaviors and leading to degraded performance. In contrast, on-policy distillation is mode-seeking but requires strong initialization. Inspired by these observations, we propose a two-phase approach: off-policy distillation followed by on-policy refinement. Evaluation across conversational agents and text-based games confirms that this two-phase approach matches single-task RL expert performance for each individual task, whereas off-policy or on-policy distillation alone fails to match this performance.
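
The contrast between mode-covering and mode-seeking objectives can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's experimental setup. The "teacher" is a hypothetical even mix of two single-mode experts on a 1-D grid, and the "student" is a single discretized Gaussian of fixed width, so it can represent only one mode. Off-policy distillation is modeled as minimizing forward KL(teacher || student). On-policy distillation is modeled as minimizing reverse KL(student || teacher), which is a common formulation but an assumption here. All names (`normal_pmf`, `best_student_mean`, and so on) are illustrative.

```python
import math


def normal_pmf(grid, mean, std):
    """Discretized Gaussian, normalized to sum to 1 over the grid."""
    weights = [math.exp(-0.5 * ((x - mean) / std) ** 2) for x in grid]
    total = sum(weights)
    return [w / total for w in weights]


def kl(p, q):
    """KL(p || q) for two discrete distributions on the same grid."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# 1-D grid standing in for the space of behaviors (assumption for illustration).
GRID = [-6.0 + 0.05 * i for i in range(241)]

# Toy multi-task teacher: an even mix of two experts with well-separated modes.
TEACHER = [
    0.5 * a + 0.5 * b
    for a, b in zip(normal_pmf(GRID, -3.0, 0.5), normal_pmf(GRID, 3.0, 0.5))
]


def best_student_mean(direction):
    """Grid-search the student's mean under forward or reverse KL.

    The student has a fixed width of 0.5, so it is capacity-limited:
    it cannot place mass on both teacher modes at once.
    """
    candidates = [-4.0 + 0.1 * i for i in range(81)]

    def loss(mu):
        student = normal_pmf(GRID, mu, 0.5)
        if direction == "forward":
            return kl(TEACHER, student)  # mode-covering
        return kl(student, TEACHER)      # mode-seeking

    return min(candidates, key=loss)


if __name__ == "__main__":
    # Forward KL should settle between the two modes (near 0), where the
    # teacher has almost no mass; reverse KL should settle on one mode
    # (near -3 or +3).
    print("forward KL mean:", best_student_mean("forward"))
    print("reverse KL mean:", best_student_mean("reverse"))
```

In this toy setting the forward-KL student averages across the two behaviors, while the reverse-KL student commits to a single one. That is the intuition behind using off-policy distillation only as an initialization and then refining on-policy.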
