Self-Distilled Agentic Reinforcement Learning

2026-05-14 • Machine Learning

Machine Learning • Artificial Intelligence • Computation and Language

AI summary

The authors explain that reinforcement learning (RL) helps large language model (LLM) agents improve on long, multi-step tasks but provides only coarse, trajectory-level feedback. They build on On-Policy Self-Distillation (OPSD), which gives more detailed token-level guidance but becomes unstable and can give misleading advice when applied to multi-turn agents. Their new method, SDAR, keeps RL as the main objective and adds OPSD as a gated auxiliary objective, filtering the token-level guidance to avoid instability and bad advice. Tests on ALFWorld, WebShop, and Search-QA show that SDAR outperforms previous methods and is more stable across model sizes.

Reinforcement Learning • Large Language Models • On-Policy Self-Distillation • Token-level Guidance • Multi-turn Interaction • Model Stability • Auxiliary Objective • Privileged Context • ALFWorld • GRPO

Authors

Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Abstract

Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, since negative teacher rejections may arise from imperfect skill retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
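
The gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the token-level signal, the gate's scale, or the distillation loss, so the code assumes the signal is the gap between teacher and student log-probabilities on each sampled token, and the names `gate_weights`, `combined_loss`, `beta`, and `lam` are hypothetical. In a real training loop the gate would be computed on detached values so that no gradient flows through it.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate_weights(teacher_logp, student_logp, beta=1.0):
    """Map a per-token gap to a soft gate in (0, 1).

    Assumption: gap_t = teacher_logp_t - student_logp_t. A positive gap
    (teacher endorses the token) pushes the gate toward 1; a negative gap
    (teacher rejects it) pushes the gate toward 0 without zeroing it out.
    `beta` is a hypothetical sharpness parameter.
    """
    return [sigmoid(beta * (t - s)) for t, s in zip(teacher_logp, student_logp)]


def combined_loss(rl_loss, distill_per_token, gates, lam=0.1):
    """RL loss plus a gated auxiliary distillation term.

    Assumption: `distill_per_token` holds some per-token distillation loss
    (its exact form is not given in the abstract) and `lam` is a hypothetical
    weight that keeps RL as the dominant objective.
    """
    aux = sum(g * d for g, d in zip(gates, distill_per_token)) / len(gates)
    return rl_loss + lam * aux


# Illustrative values only: token 0 is endorsed by the teacher, token 1 is rejected.
teacher_logp = [-0.2, -3.0]
student_logp = [-1.5, -0.5]
gates = gate_weights(teacher_logp, student_logp, beta=2.0)
total = combined_loss(rl_loss=1.0, distill_per_token=[0.8, 0.8], gates=gates)
```

With these values the first gate exceeds 0.5 and the second falls below it, so the endorsed token contributes more to the auxiliary term while the rejected token is down-weighted rather than discarded.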
