When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

2026-06-29 · Machine Learning

AI summary

The authors study online imitation learning (IL), in which a model learns to imitate an expert while acting on its own. They question the common belief that the main benefit of online IL comes from correcting errors that accumulate during generation. Instead, they show that whether online IL helps depends on whether the model can exactly represent the expert's behavior (realizability). When it can, offline learning already matches the expert; when it cannot, offline learning hits a fundamental limit, while online IL can still perform well even though the model's behavior differs substantially from the expert's.

Imitation Learning, Online Learning, Offline Learning, On-Policy Distillation, Realizability, Model Misspecification, Distributional Mismatch, Error Accumulation, Information-Theoretic Bottleneck, Large Language Models
Authors
Huaqing Zhang, Jingchu Gai, Juno Kim, Bingbin Liu, Andrej Risteski
Abstract
Online imitation learning (IL), particularly on-policy distillation, has emerged as a strong LLM post-training approach, often outperforming offline supervised fine-tuning (SFT). Yet a principled understanding of when and why online interaction helps is still lacking. In this work, we challenge the view that error accumulation is the main source of online IL's advantage, and instead show that the benefits of online interaction depend critically on whether the setting is realizable, i.e., whether the student policy class can represent the expert policy. Under realizability, we empirically find that offline IL already matches expert performance. In contrast, in non-realizable (misspecified) settings, we prove that offline IL encounters an information-theoretic bottleneck even when horizon $H=1$, and propose a structural characterization of misspecification relative to the reward, under which online IL provably achieves high performance despite a large distributional mismatch between the expert and student policies.
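
To make the $H=1$ non-realizable case concrete, here is a minimal toy sketch. It is not the paper's construction or proof. It assumes that offline IL can be approximated by infinite-data maximum likelihood on expert samples (minimizing forward KL), and that online IL can be approximated by minimizing reverse KL on the student's own samples, as is common in on-policy distillation. The three-action problem, the reward vector, and the names `student_class`, `kl`, and `expected_reward` are all hypothetical.

```python
import math

# Hypothetical one-step (H = 1) problem with three actions.
reward = [1.0, 0.0, 1.0]      # actions 0 and 2 are good, action 1 is bad
expert = [0.49, 0.02, 0.49]   # bimodal expert policy

# Misspecified (non-realizable) student class: no candidate equals the expert.
student_class = [
    [0.80, 0.15, 0.05],  # concentrates on action 0
    [0.25, 0.50, 0.25],  # spreads mass, peaks on the bad action
    [0.05, 0.15, 0.80],  # concentrates on action 2
]

def kl(p, q):
    """KL(p || q) in nats for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_reward(policy):
    """Expected one-step reward of a policy."""
    return sum(pi * ri for pi, ri in zip(policy, reward))

# Assumed offline IL proxy: MLE on expert data, i.e. min KL(expert || student).
offline = min(student_class, key=lambda q: kl(expert, q))

# Assumed online IL proxy: expectation taken under the student's own samples,
# i.e. min KL(student || expert).
online = min(student_class, key=lambda q: kl(q, expert))

print(f"expert reward:  {expected_reward(expert):.2f}")
print(f"offline reward: {expected_reward(offline):.2f}")
print(f"online reward:  {expected_reward(online):.2f}")
```

In this toy setting the forward-KL student covers both expert modes by placing most of its mass on the bad action between them, while the reverse-KL student commits to one good mode. The reverse-KL student earns more reward even though its distribution remains far from the expert's, which mirrors the qualitative claim above only under the stated assumptions.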