Diffusion Alignment Beyond KL: Variance Minimisation as Effective Policy Optimiser
2026-02-12 • Machine Learning
AI summary
The authors study how to adapt diffusion models, which generate data step by step, so that they produce samples better aligned with a given reward. They cast this process as Sequential Monte Carlo, a standard statistical method in which importance weights guide the sampling. Instead of the usual objective based on KL divergence, they propose minimising the variance of these weights. They prove that this variance objective is still minimised by the correct reward-tilted distribution and that, when samples are drawn from the current model (on-policy), its gradient matches that of the usual KL-based approach. This view unifies several existing methods and suggests new directions for training diffusion models.
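For readers who prefer code, the following is a minimal sketch of the core idea, not the paper's implementation: given per-sample log-densities of the current sampler and the pretrained model plus a scalar reward (the names log_p_theta, log_p_pre, reward and the temperature beta are illustrative assumptions), the loss is simply the batch variance of the log importance weights.

```python
# Minimal sketch (illustrative, not the authors' implementation): a
# variance-of-log-importance-weights loss for steering a sampler p_theta
# towards a reward-tilted target proportional to p_pre(x) * exp(r(x) / beta).
import torch


def variance_of_log_weights(log_p_theta: torch.Tensor,
                            log_p_pre: torch.Tensor,
                            reward: torch.Tensor,
                            beta: float = 1.0) -> torch.Tensor:
    """Variance of the log importance weights over a batch of on-policy samples."""
    # Unnormalised log importance weight: log[ p_pre(x) * exp(r(x)/beta) / p_theta(x) ].
    log_w = log_p_pre + reward / beta - log_p_theta
    # The target's unknown normalising constant only shifts every log_w by the same
    # amount, so the variance is unaffected by it; the loss is zero exactly when
    # p_theta is proportional to the reward-tilted target on the sampled support.
    return log_w.var(unbiased=False)


# Toy usage with random stand-ins for the log-densities and rewards.
log_p_theta = torch.randn(64, requires_grad=True)
log_p_pre = torch.randn(64)
reward = torch.rand(64)
loss = variance_of_log_weights(log_p_theta, log_p_pre, reward, beta=0.5)
loss.backward()
```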
Diffusion models, Sequential Monte Carlo, Importance weights, Variance minimisation, Reward-tilted distribution, Policy optimisation, Kullback-Leibler divergence, On-policy sampling, Denoising trajectory, Probability distributions
Authors
Zijing Ou, Jacob Si, Junyi Zhu, Ondrej Bohdal, Mete Ozay, Taha Ceritli, Yingzhen Li
Abstract
Diffusion alignment adapts pretrained diffusion models to sample from reward-tilted distributions along the denoising trajectory. This process naturally admits a Sequential Monte Carlo (SMC) interpretation, where the denoising model acts as a proposal and reward guidance induces importance weights. Motivated by this view, we introduce Variance Minimisation Policy Optimisation (VMPO), which formulates diffusion alignment as minimising the variance of log importance weights rather than directly optimising a Kullback-Leibler (KL) based objective. We prove that the variance objective is minimised by the reward-tilted target distribution and that, under on-policy sampling, its gradient coincides with that of standard KL-based alignment. This perspective offers a common lens for understanding diffusion alignment. Under different choices of potential functions and variance minimisation strategies, VMPO recovers various existing methods, while also suggesting new design directions beyond KL.
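In symbols, a schematic sketch of the objective described above (the paper's per-step potential functions and exact weight definitions are not reproduced here, so $p_{\mathrm{pre}}$, $r$, and $\beta$ are illustrative assumptions):

\[
\pi^{*}(x) \;\propto\; p_{\mathrm{pre}}(x)\,\exp\!\big(r(x)/\beta\big),
\qquad
w_{\theta}(x) \;=\; \frac{p_{\mathrm{pre}}(x)\,\exp\!\big(r(x)/\beta\big)}{p_{\theta}(x)},
\]
\[
\mathcal{L}_{\mathrm{VMPO}}(\theta) \;=\; \operatorname{Var}_{x \sim q}\!\big[\log w_{\theta}(x)\big].
\]

The variance is invariant to the target's unknown normalising constant and reaches zero when $p_{\theta}$ matches the reward-tilted target on the support of $q$; taking $q = p_{\theta}$ corresponds to the on-policy setting in which, as stated above, the gradient coincides with that of standard KL-based alignment.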