Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior

2026-06-01 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors explain that even though generative models create realistic images, they often produce similar results repeatedly, a problem called mode collapse. They found that the usual way of starting the model with random noise doesn’t consider the important landscape of guidance, which causes the model to stick to limited outputs. Their method, DivIn, picks smarter starting noise by exploring this landscape using a technique called Langevin dynamics, which helps create more varied and accurate images. This approach works with different types of models and can be combined with other methods to get even better diversity without losing quality.

generative models, mode collapse, Gaussian initialization, guidance potential, Langevin dynamics, diffusion models, flow matching, inference-time, diversity-quality tradeoff
Authors
Xiang Li, Dianbo Liu, Kenji Kawaguchi
Abstract
Despite their remarkable fidelity, generative models frequently suffer from mode collapse. Existing strategies for enhancing diversity predominantly intervene during the generation trajectory. We identify a critical oversight: the standard Gaussian initialization is agnostic to the guidance potential landscape, which often causes trajectories to collapse into dominant modes. In this work, we formulate initial-noise selection as sampling from a guidance potential posterior, which effectively re-weights the prior towards diversity-rich regions. To sample from this distribution efficiently, we introduce Diversity-inducing Initialization (DivIn), which leverages Langevin dynamics to actively navigate the initialization landscape, steering the initial noise away from collapsing regions while anchoring it to the valid data manifold. Our method serves as an inference-time diversity enhancement compatible with both diffusion and flow matching models. Extensive experiments show that DivIn achieves superior performance in both class-to-image and text-to-image scenarios. Furthermore, because DivIn is orthogonal to trajectory-based methods, combining them significantly expands the diversity-quality Pareto frontier beyond what either achieves in isolation.
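To make the idea of sampling initial noise with Langevin dynamics concrete, the following is a minimal sketch and not the authors' implementation. It assumes the posterior has the generic form p(z) ∝ N(z; 0, I) · exp(−U(z)) and runs unadjusted Langevin dynamics on it. The function `langevin_init`, the quadratic stand-in potential `toy_grad_potential`, and the constants `LAM` and `CENTER` are all hypothetical; the abstract does not specify DivIn's actual guidance potential or update rule. The quadratic potential is chosen only because its posterior mean is known in closed form, which makes the sampler easy to check.

```python
import numpy as np


def langevin_init(grad_potential, n_samples, dim, n_steps=500, step_size=0.05, seed=0):
    """Unadjusted Langevin sampling from p(z) ∝ N(z; 0, I) * exp(-U(z)).

    grad_potential(z) must return the gradient of U at each row of z.
    Hypothetical helper for illustration; not the DivIn algorithm itself.
    """
    rng = np.random.default_rng(seed)
    # Start from the standard Gaussian prior, as a conventional sampler would.
    z = rng.standard_normal((n_samples, dim))
    for _ in range(n_steps):
        # Score of the posterior: the prior term (-z) plus the potential term (-grad U).
        score = -z - grad_potential(z)
        noise = rng.standard_normal(z.shape)
        # Langevin update: drift along the score, plus injected Gaussian noise.
        z = z + step_size * score + np.sqrt(2.0 * step_size) * noise
    return z


# Stand-in potential (an assumption): U(z) = 0.5 * LAM * ||z - CENTER||^2.
LAM = 1.0
CENTER = np.array([2.0, -1.0])


def toy_grad_potential(z):
    return LAM * (z - CENTER)


samples = langevin_init(toy_grad_potential, n_samples=2000, dim=2)

# For this quadratic potential the posterior is Gaussian with this mean.
expected_mean = LAM * CENTER / (1.0 + LAM)
print("empirical mean:", samples.mean(axis=0))
print("expected mean: ", expected_mean)
```

In a real pipeline, `grad_potential` would presumably be derived from the model's guidance signal, and the returned `samples` would replace the usual Gaussian draws as the starting noise for a diffusion or flow matching sampler. The prior term in the score is what keeps the samples close to the Gaussian the model was trained on.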