Don't Settle at the Mode! Mitigating Diversity Collapse in Pretrained Flow Models via Feature Self-Guidance

2026-06-25 · Computer Vision and Pattern Recognition

AI summary

The authors found that state-of-the-art flow-based image generation models often produce very similar images when asked for multiple samples under the same conditioning, a failure called diversity collapse. To fix this, they have the model guide itself: its internal features are dispersed across the batch during generation, with no extra training and no external reward models. A manifold regularization step then projects the dispersed features back onto the data manifold, so the images stay aligned with the input conditions. The method plugs into pretrained flow models at only a marginal inference cost and improves diversity while preserving fidelity.

flow models, diversity collapse, latent guidance, self-guidance, data manifold, manifold regularization, conditional image generation, inference-time, feature dispersion
Authors
Pradhaan S Bhat, Rishubh Parihar, Abhijnya Bhat, R. Venkatesh Babu
Abstract
State-of-the-art flow models generate stunning images from text or image prompts. However, they suffer from diversity collapse when generating multiple samples under the same conditioning. Existing methods address this issue via either latent guidance, which has limited effectiveness, or sample selection, which relies on external reward models that incur significant inference-time overhead. In this work, we introduce an efficient, training-free self-guidance mechanism to mitigate diversity collapse without requiring additional reward models. Specifically, we disperse the internal features of the flow model during batch generation with feature self-guidance. Further, to keep the features close to the manifold, we introduce a manifold regularization step that projects these dispersed features back onto the data manifold, ensuring diverse generation without sacrificing alignment with the input conditions. Our method integrates seamlessly as a plug-and-play module into pretrained flow models, adding only a marginal inference cost. Experiments demonstrate significant improvements in diversity while preserving fidelity across several conditional flow models, including multi-step and few-step text-to-image, depth-to-image, and reference image generation.
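The abstract does not give the guidance objective or the projection operator, so the following is only a minimal sketch of the general "disperse, then project back" idea under stated assumptions, not the authors' implementation. It assumes batch features are plain vectors, uses one gradient-ascent step on the mean pairwise squared distance as a stand-in dispersion objective, and approximates the data manifold by a hypothetical affine subspace. The names `disperse_features`, `project_to_subspace`, `mean_pairwise_distance`, `step_size`, `origin`, and `basis` are all illustrative.

```python
import numpy as np


def mean_pairwise_distance(feats):
    """Average Euclidean distance over all ordered pairs i != j in the batch."""
    b = feats.shape[0]
    diffs = feats[:, None, :] - feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return dists.sum() / (b * (b - 1))


def disperse_features(feats, step_size=0.1):
    """One gradient-ascent step on the mean pairwise squared distance.

    Assumed stand-in objective: J = mean_{i != j} ||f_i - f_j||^2.
    Its gradient w.r.t. f_i is 4 * (f_i - batch_mean) / (B - 1), so the
    step pushes every feature away from the batch mean.
    """
    b = feats.shape[0]
    grad = 4.0 * (feats - feats.mean(axis=0, keepdims=True)) / (b - 1)
    return feats + step_size * grad


def project_to_subspace(feats, origin, basis):
    """Orthogonal projection onto the affine subspace origin + span(basis).

    Hypothetical linear proxy for a data manifold; `basis` has shape (D, k)
    with orthonormal columns.
    """
    centered = feats - origin
    return origin + centered @ basis @ basis.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    basis, _ = np.linalg.qr(rng.normal(size=(8, 2)))  # orthonormal (8, 2)
    origin = rng.normal(size=8)
    # Four near-subspace features standing in for one batch of samples.
    feats = origin + rng.normal(size=(4, 2)) @ basis.T
    feats = feats + 0.05 * rng.normal(size=(4, 8))

    spread = disperse_features(feats, step_size=0.1)
    regularized = project_to_subspace(spread, origin, basis)
    print("spread before:", mean_pairwise_distance(feats))
    print("spread after: ", mean_pairwise_distance(regularized))
```

In this toy setting the projection removes whatever part of the dispersion step leaves the subspace while keeping the in-subspace spread. A real flow model would apply such a step to intermediate network features at each sampling step, and would need a nonlinear notion of the manifold.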