PG-MAP: Joint MAP Optimization for Inference-Time Alignment of Diffusion and Flow-Matching Models

2026-06-22 • Machine Learning

Machine Learning, Computer Vision and Pattern Recognition

AI summary

The authors propose PG-MAP, a way to improve how pretrained text-to-image models generate images by adjusting the prompt conditioning and the model’s internal latent state together, rather than separately. This joint adjustment helps the model follow prompts more closely and produce higher-quality images without extra training. The method applies to both diffusion and flow-matching models and improves on existing techniques, as measured by automatic preference scores and human evaluations. The authors also find that the best balance between adjusting the conditioning and adjusting the latent state varies with the prompt, which suggests room for further improvement.

text-to-image models, inference-time alignment, classifier-free guidance, diffusion models, flow-matching models, latent variables, conditioning, Gibbs-MAP optimization, reward-guided generation, prompt optimization

Authors

Ruolan Sun, Pawel Polak

Abstract

Inference-time alignment of pretrained text-to-image models is typically performed along a single control axis, such as classifier-free guidance, attention editing, or reward-based latent perturbations. This limitation prevents modeling joint dependencies between conditioning and latent variables and hinders transfer across generative transports. We propose PG-MAP, a training-free framework that formulates inference-time alignment as a trajectory-level Gibbs-MAP / proximal energy optimization over the conditioning $c$ and latent state $z_t$ via a forward-consistency coupling, optionally guided by a frozen preference reward. This joint formulation enables coordinated updates across modalities while remaining compatible with both diffusion and flow-matching models through transport-specific adaptations. Across diffusion backbones (SD 1.5, SDXL), PG-MAP consistently improves alignment metrics such as PickScore and Aesthetic, and can be effectively combined with tuned classifier-free guidance to achieve the strongest overall performance. On flow-matching models (SD3.5-medium), the framework reduces to a latent-only variant, achieving 91.9% PickScore and 75.7% HPS win rates against a static baseline, with controlled experiments ruling out noise-related artifacts. Human evaluations further confirm consistent preference over strong baselines, including tuned CFG and compute-matched universal guidance. Finally, an oracle-routing analysis shows that the relative importance of conditioning and latent optimization depends on prompt types, surfacing further headroom that a per-prompt selector could exploit.
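To give a feel for what a joint MAP update over a conditioning variable and a latent variable can look like, here is a minimal toy sketch. It is not the PG-MAP algorithm: the scalar variables, the quadratic energy, the linear coupling `z ≈ a * c`, the quadratic stand-in for a frozen reward, and every name and default value (`energy`, `joint_map`, `lam`, `mu_c`, `mu_z`, `z_star`) are assumptions made purely for illustration. The sketch alternates exact minimization over `c` and `z` of an energy made of a coupling term, proximal terms that keep each variable near its starting point, and a negative reward.

```python
# Toy illustration only: scalar "conditioning" c and "latent" z.
# All names, the energy form, and the defaults are hypothetical.

def energy(c, z, *, a, lam, mu_c, mu_z, c0, z0, z_star):
    """Quadratic toy energy: coupling + proximal terms - reward."""
    coupling = 0.5 * lam * (z - a * c) ** 2            # ties z to c
    prox = 0.5 * mu_c * (c - c0) ** 2 + 0.5 * mu_z * (z - z0) ** 2
    neg_reward = 0.5 * (z - z_star) ** 2               # stand-in for a frozen reward
    return coupling + prox + neg_reward


def joint_map(c0, z0, z_star, a=1.0, lam=1.0, mu_c=0.5, mu_z=0.5, steps=50):
    """Alternate exact coordinate-wise minimization over c and z."""
    params = dict(a=a, lam=lam, mu_c=mu_c, mu_z=mu_z, c0=c0, z0=z0, z_star=z_star)
    c, z = c0, z0
    energies = [energy(c, z, **params)]
    for _ in range(steps):
        # Minimize over c with z fixed (set dE/dc = 0).
        c = (lam * a * z + mu_c * c0) / (lam * a * a + mu_c)
        # Minimize over z with c fixed (set dE/dz = 0).
        z = (lam * a * c + mu_z * z0 + z_star) / (lam + mu_z + 1.0)
        energies.append(energy(c, z, **params))
    return c, z, energies


if __name__ == "__main__":
    c, z, energies = joint_map(c0=0.0, z0=0.0, z_star=2.0)
    print(f"c={c:.4f}, z={z:.4f}, energy {energies[0]:.4f} -> {energies[-1]:.4f}")
```

Because each step minimizes the energy exactly in one variable while holding the other fixed, the recorded energy never increases. In this toy setting, updating `z` alone with `c` frozen would correspond loosely to a latent-only variant, while the alternating loop moves both variables in a coordinated way.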
