AI summary
The authors study a type of image generator called Visual Autoregressive (VAR) models, which struggle to produce images larger than the fixed resolution they were trained at. They identify three characteristic problems when generating larger images: globally repeated patterns, locally repeated patterns, and loss of fine detail. Each problem arises because each step of the coarse-to-fine generation process relies on a different band of positional frequencies, and extrapolating to a larger resolution disrupts the band that drives a given step. To fix this, the authors introduce a training-free method that remaps these frequency bands with a rule specific to each step, plus an adaptive way to re-sharpen attention, which otherwise spreads out as image size grows. Their approach produces higher-quality large images than earlier methods.
Keywords
Visual Autoregressive (VAR) models, image synthesis, resolution extrapolation, RoPE frequency bands, autoregressive generation, attention mechanisms, entropy normalization, calibration, coarse-to-fine generation, training-free extrapolation
Authors
Feihong Yan, Shaoyu Liu, Haixuan Wang, Shuai Lu, Linfeng Zhang, Huiqi Li, Xiangyang Ji
Abstract
Visual Autoregressive (VAR) models have emerged as a strong alternative to diffusion for image synthesis, yet their fixed training resolution prevents direct generation at higher resolutions. Naively transferring training-free extrapolation methods from LLMs or diffusion models to VAR yields three characteristic failure modes: global repetition, local repetition, and detail degradation. We trace them to a unified band-stage mismatch: VAR generates images in a coarse-to-fine, scale-wise process where each stage is driven by a distinct dominant RoPE frequency band, and each failure mode emerges when the dominant band of a particular stage is disrupted. Building on this insight, we propose Stage-Aware RoPE Remapping, a training-free strategy that assigns each frequency band a stage-specific remapping rule, jointly suppressing all three failure modes. We further observe that attention becomes systematically dispersed as the image resolution increases. Existing methods typically depend on predefined attention scaling factors, which are neither adaptive to the target resolution nor capable of faithfully capturing the actual extent of attention dispersion. We therefore propose Entropy-Driven Adaptive Attention Calibration, which quantifies dispersion via a resolution-invariant normalized entropy and yields a closed-form per-head scaling factor that realigns the extrapolated-resolution attention entropy with its training-resolution counterpart. Extensive experiments show that our method consistently outperforms prior resolution-extrapolation methods in both structural coherence and fine-detail fidelity. Our code is available at https://github.com/feihongyan1/ExtraVAR.
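To make the two components concrete, here is a minimal PyTorch sketch of a stage-aware RoPE remapping in the spirit of the abstract. The function names (`rope_inv_freq`, `stage_aware_remap`) and the specific rule, preserving the band assumed to dominate the current stage while applying position interpolation elsewhere, are illustrative assumptions, not the paper's exact remapping.

```python
import torch

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies, one per rotary pair,
    # ordered from highest (index 0) to lowest frequency.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))

def stage_aware_remap(inv_freq: torch.Tensor, stage: int, num_stages: int,
                      extrapolation_ratio: float) -> torch.Tensor:
    # Hypothetical rule (an assumption, not the paper's formula): partition the
    # frequency axis into one contiguous band per stage, keep the band assumed
    # to drive the current stage unchanged, and position-interpolate the rest.
    n = inv_freq.numel()
    band = torch.arange(n) * num_stages // n       # band index of each frequency
    dominant = num_stages - 1 - stage              # coarse stages -> low-frequency bands
    remapped = inv_freq / extrapolation_ratio      # interpolate by default
    keep = band == dominant
    remapped[keep] = inv_freq[keep]                # leave the dominant band intact
    return remapped

# Example: remap frequencies for stage 3 of 10 when doubling the resolution.
freqs = stage_aware_remap(rope_inv_freq(64), stage=3, num_stages=10,
                          extrapolation_ratio=2.0)
```

The entropy-driven calibration can be sketched similarly. The paper derives a closed-form per-head factor; the proxy below, gamma_h = (measured normalized entropy) / (training-resolution normalized entropy), is a hedged stand-in that at least moves each head in the right direction, sharpening heads whose attention has dispersed.

```python
import torch

def normalized_entropy(attn: torch.Tensor) -> torch.Tensor:
    # Normalized attention entropy per head, averaged over queries.
    # Dividing by log(num_keys) makes the measure resolution-invariant:
    # 1.0 = uniform attention, 0.0 = one-hot attention.
    ent = -(attn * (attn + 1e-12).log()).sum(dim=-1).mean(dim=-1)   # (heads,)
    return ent / torch.log(torch.tensor(float(attn.shape[-1])))

def calibrated_attention(q, k, v, target_norm_entropy):
    # q, k, v: (heads, seq, head_dim); target_norm_entropy: (heads,) values
    # measured once at the training resolution.
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    gamma = normalized_entropy(logits.softmax(dim=-1)) / target_norm_entropy
    # gamma > 1 sharpens a head whose attention is more dispersed than at
    # training resolution (illustrative proxy, not the paper's closed form).
    return (logits * gamma[:, None, None]).softmax(dim=-1) @ v

# Example with random tensors standing in for an extrapolated-resolution pass.
q = k = v = torch.randn(8, 1024, 64)
out = calibrated_attention(q, k, v, target_norm_entropy=torch.full((8,), 0.6))
```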