Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution

2026-05-25Computer Vision and Pattern Recognition

Computer Vision and Pattern RecognitionArtificial IntelligenceMachine Learning
AI summary

The authors created a new model called SKILD that can both make new images from scratch and improve blurry images in one system. They use a special method that treats image details across different scales, applying noise in a way that matches the image's frequency patterns. This approach means the model doesn't need to be changed or trained again for different image sizes or tasks. The authors showed their model works well on standard datasets and even recovers important patterns in physical systems like the Ising model.

diffusion modelscale invariancesuper-resolutionimage generationGaussian noisefrequency spectrumCIFAR-10ImageNetIsing modelperceptual metrics
Authors
Zixin Jessie Chen, Zhuo Chen, Archer Wang, Jeff Gore, William T. Freeman, Congyue Deng, Marin Soljačić
Abstract
Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical differences, both can be understood as reversing information loss across scales. We introduce $\textbf{SKILD}$, a $\textbf{S}$cale-invariant $\textbf{K}$-Space $\textbf{I}$mage $\textbf{L}$earning $\textbf{D}$iffusion model that unifies generation and continuous super-resolution within a single unconditional framework. Both natural images and critical physical systems exhibit scale invariance, and we leverage it to design a forward process that attenuates image content from fine to coarse scales while injecting spectrum-matched Gaussian noise, making scale an explicit coordinate of the diffusion dynamics. The same trained reverse process performs generation and continuous super-resolution by varying only the starting timestep: $\textit{no task-specific architecture, no conditioning branch, no classifier-free guidance, no retraining per scale factor}$. Empirically, SKILD reaches FID $2.65$ and Inception Score $9.63$ on unconditional CIFAR-10, performs $2\times$--$8\times$ super-resolution on ImageNet from a single unconditional checkpoint while outperforming conditional models across perceptual metrics, and reconstructs critical Ising models whose connected four-point correlations closely track the ground truth.