LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution

2026-06-08 • Computer Vision and Pattern Recognition

AI summary

The authors tackle the computational cost of adapting large pre-trained video generators to Video Super-Resolution (VSR) in new domains, using flow matching to keep the adaptation lightweight. They propose LiteVSR, which pairs a completely frozen Diffusion Transformer with a small State-Aware Adapter to restore low-quality videos. The adapter extracts static structural cues from the low-quality input and dynamic cues from intermediate denoising states, so that its guidance shifts from structural alignment to texture refinement as generation proceeds. The approach reports competitive restoration quality with 11.25% trainable parameters and 12 GPU-hours of training on a single A100.

Video Super-Resolution, Diffusion Transformer, Flow Matching, ControlNet, Low-Quality to High-Quality Mapping, State-Aware Adapter, Cross-Attention, Denoising, Fine-Tuning, Generative Models

Authors

Yu Cao, Ziquan Liu, Zhensong Zhang, Jiankang Deng, Shaogang Gong, Jifei Song

Abstract

Adapting large-scale pre-trained video generators for Video Super-Resolution (VSR) in novel domains remains computationally prohibitive. Methods that reformulate generation as a direct Low-Quality (LQ) to High-Quality (HQ) mapping deviate from the original generative formulation and therefore demand extensive fine-tuning. ControlNet-style adapters lose their efficiency under modern Diffusion Transformers because the absence of an encoder-decoder hierarchy forces duplication of the entire backbone. We observe that flow matching offers a principled alternative for cross-domain VSR adaptation. Because the model predicts a constant velocity field across all timesteps, the adaptation task reduces to learning a fixed injection pattern rather than time-varying transformations. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that extracts static structural cues from the LQ input and dynamic cues from intermediate denoising states, and aligns them through time-dependent cross-attention, enabling an adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves competitive restoration quality with only 11.25% trainable parameters and 12 GPU-hours of training on a single A100, while remaining compatible with fast sampling, down to a single step.
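
The constant-velocity property mentioned in the abstract can be illustrated with a minimal sketch. It is not the LiteVSR implementation: it assumes the common straight-path form of flow matching, in which a noise sample x0 (at t = 0) is linearly interpolated to a clean sample x1 (at t = 1), and it replaces the trained network with a hypothetical oracle that returns the exact target velocity. All function names are illustrative.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Time derivative of the straight path: x1 - x0, independent of t."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(4)]  # toy noise sample
x1 = [0.5, -1.0, 2.0, 0.25]                   # toy "clean" target


def oracle(x, t):
    # Assumption: stands in for a perfectly trained velocity network.
    return target_velocity(x0, x1)


one_step = euler_sample(x0, oracle, num_steps=1)
ten_step = euler_sample(x0, oracle, num_steps=10)
```

Under these assumptions the target velocity does not depend on t, so a single Euler step lands on the same endpoint as ten steps. This is the intuition behind single-step sampling compatibility; a real network only approximates the oracle, so more steps may still help in practice.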
