PARE: Pruning and Adaptive Routing for Efficient Video Generation

2026-05-26 • Computer Vision and Pattern Recognition

AI summary

The authors introduce PARE, a method that reduces the compute cost of video generation with Video Diffusion Transformers. It prunes attention heads according to whether they handle spatial or temporal information, and it trains a lightweight router that decides which blocks to run based on the input content and the denoising timestep. Experiments on Wan2.1-14B show that PARE substantially reduces per-step computation while preserving quality across VBench dimensions, and that it composes with step distillation for further acceleration.

Video Diffusion Transformers, model pruning, adaptive routing, attention heads, spatial and temporal roles, denoising timestep, knowledge distillation, compute efficiency, text-to-video generation

Authors

Yutong Wang, Yunke Wang, Tianfan Xue, Yu Qiao, Yaohui Wang, Xinyuan Chen, Chang Xu

Abstract

Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. Recent methods reduce cost by compressing width, depth, or sampling steps, but typically commit to a fixed architecture that cannot adapt to individual inputs or denoising stages. We propose PARE (Pruning and Adaptive Routing for Efficient video generation), which jointly compresses width and depth with structure-aware pruning and input-adaptive routing. For width, we observe that attention heads specialize into spatial and temporal roles, and design importance scoring that accounts for this distinction to prevent motion-critical temporal heads from being pruned prematurely. For depth, we train a lightweight router conditioned on denoising timestep and visual content to dynamically select which blocks to execute at each step, enabling per-input compute adaptation rather than static block removal. A progressive pipeline first recovers width-pruned quality via distillation, then jointly optimizes the student and router to decouple the two learning objectives. Experiments on Wan2.1-14B for both image-to-video and text-to-video generation show that PARE substantially reduces per-step computation while preserving quality across VBench dimensions, and composes with step distillation for further acceleration.
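
The two mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring function or the router architecture, so everything below is an assumption made for illustration: the per-role keep quota, the linear sigmoid gate, the hard threshold, and all function names (`prune_heads_by_role`, `route_blocks`, `run_blocks`) are hypothetical and are not taken from the paper.

```python
import math

def prune_heads_by_role(scores, roles, keep_ratio):
    """Rank heads within each role group and keep the top fraction of each.

    Assumption: a per-role quota stands in for the paper's role-aware
    importance scoring, which the abstract does not specify.
    """
    keep = []
    for role in sorted(set(roles)):
        idx = [i for i, r in enumerate(roles) if r == role]
        n_keep = max(1, math.ceil(keep_ratio * len(idx)))
        idx.sort(key=lambda i: scores[i], reverse=True)
        keep.extend(idx[:n_keep])
    return sorted(keep)

def route_blocks(weights, bias, t, content_feat, threshold=0.5):
    """Return one execute/skip decision per block.

    Assumption: each block has a linear gate over [t] + content_feat,
    followed by a sigmoid and a hard threshold.
    """
    features = [t] + list(content_feat)
    mask = []
    for w_row, b in zip(weights, bias):
        logit = sum(w * f for w, f in zip(w_row, features)) + b
        prob = 1.0 / (1.0 + math.exp(-logit))
        mask.append(prob > threshold)
    return mask

def run_blocks(x, blocks, mask):
    """Apply each block whose gate is on; a skipped block acts as identity."""
    for block, execute in zip(blocks, mask):
        if execute:
            x = block(x)
    return x

# Toy usage: four spatial heads with high raw scores, two temporal heads with low ones.
head_scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
head_roles = ["spatial"] * 4 + ["temporal"] * 2
kept_heads = prune_heads_by_role(head_scores, head_roles, keep_ratio=0.5)

# Toy router over three blocks; features are [timestep, one pooled content value].
gate_w = [[4.0, 0.0], [-4.0, 0.0], [0.0, 4.0]]
gate_b = [-2.0, 2.0, 0.0]
late_mask = route_blocks(gate_w, gate_b, t=1.0, content_feat=[-1.0])
early_mask = route_blocks(gate_w, gate_b, t=0.0, content_feat=[1.0])
```

In this toy setup a global top-3 ranking would discard both temporal heads, whereas the per-role quota retains one of them, which is the failure mode the abstract describes avoiding. The hard threshold only illustrates inference-time behavior; how the router is trained jointly with the student is not described in the abstract and is not modeled here.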