Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion

2026-05-04

Computer Vision and Pattern Recognition
AI summary

The authors developed ActDiff-VC, a video compression method that performs well at very low bitrates by using a diffusion model as a generative prior. Instead of transmitting every frame, the system sends keyframes only when necessary and tracks a small set of important moving points to help reconstruct the missing frames. The authors introduce mechanisms to pick these keyframes and points efficiently, preserving visual quality while using very little data. Experiments show that, under extreme compression, the method substantially reduces bitrate and improves perceptual quality compared with strong existing codecs.

diffusion models, video compression, keyframes, bitrate, conditional generation, temporal dynamics, point trajectories, perceptual quality, rate-distortion trade-off, generative models
Authors
Amirhosein Javadi, Shirin Saeedi Bidokhti, Tara Javidi
Abstract
Diffusion models provide a powerful generative prior for perceptual reconstruction at ultra-low bitrates, but effective video compression requires controlling the generative process using highly compact conditioning signals. In this work, we present ActDiff-VC, a diffusion-based video compression framework for the ultra-low-bitrate regime. Our method partitions videos into variable-length segments, transmits keyframes only when needed, and summarizes temporal dynamics using a compact set of tracked point trajectories. Conditioned on these sparse signals, a conditional diffusion decoder synthesizes the remaining frames, enabling perceptually realistic reconstruction under severe rate constraints. To support this design, we introduce two mechanisms: content-adaptive keyframe selection and budget-aware sparse trajectory selection, which together enable compact yet effective conditioning for generative reconstruction. Experiments on the UVG and MCL-JCV benchmarks show that ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE, improves KID by up to 64.6% and FID by up to 37.7% at comparable bitrates against strong learned codecs, and delivers favorable perceptual rate-distortion trade-offs relative to learned and diffusion-based baselines in the ultra-low-bitrate regime.
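The two selection mechanisms named in the abstract can be illustrated with a minimal sketch. The code below is a hypothetical toy version, not the paper's actual algorithm: the drift metric, threshold, and greedy motion-based ranking are illustrative assumptions standing in for the content-adaptive and budget-aware criteria the authors describe, and frames are flattened lists of pixel values for simplicity.

```python
# Hypothetical sketch of the two conditioning-selection mechanisms.
# All function names, metrics, and thresholds are illustrative.

def select_keyframes(frames, threshold=0.5):
    """Content-adaptive keyframe selection (toy version): transmit a
    new keyframe whenever the current frame drifts too far from the
    last transmitted keyframe, yielding variable-length segments."""
    def drift(a, b):
        # Mean absolute pixel difference as a stand-in for a
        # perceptual drift metric.
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

    keyframe_idx = [0]          # the first frame is always a keyframe
    last = frames[0]
    for i, frame in enumerate(frames[1:], start=1):
        if drift(frame, last) > threshold:
            keyframe_idx.append(i)
            last = frame
    return keyframe_idx

def select_trajectories(trajectories, budget):
    """Budget-aware sparse trajectory selection (toy version): greedily
    keep the point trajectories with the largest total motion, up to a
    fixed rate budget. Each trajectory is a sequence of 1-D positions."""
    def motion(traj):
        return sum(abs(traj[t + 1] - traj[t]) for t in range(len(traj) - 1))

    ranked = sorted(range(len(trajectories)),
                    key=lambda i: motion(trajectories[i]), reverse=True)
    return sorted(ranked[:budget])

# Example: a scene cut between frames 1 and 2 triggers a keyframe,
# and only the two most dynamic of three trajectories fit the budget.
print(select_keyframes([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 1.0]]))
print(select_trajectories([[0, 0, 0], [0, 1, 2], [0, 5, 10]], budget=2))
```

In the full system, the indices returned by these selectors would determine which keyframes are coded and which trajectories are transmitted as the sparse conditioning signal for the diffusion decoder.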