DRFusion: Drift-Resilient Temporally Consistent Infrared-Visible Video Fusion
2026-05-25 • Computer Vision and Pattern Recognition
AI summary
The authors address how to fuse infrared and visible videos so that the result stays stable over time, without flicker or ghosting. Optical-flow-based methods impose rigid geometric alignment, while diffusion-based fusion models process each frame independently and, when run autoregressively, let small artifacts accumulate into drift. The authors instead treat fusion as motion generation conditioned on the video's history, and use Stabilized History Guidance and Soft Temporal Anchoring to keep the output temporally consistent without strict alignment. A two-stage training strategy with latent refinement adapts structure and motion separately, which the authors report improves both fusion quality and temporal stability.
Infrared and visible video fusion, Temporal consistency, Optical flow, Diffusion models, Autoregressive modeling, Spectral filtering, Motion generation, Latent refinement, Two-stage training
Authors
Xingyuan Li, Haoyuan Xu, Shulin Li, Xiang Chen, Zhiying Jiang, Jinyuan Liu
Abstract
Infrared and visible video fusion is essential for achieving comprehensive perception in dynamic scenes. However, maintaining temporal consistency remains a formidable challenge. Conventional methods relying on optical flow often suffer from geometric rigidity and ghosting artifacts. Moreover, standard diffusion-based fusion models typically operate in a frame-by-frame manner; when extended to autoregressive settings, they lack intrinsic temporal constraints and are prone to severe error accumulation and drifting, where minor artifacts amplify over time. To address these limitations, we propose a drift-resilient video fusion method that reformulates the task as history-conditioned motion generation. We introduce Stabilized History Guidance and Soft Temporal Anchoring to reframe temporal consistency as spectral filtering, implicitly aggregating motion dynamics without rigid alignment. Furthermore, our Decoupled Structure-Motion Adaptation strategy bridges pre-trained priors and structural constraints via two-stage training and latent refinement. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both fusion quality and temporal stability.
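The abstract frames temporal consistency as spectral filtering. As a generic illustration of that framing only, and not a reconstruction of Stabilized History Guidance or Soft Temporal Anchoring, the Python sketch below applies a first-order recursive low-pass filter (an exponential moving average over past frames) to a synthetic per-pixel intensity signal. It shows that aggregating history this way attenuates frame-rate flicker while passing slow scene changes. The function names (`ema_filter`, `ema_gain`, `mean_abs_frame_diff`), the synthetic signal, and the value `alpha = 0.3` are all assumptions introduced for this example.

```python
import math


def ema_filter(signal, alpha):
    """Recursive low-pass filter: y[t] = alpha * x[t] + (1 - alpha) * y[t-1]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    out = []
    prev = signal[0]  # initialise the history with the first frame
    for x in signal:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out


def ema_gain(alpha, omega):
    """Magnitude response of the filter at angular frequency omega (rad/frame).

    Transfer function: H(e^{jw}) = alpha / (1 - (1 - alpha) * e^{-jw}).
    """
    re = 1.0 - (1.0 - alpha) * math.cos(omega)
    im = (1.0 - alpha) * math.sin(omega)
    return alpha / math.hypot(re, im)


def mean_abs_frame_diff(signal):
    """Simple flicker proxy: mean absolute change between consecutive frames."""
    diffs = [abs(b - a) for a, b in zip(signal, signal[1:])]
    return sum(diffs) / len(diffs)


# Hypothetical intensity of one pixel over 64 frames:
# a slow scene change plus flicker that alternates sign every frame.
frames = 64
slow = [0.5 + 0.2 * math.sin(2 * math.pi * t / frames) for t in range(frames)]
flicker = [0.05 if t % 2 == 0 else -0.05 for t in range(frames)]
raw = [s + f for s, f in zip(slow, flicker)]

smoothed = ema_filter(raw, alpha=0.3)

print("flicker proxy, raw:     ", mean_abs_frame_diff(raw))
print("flicker proxy, smoothed:", mean_abs_frame_diff(smoothed))
print("gain at DC:             ", ema_gain(0.3, 0.0))
print("gain at Nyquist:        ", ema_gain(0.3, math.pi))
```

The filter has unit gain at zero frequency and a gain of `alpha / (2 - alpha)` at the highest representable temporal frequency, so a smaller `alpha` suppresses flicker more strongly at the cost of slower response to genuine motion. A learned, history-conditioned model would have to manage the same trade-off without a fixed hand-set constant.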