HarmoVid: Relightful Video Portrait Harmonization

2026-05-27 · Computer Vision and Pattern Recognition

AI summary

The authors developed a method that makes a foreground video look naturally lit for a different background by adjusting shadows, color tone, and illumination intensity consistently over time. Because recording the same motion under different lighting conditions is practically infeasible, they create paired training data by applying image-based harmonization frame by frame and then removing the resulting flicker with a lighting deflickering model. A video diffusion model is trained on this deflickered data together with real and synthetic videos, and an asymmetric alpha mask conditioning technique helps it learn clean foreground boundaries. Experiments show more temporally stable and realistic results than prior image-based and video-based harmonization methods.

video harmonization, lighting adjustment, shadow rendering, temporal coherence, video diffusion model, deflickering, alpha mask conditioning, relighting, foreground-background blending
Authors
Jun Myeong Choi, Jae Shin Yoon, Luchao Qi, Roni Sengupta, Joon-Young Lee
Abstract
We present a method for harmonizing the lighting of a foreground video to match a target background scene by adjusting shadows, color tone, and illumination intensity (relightful harmonization). Unlike for images, acquiring labeled data for videos, in which identical motions are recorded under different lighting conditions, is practically infeasible and does not scale. One way to create such paired data is to apply existing image-based harmonization models to a video frame by frame, but the resulting outputs often suffer from significant temporal jitter. We overcome this problem by introducing a novel lighting deflickering model that stabilizes both global and local lighting flicker. Our video diffusion model learns from this deflickered data, together with a volume of real and synthetic videos, to generate high-quality video harmonization results. We further propose an asymmetric alpha mask conditioning technique to learn clean boundaries from real videos. Experiments demonstrate that, compared to prior image-based and video-based harmonization methods, our model achieves strong temporal coherence, naturalness, cleaner boundaries, and physically meaningful lighting behavior while maintaining strong relighting expressiveness.
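
The temporal jitter that frame-by-frame harmonization introduces can be illustrated with a small sketch. The abstract does not describe the internals of the deflickering model, so the code below is not the authors' method: it is a classical global-gain baseline that smooths per-frame mean brightness with a moving average, and it would not address local flicker at all. The function names, the window radius, and the sample brightness values are all hypothetical.

```python
from statistics import fmean


def flicker_score(brightness):
    """Mean absolute frame-to-frame change in brightness (0.0 for a constant sequence)."""
    if len(brightness) < 2:
        return 0.0
    return fmean(abs(b - a) for a, b in zip(brightness, brightness[1:]))


def moving_average(values, radius=2):
    """Centered moving average; the window is truncated at the sequence ends."""
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - radius), min(len(values), i + radius + 1)
        out.append(fmean(values[lo:hi]))
    return out


def deflicker_gains(brightness, radius=2, eps=1e-6):
    """Per-frame gain that maps each frame's brightness onto the smoothed curve."""
    target = moving_average(brightness, radius)
    return [t / max(b, eps) for t, b in zip(target, brightness)]


# Hypothetical per-frame mean foreground luminance after frame-by-frame harmonization.
raw = [0.50, 0.58, 0.47, 0.56, 0.49, 0.57, 0.48]
gains = deflicker_gains(raw)
# Applying each gain to its frame scales that frame's mean brightness accordingly.
stabilized = [b * g for b, g in zip(raw, gains)]
print(flicker_score(raw), flicker_score(stabilized))
```

A single gain per frame can only correct global brightness changes; flicker confined to a region, such as a shadow that appears and disappears, needs a spatially varying correction, which is the kind of artifact a learned deflickering model is meant to handle.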