Stabilizing Streaming Video Geometry via Dynamic Feature Normalization

2026-05-25 · Computer Vision and Pattern Recognition

AI summary

The authors found that models that estimate depth and 3D geometry from single images often give inconsistent results on video, mainly because the mean and variance of the models' internal features fluctuate from frame to frame, which changes the scale and shift of the predicted depth. To fix this, they created a small add-on called Dynamic Feature Normalization (DyFN), a causal recurrent module that adjusts these feature statistics as the video progresses to keep depth predictions stable. They fine-tune only this module (about 2% additional parameters) while keeping the main model frozen, which improves temporal consistency without losing accuracy on single images. Across four benchmarks, DyFN reduces flickering and jitter, improves temporal stability by up to 14% over prior streaming methods, and outperforms heavier video models that are not causal.

3D geometry estimation, monocular depth estimation, temporal consistency, feature normalization, scale-shift drift, recurrent module, streaming input, latent features, fine-tuning, autonomous driving
Authors
Xiaoyang Lyu, Muxin Liu, Xiaoshan Wu, Ruicheng Wang, Yi-Hua Huang, Yang-Tian Sun, Shaoshuai Shi, Xiaojuan Qi
Abstract
Consistent 3D geometry estimation from streaming RGB input is crucial for real-world applications such as autonomous driving, embodied AI, and large-scale reconstruction. While modern monocular geometry foundation models achieve strong single-image accuracy, they exhibit severe temporal inconsistency on continuous input, notably dominated by scale-shift drifting. Through targeted empirical analysis, we trace this instability to its root cause: fluctuations in latent feature statistics, whose mean and variance directly determine the predicted depth's scale and shift. Building on this insight, we introduce Dynamic Feature Normalization (DyFN), a lightweight, causal recurrent module that dynamically and robustly modulates feature statistics to maintain stable geometry over time. We adapt powerful pretrained monocular geometry models for streaming by finetuning only DyFN, a mere 2% additional parameters, while keeping the backbone frozen, thereby achieving temporal consistency without compromising single-image accuracy. Extensive experiments across four benchmarks show that DyFN effectively eliminates temporal artifacts such as disjointed layering and positional jitter, and achieves state-of-the-art temporal stability, improving over prior streaming methods by up to 14% and even outperforming heavier non-causal video baselines. Project Page: https://shawlyu.github.io/DyFN
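
To make the idea of causally stabilizing feature statistics concrete, here is a minimal illustrative sketch in Python. It is not the authors' DyFN: the abstract describes a learned recurrent module, whereas this sketch swaps in a fixed exponential moving average (EMA) over per-frame mean and variance. The class name `CausalStatSmoother`, the `momentum` parameter, and the toy feature vectors are all hypothetical and chosen only for illustration.

```python
from statistics import fmean, pvariance


class CausalStatSmoother:
    """Hypothetical stand-in for a recurrent normalizer.

    Keeps an EMA of per-frame feature mean and variance (past frames only,
    so it is causal) and re-normalizes each frame to the smoothed statistics.
    """

    def __init__(self, momentum=0.9, eps=1e-6):
        self.momentum = momentum  # assumed hyperparameter, not from the paper
        self.eps = eps
        self.mean = None  # running mean of feature means
        self.var = None   # running mean of feature variances

    def step(self, features):
        mu = fmean(features)
        var = pvariance(features, mu)
        if self.mean is None:
            # First frame initializes the running statistics.
            self.mean, self.var = mu, var
        else:
            m = self.momentum
            self.mean = m * self.mean + (1 - m) * mu
            self.var = m * self.var + (1 - m) * var
        in_std = (var + self.eps) ** 0.5
        out_std = (self.var + self.eps) ** 0.5
        # Standardize with this frame's stats, then restore the smoothed stats.
        return [(x - mu) / in_std * out_std + self.mean for x in features]


# Toy stream: identical content, but each frame has a different affine
# perturbation (gain, offset), mimicking drifting feature statistics.
base = [0.0, 1.0, 2.0, 3.0, 4.0]
perturbations = [(1.0, 0.0), (1.3, 0.5), (0.8, -0.4), (1.2, 0.3)]
frames = [[a * x + b for x in base] for a, b in perturbations]

smoother = CausalStatSmoother(momentum=0.9)
smoothed_frames = [smoother.step(f) for f in frames]

raw_means = [fmean(f) for f in frames]
smoothed_means = [fmean(f) for f in smoothed_frames]
raw_spread = max(raw_means) - min(raw_means)
smoothed_spread = max(smoothed_means) - min(smoothed_means)
print(f"mean spread: raw={raw_spread:.3f}, smoothed={smoothed_spread:.3f}")
```

In this toy stream the frame-to-frame spread of the feature mean shrinks after smoothing, which is the kind of effect the abstract attributes to stabilizing feature statistics. A fixed EMA cannot adapt to genuine scene changes the way a learned module could, so this should be read only as an intuition aid.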