Knowledge-Preserved Model Tuning in Null-Space for Robust Spatio-Temporal Video Grounding
2026-06-02 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition
AI summaryⓘ
The authors focus on improving video understanding when videos are of mixed quality, like some being clear and others blurry. They propose a new method called Null-Space Tuning (NST) that cleverly adjusts input features without messing up what the original model already knows. NST uses mathematical properties to fix low-quality videos while keeping high-quality ones unchanged. Their experiments show this approach works better than previous methods on videos with varying quality.
Spatio-Temporal Video GroundingLow-Quality VideoNull-SpacePre-trained ModelsResidual LearningLoRA (Low-Rank Adaptation)Feature InjectionModel AdaptationQuality-Adaptive UnitDual-Space Reparameterization
Authors
Haoxuan Chen, Xianqin Liu, Jian-Fang Hu
Abstract
Spatio-Temporal Video Grounding aims to localize object tubes based on textual queries. While recent methods have achieved remarkable success, they mainly focus on high-quality(HQ) inputs, neglecting the widespread presence of low-quality(LQ) videos in real-world scenarios. Although tuning methods like LoRA can adapt to degraded inputs, they inevitably disrupt pre-trained knowledge. To address this, we propose Null-Space Tuning (NST). This framework exploits the geometric property that adding vectors within the null-space of frozen weights to the layer input does not affect the output. Leveraging this, NST injects learnable residuals into input features that can be selectively invisible to the pre-trained backbone. Specifically, NST combines the Quality-Adaptive Unit and Dual-Space Reparameterization to synthesize these residuals by confining components for HQ inputs to the null-space, while directing restoration components for LQ inputs to the non-null space. As the frozen weights eliminate null-space components, we effectively rectify degraded inputs while preserving pre-trained knowledge for HQ inputs. Extensive experiments show that NST outperforms state-of-the-art methods on our Mixed-Quality benchmark.