Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation
2026-05-11 • Computer Vision and Pattern Recognition
AI summary
The authors tackle the problem of finding pictures when search queries come in different styles, such as sketches or low-quality images, which typical models struggle with. Their method, Hystar, dynamically adapts parts of the model to fit each query's style. They also introduce a new training objective, StyleNCE, that helps the model tell apart confusable styles. Tests show the approach outperforms existing methods while adding fewer extra parameters.
Keywords
Query-based image retrieval, Vision-language representation models, CLIP, Hypernetwork, Singular-value perturbation, Attention layers, Contrastive loss, Optimal transport, Cross-style retrieval, Parameter efficiency
Authors
Yujia Cai, Boxuan Li, Chenghao Xu, Jiexi Yan
Abstract
Query-based image retrieval (QBIR) requires retrieving relevant images given diverse and often stylistically heterogeneous queries, such as sketches, artworks, or low-resolution previews. While large-scale vision-language representation models (VLRMs) like CLIP offer strong zero-shot retrieval performance, they struggle with distribution shifts caused by unseen query styles. In this paper, we propose Hypernetwork-driven Style-adaptive Retrieval (Hystar), a lightweight framework that dynamically adapts model weights to each query's style. Hystar employs a hypernetwork to generate singular-value perturbations ($\Delta S$) for attention layers, enabling flexible per-input adaptation, while static singular-value offsets on MLP layers ensure cross-style stability. To better handle semantic confusion across styles, we design StyleNCE, an optimal-transport-weighted contrastive loss within Hystar that emphasizes hard cross-style negatives. Extensive experiments on multi-style retrieval and cross-style classification benchmarks demonstrate that Hystar consistently outperforms strong baselines, achieving state-of-the-art performance while remaining parameter-efficient and stable across styles.
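The abstract describes the adaptation mechanism only at a high level, so the following is a minimal PyTorch sketch of how hypernetwork-driven singular-value modulation could look: the pretrained weight of an attention projection is factored once as W = U diag(S) Vᵀ, a small hypernetwork maps a style embedding to a dynamic perturbation ΔS for attention layers, and MLP layers instead carry a static learnable offset. The class names, the hypernetwork architecture, and all dimensions are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVDModulatedLinear(nn.Module):
    """Attention projection reparameterized as W = U diag(S + dS) V^T.

    Hypothetical sketch: the SVD factors of the pretrained weight are frozen,
    and a small hypernetwork predicts a per-query perturbation dS from a
    style embedding. Names and sizes are assumptions, not the paper's code.
    """

    def __init__(self, weight: torch.Tensor, style_dim: int, hidden: int = 64):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)    # frozen left singular vectors
        self.register_buffer("S", S)    # frozen singular values
        self.register_buffer("Vh", Vh)  # frozen right singular vectors
        # Hypernetwork: style embedding -> singular-value perturbation dS.
        self.hyper = nn.Sequential(
            nn.Linear(style_dim, hidden), nn.ReLU(), nn.Linear(hidden, S.numel())
        )

    def forward(self, x: torch.Tensor, style_emb: torch.Tensor) -> torch.Tensor:
        # style_emb: (style_dim,) embedding of the current query's style.
        dS = self.hyper(style_emb)                       # dynamic perturbation ΔS
        W = self.U @ torch.diag(self.S + dS) @ self.Vh   # per-input adapted weight
        return F.linear(x, W)

class SVDOffsetLinear(nn.Module):
    """MLP-layer variant with a *static* learnable offset on S (no hypernetwork),
    mirroring the abstract's cross-style stability argument."""

    def __init__(self, weight: torch.Tensor):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        self.dS = nn.Parameter(torch.zeros_like(S))      # static, style-agnostic offset

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = self.U @ torch.diag(self.S + self.dS) @ self.Vh
        return F.linear(x, W)
```

Because only a ΔS vector (one scalar per singular direction) is generated or learned per layer, the adapter adds parameters on the order of the weight's rank, which is consistent with the abstract's parameter-efficiency claim.

StyleNCE is specified only as an optimal-transport-weighted contrastive loss that emphasizes hard cross-style negatives. One plausible instantiation is sketched below, assuming entropic OT with uniform marginals over in-batch similarities; the Sinkhorn routine, the cost choice, and the weight normalization are all our assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost: torch.Tensor, eps: float = 0.05, n_iters: int = 20) -> torch.Tensor:
    """Entropic OT plan with uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)
    a = torch.full((n,), 1.0 / n, device=cost.device)
    b = torch.full((m,), 1.0 / m, device=cost.device)
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)

def style_nce(q: torch.Tensor, g: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """OT-weighted InfoNCE: hard cross-style negatives receive larger weights.

    q: (B, D) query embeddings (styled inputs); g: (B, D) gallery embeddings;
    row i of q and g form a positive pair. A sketch under our assumptions.
    """
    q, g = F.normalize(q, dim=-1), F.normalize(g, dim=-1)
    sim = q @ g.t()                                      # (B, B) cosine similarities
    B = sim.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=sim.device)
    with torch.no_grad():
        # Cost is low where similarity is high, so the OT plan concentrates
        # mass on confusable (hard) cross-style negatives.
        plan = sinkhorn_plan(1.0 - sim)
        w = plan.masked_fill(eye, 0.0)                   # drop the positive pairs
        w = w * (B - 1) / w.sum(dim=1, keepdim=True)     # mean negative weight = 1
    exp = (sim / tau).exp()
    pos = exp.diagonal()
    neg = (w * exp).sum(dim=1)                           # OT-weighted negatives
    return -(pos / (pos + neg)).log().mean()
```

In this reading, the transport plan concentrates mass on negatives that are most similar to each query, so precisely the confusable cross-style pairs dominate the contrastive denominator; the paper's actual cost and marginal choices may differ.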