TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

2026-06-05, Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
AI summary

The authors address a mismatch in vision-language models such as CLIP: images contain more detail than their captions describe, so image and text embeddings are poorly aligned. They introduce TEVI, a method that uses the caption to decide which parts of an image embedding to keep. In experiments with synthetic and natural captions, they show that TEVI preserves the attributes a caption describes and discards the rest, which improves retrieval and robustness. The gains are larger when captions are detailed.

Vision-language models, CLIP, image embeddings, text embeddings, autoencoders, image-caption alignment, information imbalance, image retrieval, captioning datasets, robustness
Authors
Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller, Bernt Schiele
Abstract
Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.
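
The pipeline described in the abstract (disentangle the image embedding with a sparse autoencoder, then reconstruct it through a caption-dependent mask) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The ReLU encoder, linear decoder, sigmoid mask, toy dimensions, and all names (`sae_encode`, `sae_decode`, `caption_mask`, `edit_image_embedding`, `W_mask`) are assumptions made for illustration, and the weights are random stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 32  # toy sizes: embedding dimension, number of SAE latents (assumed)

# Random stand-ins for trained parameters (hypothetical, not from the paper).
W_enc = rng.normal(size=(k, d)) / np.sqrt(d)
b_enc = np.zeros(k)
W_dec = rng.normal(size=(d, k)) / np.sqrt(k)
b_dec = np.zeros(d)
W_mask = rng.normal(size=(k, d)) / np.sqrt(d)


def sae_encode(x):
    """Map an embedding to non-negative latents (assumed ReLU encoder)."""
    return np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)


def sae_decode(z):
    """Linearly reconstruct an embedding from SAE latents."""
    return W_dec @ z + b_dec


def caption_mask(t):
    """Per-latent weights in (0, 1) computed from the caption embedding.

    A single linear layer plus sigmoid is an assumed stand-in for the
    masking module.
    """
    return 1.0 / (1.0 + np.exp(-(W_mask @ t)))


def edit_image_embedding(x, t):
    """Reconstruct the image embedding from caption-weighted latents."""
    z = sae_encode(x)
    m = caption_mask(t)
    return sae_decode(m * z), m, z


image_emb = rng.normal(size=d)  # placeholder for a CLIP image embedding
text_emb = rng.normal(size=d)   # placeholder for a CLIP text embedding
edited, mask, latents = edit_image_embedding(image_emb, text_emb)
```

In this sketch the mask is soft, so a latent is down-weighted rather than removed outright. Whether the masking module produces soft or hard masks, and how it is trained, is not stated in the abstract.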