Beyond Compression: Quantifying Spectral Accessibility in Vision Representations

2026-06-02 · Computer Vision and Pattern Recognition

AI summary

The authors studied how vision-language models change visual information when mapping images into an embedding space shared with text. They measured how well the band-limited frequency content of an image can be linearly recovered from the representation at different layers, and introduced a metric called Residual Spectral Loss that compares against a dimension-matched random projection to control for the effect of dimensionality reduction alone. With pretrained, frozen CLIP and DINOv2 models on ImageNet and MS-COCO, the accessibility of frequency information rises and then falls with depth, peaking in the middle layers. The final transformation behaves differently by architecture: CLIP's learned projection is spectrally neutral, with changes explained by compression, while DINOv2's [CLS] pooling causes a structured loss across the spectrum. Overall, the authors identify intermediate layers and pooling mechanisms as the main drivers of spectral change in vision encoders.

vision-language models, embedding space, spatial-frequency, Fourier transform, linear recoverability, Residual Spectral Loss, dimensionality reduction, CLIP, DINOv2, representation learning
Authors
Akayou A. Kitessa, Yijun Zhao
Abstract
Vision-language models map visual features into a shared embedding space through learned projection layers, yet it remains unclear how these transformations alter the structure of visual information. This study examines changes in representation through spatial-frequency accessibility, measured by the linear recoverability of band-limited Fourier energy from model representations. To isolate effects beyond dimensionality reduction, we introduce Residual Spectral Loss (RSL), which evaluates changes relative to a dimension-matched random projection baseline. To reduce confounding effects from optimization, the analysis uses pretrained models with all parameters frozen. The experimental results show consistent frequency-dependent changes in accessibility across CLIP and DINOv2 on ImageNet and MS-COCO datasets. Spectral accessibility follows a non-monotonic trajectory across depth, peaking at intermediate layers before decreasing toward the output representation. The final transformation differs across architectures: CLIP's learned projection is spectrally neutral, with changes explained by compression, whereas DINOv2's [CLS] pooling induces a structured loss across the spectrum. These findings identify intermediate layers and pooling mechanisms as primary drivers of spectral transformation in modern vision encoders.
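To make the measurement idea concrete, here is a minimal NumPy sketch of the kind of pipeline the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact definition of RSL, the band layout, or the probe. The function names (`band_energies`, `probe_r2`, `residual_spectral_loss`), the equal-width radial bands, the log-energy targets, the ridge probe, and the definition of RSL as "baseline R² minus observed R²" are all hypothetical choices made for this example.

```python
import numpy as np

def band_energies(images, n_bands=4):
    """Log Fourier energy per radial frequency band for grayscale images (N, H, W).

    Assumption: equal-width radial bands over [0, 0.5] cycles/pixel;
    the corner frequencies beyond 0.5 are ignored.
    """
    n, h, w = images.shape
    spec = np.abs(np.fft.fftshift(np.fft.fft2(images), axes=(-2, -1))) ** 2
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    edges = np.linspace(0.0, 0.5, n_bands + 1)
    out = np.empty((n, n_bands))
    for b in range(n_bands):
        # The last band includes its upper edge so radius == 0.5 is counted.
        upper = radius <= edges[b + 1] if b == n_bands - 1 else radius < edges[b + 1]
        mask = (radius >= edges[b]) & upper
        out[:, b] = np.log1p(spec[:, mask].sum(axis=1))
    return out

def probe_r2(feats, targets, ridge=1e-3, train_frac=0.5):
    """Held-out R^2 per band of a ridge-regularized linear probe (assumed probe)."""
    n = feats.shape[0]
    split = int(n * train_frac)
    mu = feats[:split].mean(axis=0)
    sd = feats[:split].std(axis=0) + 1e-8
    x = np.hstack([(feats - mu) / sd, np.ones((n, 1))])  # standardize, add bias
    xtr, xte = x[:split], x[split:]
    ytr, yte = targets[:split], targets[split:]
    w = np.linalg.solve(xtr.T @ xtr + ridge * np.eye(x.shape[1]), xtr.T @ ytr)
    resid = ((yte - xte @ w) ** 2).sum(axis=0)
    total = ((yte - yte.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - resid / total

def residual_spectral_loss(feats_in, feats_out, targets, seed=0):
    """Assumed RSL: recoverability after a dimension-matched random projection
    of the input features, minus recoverability from the actual output features.
    Positive values mean more loss than compression alone would explain.
    """
    rng = np.random.default_rng(seed)
    d_in, d_out = feats_in.shape[1], feats_out.shape[1]
    proj = rng.standard_normal((d_in, d_out)) / np.sqrt(d_out)
    baseline = probe_r2(feats_in @ proj, targets)
    observed = probe_r2(feats_out, targets)
    return baseline - observed
```

In use, `feats_in` and `feats_out` would be frozen-model activations before and after the transformation under study (for example, a projection or pooling step), and `targets` would come from `band_energies` applied to the corresponding images. Under this assumed definition, a per-band value near zero would correspond to a "spectrally neutral" transformation in the abstract's sense, though in practice one would average over several random projections rather than a single seed.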