Same Concept, Different Directions: Cross-Modal Feature Heterogeneity in Sparse Autoencoders

2026-06-29 · Machine Learning

Machine Learning, Computer Vision and Pattern Recognition
AI summary

The authors study vision-language models, which map images and text into a joint embedding space, and find that the feature direction for a given concept can differ between the image and text modalities. They call this mismatch 'cross-modal feature heterogeneity' and show that it is a key driver of the modality split, in which a shared concept activates different sparse-autoencoder latents depending on whether the input is an image or text. To address this, they train a separate sparse autoencoder for each modality, preserving each modality's feature geometry, and then align corresponding features post hoc. The approach improves reconstruction fidelity and performance in cross-modal retrieval and concept steering.

vision-language models, joint embedding space, sparse autoencoders, modality, cross-modal feature heterogeneity, latent activations, feature geometry, cross-modal retrieval, concept steering
Authors
Chungpa Lee, Jihoon Kwon, Kyle Min, Jy-yong Sohn
Abstract
Vision-language models map images and text into a joint embedding space. However, these embeddings often entangle multiple semantic features, which limits their interpretability and controllability. While sparse autoencoders have emerged as a useful tool for decomposing these embeddings into monosemantic features, their application to joint embedding spaces has largely relied on an implicit, untested assumption that semantically corresponding features share the same directions across modalities. In this paper, we challenge this assumption by identifying discrepancies in feature directions for the same concept across image and text modalities, a phenomenon we term cross-modal feature heterogeneity. We demonstrate that this heterogeneity is a key driver of the modality split, where a shared concept activates different latents depending on the modality. This finding further reveals why aligning latent activations alone is insufficient to resolve the underlying feature mismatch. Motivated by this observation, we propose an approach that trains modality-specific sparse autoencoders to preserve each modality's feature geometry, and then aligns corresponding features post hoc. Our method improves reconstruction fidelity and enhances performance in cross-modal retrieval and concept steering.
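
To make the idea of post hoc alignment concrete, here is a minimal sketch in plain Python. The abstract does not specify how corresponding features are aligned, so everything below is an assumption for illustration: latents from two modality-specific sparse autoencoders are matched greedily by the correlation of their activations on paired image-text samples, and the cosine similarity between the matched decoder directions is then reported as a rough indicator of cross-modal feature heterogeneity. The function names (`pearson`, `cosine`, `match_latents`) and the toy activations and decoder rows are hypothetical, not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_latents(img_acts, txt_acts):
    """Greedy one-to-one matching of image latents to text latents.

    img_acts[n][i] is the activation of image latent i on paired sample n;
    txt_acts[n][j] is the activation of text latent j on the same pair.
    Assumption: latents that co-activate on paired data encode the same concept.
    """
    n_img, n_txt = len(img_acts[0]), len(txt_acts[0])
    img_cols = [[row[i] for row in img_acts] for i in range(n_img)]
    txt_cols = [[row[j] for row in txt_acts] for j in range(n_txt)]
    scored = sorted(
        ((pearson(img_cols[i], txt_cols[j]), i, j)
         for i in range(n_img) for j in range(n_txt)),
        reverse=True,
    )
    pairs, used_img, used_txt = {}, set(), set()
    for _, i, j in scored:
        if i not in used_img and j not in used_txt:
            pairs[i] = j
            used_img.add(i)
            used_txt.add(j)
    return pairs


# Toy latent activations on four paired samples (hypothetical values).
# Samples 0-1 share one concept, samples 2-3 share another.
img_acts = [[1.0, 0.0], [0.9, 0.0], [0.0, 1.0], [0.0, 0.8]]
txt_acts = [[0.0, 1.1], [0.0, 0.7], [0.9, 0.0], [1.0, 0.0]]

# Toy decoder directions, one row per latent (hypothetical values).
img_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
txt_dec = [[0.0, 0.8, 0.6], [0.6, 0.0, 0.8]]

pairs = match_latents(img_acts, txt_acts)
# Cosine between matched decoder directions: values below 1 mean the two
# autoencoders represent the matched concept along different directions.
cosines = {i: cosine(img_dec[i], txt_dec[j]) for i, j in pairs.items()}
print(pairs)
print(cosines)
```

Greedy matching is used here only to keep the sketch dependency-free; an optimal assignment (for example, via SciPy's `linear_sum_assignment`) would be a natural substitute when the number of latents is large.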