Cascaded Sparse Autoencoders Learn Multi-Level Visual Concepts in Multimodal LLMs

2026-06-15 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition • Artificial Intelligence • Machine Learning

AI summary

The authors explore ways to make visual features inside multimodal large language models (MLLMs) easier to understand. They introduce a new method called cascaded sparse autoencoders (CSAEs) that organizes visual concepts hierarchically by learning higher-level ideas from lower-level features. This method helps reveal more meaningful groups of visual concepts compared to older methods that only find flat features. Their experiments show CSAEs improve the clarity of these concept groupings and allow better control over model outputs by steering groups of related concepts. Overall, the authors provide a tool to better interpret and manipulate how MLLMs understand images.

Multimodal Large Language Models • Sparse Autoencoders • Hierarchical Concepts • Visual Representations • Model Interpretability • Feature Decomposition • Concept Steering • Qwen3-VL • Gemma-3 • LLaVA

Authors

Yusong Zhao, Hengyi Wang, Tanuja Ganu, Akshay Nambi, Hao Wang

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language tasks, yet their internal visual representations remain difficult to interpret. Sparse Autoencoders (SAEs) provide a scalable way to decompose dense model activations into sparse, interpretable features. However, existing SAE architectures primarily recover flat feature dictionaries and are less suited for explicit multi-level concept organization. In this paper, we introduce cascaded sparse autoencoders (CSAEs) for learning hierarchical visual concepts in MLLMs. Rather than nesting or stacking SAE sparse activation codes, CSAEs train a second-level SAE directly on the decoder weights of the first-level SAE, treating learned low-level feature directions as inputs for higher-level abstraction. This design enables CSAEs to learn "concepts of concepts" while avoiding both the shared-prefix coupling of nested, Matryoshka-style hierarchies and the bottlenecks of naively stacked SAEs. Experiments across Qwen3-VL, Gemma-3, and LLaVA on multiple visual datasets show that CSAEs improve interpretability, measured by hierarchical concept coherence, over state-of-the-art SAE baselines. Results on concept steering further demonstrate that the learned concept groups support effective group-level interventions in MLLM outputs.
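
The cascade described in the abstract can be illustrated with a toy sketch. The abstract does not specify the SAE architecture, objective, or sizes, so everything below is an assumption made for illustration: a small ReLU autoencoder with an L1 penalty trained by plain SGD, random vectors standing in for MLLM activations, and the hypothetical helper names `encode`, `train_sae`, and `cascade`. The only part taken from the abstract is the wiring: the second-level SAE is trained on the rows of the first-level decoder, not on the first-level sparse codes.

```python
import random

def encode(x, enc, bias):
    """ReLU encoder: one non-negative code per latent."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(enc, bias)]

def train_sae(inputs, n_latents, l1=0.01, lr=0.02, epochs=30, seed=0):
    """Train a small ReLU SAE with SGD (assumed objective: MSE + L1 on codes).

    Returns (encoder, bias, decoder); decoder[j] is the direction of latent j.
    """
    rng = random.Random(seed)
    dim = len(inputs[0])
    enc = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_latents)]
    dec = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_latents)]
    bias = [0.0] * n_latents
    for _ in range(epochs):
        for x in inputs:
            z = encode(x, enc, bias)
            recon = [sum(z[j] * dec[j][i] for j in range(n_latents))
                     for i in range(dim)]
            resid = [recon[i] - x[i] for i in range(dim)]
            for j in range(n_latents):
                if z[j] <= 0.0:
                    continue  # inactive latents receive no gradient through ReLU
                # d(loss)/d(z_j) for loss = ||recon - x||^2 + l1 * sum(z)
                grad_z = 2.0 * sum(dec[j][i] * resid[i] for i in range(dim)) + l1
                for i in range(dim):
                    dec[j][i] -= lr * 2.0 * z[j] * resid[i]
                    enc[j][i] -= lr * grad_z * x[i]
                bias[j] -= lr * grad_z
    return enc, bias, dec

def cascade(activations, n_low, n_high):
    """Level 1 learns features from activations; level 2 learns from level-1 decoder rows."""
    _, _, dec1 = train_sae(activations, n_low, seed=0)
    # Key step: the level-2 inputs are feature directions, not sparse activation codes.
    enc2, bias2, dec2 = train_sae(dec1, n_high, seed=1)
    groups = []
    for direction in dec1:
        z = encode(direction, enc2, bias2)
        # Assign each low-level feature to its strongest high-level latent
        # (falls back to index 0 if no latent is active).
        groups.append(max(range(n_high), key=lambda j: z[j]))
    return dec1, dec2, groups

# Synthetic stand-in for MLLM visual activations (assumption: 64 vectors of dimension 8).
rng = random.Random(42)
acts = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(64)]
dec1, dec2, groups = cascade(acts, n_low=16, n_high=4)
print(groups)  # one high-level group index per low-level feature
```

Because the second level operates on a fixed, small set of decoder rows rather than on per-token codes, its training set has exactly as many examples as the first level has features. The group assignment by strongest latent is one simple way to read off "concepts of concepts" and is not claimed to match the authors' procedure.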
