SPAR: Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Multimodal Models

2026-06-22 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition

AI summaryⓘ

The authors created a new system called SPAR that helps computers both understand and generate images better by combining two ways of processing visuals: one that looks at the overall meaning and one that focuses on detailed pixels. They designed a special method to connect these two approaches inside the same model without needing extra external guides. Additionally, their system can adaptively decide how to mix information depending on what's needed for each part of the image. Their experiments show this method improves the quality of image generation and reconstruction while keeping strong visual understanding.

Multimodal Large Language Modelssemantic perceptionpixel-level reconstructiontokenizerdiffusion modelself-alignmentTransformerlatent spacedynamic token routing

Authors

Hongxiang Li, Hongxu Chen, Chenyang Zhu, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Long Chen

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding but remain constrained in visual generation due to the fundamental feature discrepancy between semantic perception and pixel-level reconstruction. Bridging this gap requires overcoming two core challenges: endowing semantic encoders with high-fidelity reconstruction capabilities, and effectively aligning generative models with semantic spaces without relying on external teachers. To this end, we propose a novel unified multimodal framework featuring \textbf{S}emantic-\textbf{P}ixel self-alignment and \textbf{A}daptive \textbf{R}outing (\textbf{SPAR}). First, to reconcile semantic perception with pixel-level reconstruction, we introduce an asymmetric dual-stream unified tokenizer. A lightweight semantic stream anchors discriminative features, while a Transformer-augmented pixel stream recovers fine-grained visual details into a unified compact latent space. Second, to eliminate external dependencies, we propose a self-aligned generation paradigm that natively leverages this optimized tokenizer as an internal alignment teacher for the diffusion model. Furthermore, to facilitate flexible multimodal interaction within this unified space, we introduce Dynamic Token Routing, which enables each token to adaptively aggregate multi-layer MLLM features based on its distinct semantic demands. Extensive experiments demonstrate that SPAR establishes the state-of-the-art for unified architectures, achieving exceptional generation and reconstruction quality while preserving foundational visual understanding capabilities.

View PDFOpen arXiv