PathAR: Structure-First Autoregressive Synthesis of Multimodal Pathology Images
2026-06-01 • Computer Vision and Pattern Recognition
AI summary
The authors present PathAR, a framework that generates pathology images by modeling tissue structure and tissue appearance separately. Unlike previous methods that mix these factors in a single token stream, their approach keeps tissue shape consistent even when image appearance changes across imaging modalities. A dual vector-quantization tokenizer splits each sample into structure tokens and appearance tokens, and a transformer generates them so that structure guides appearance. The resulting synthetic image and mask pairs can improve downstream tasks such as segmentation, especially when real data is limited.
multimodal pathology, generative models, autoregressive synthesis, vector quantization, transformer, structure-appearance factorization, modality shift, mask-grounded tokens, image generation, segmentation
Authors
Yuan Zhang, Jiahao Xia, Junzhang Huang, Meng Wang, Feng Chen, Guanyu Yang, Huazhu Fu
Abstract
Data scarcity in multimodal pathology motivates unified generative models that synthesize modality-specific appearance while preserving anatomically coherent structure. Although modalities differ in appearance statistics, morphological structures such as cellular topology and tissue boundaries are largely preserved across acquisition protocols. However, existing methods often model these factors within a homogeneous token stream, which implicitly couples structure with appearance and weakens structural controllability under modality shifts. To address this, we propose Pathology Autoregressive modeling (PathAR), a structure-first autoregressive synthesis framework that explicitly factorizes structure and appearance for modality-label-conditioned pathology generation. PathAR employs a dual vector quantization (Dual-VQ) tokenizer to decompose samples into mask-grounded structure tokens and appearance tokens, and an interleaved autoregressive (IAR) transformer with asymmetric attention visibility to enforce structure-to-appearance dependence. PathAR stabilizes morphology under heterogeneous modality-specific appearances and enables spatially aligned image-mask pair generation. Extensive experiments show that PathAR improves structural consistency and modality fidelity over baselines, maintains sample diversity, supports downstream segmentation in data-scarce regimes, and demonstrates extensibility to finer-grained intra-modality organ-label variation.
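The abstract does not specify how the asymmetric attention visibility is defined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that structure and appearance tokens alternate in the sequence (structure at even indices, appearance at odd indices), that attention is causal, and that structure tokens may not attend to appearance tokens while appearance tokens may attend to both. The function name `interleaved_asymmetric_mask` and the token layout are hypothetical.

```python
import numpy as np


def interleaved_asymmetric_mask(num_positions: int) -> np.ndarray:
    """Build a boolean (T, T) attention mask with T = 2 * num_positions.

    mask[q, k] is True when query token q may attend to key token k.
    Assumed (hypothetical) layout: s_0, a_0, s_1, a_1, ... where s_i is a
    structure token and a_i is the appearance token paired with it.
    """
    seq_len = 2 * num_positions
    # Even indices are structure tokens, odd indices are appearance tokens.
    is_structure = np.arange(seq_len) % 2 == 0

    # Standard causal visibility: each token sees itself and earlier tokens.
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

    # Assumed asymmetry: a structure query never sees an appearance key,
    # so structure is generated independently of appearance.
    blocked = is_structure[:, None] & ~is_structure[None, :]

    return causal & ~blocked


if __name__ == "__main__":
    # Rows are queries, columns are keys, in the order s_0, a_0, s_1, a_1.
    print(interleaved_asymmetric_mask(2).astype(int))
```

Under these assumptions, the structure rows depend only on earlier structure tokens, while each appearance row can read its paired structure token and everything before it, which is one way to make the dependence run from structure to appearance. A boolean mask of this form, where True means the position takes part in attention, could then be passed to an attention implementation that accepts such masks.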