Decoupling Semantics from Distortions: Multi-Scale Two-Stream Vision-Language Alignment for AI-Generated Image Quality Assessment

2026-06-15 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition • Artificial Intelligence

AI summary

The authors address a problem in AI-generated image quality assessment: current vision-language models entangle understanding an image's overall meaning with detecting small quality defects. They propose MST-CLIPIQA, which processes each image at two scales using two CLIP encoders: a coarse view for overall meaning and a fine view for textures and artifacts. A gated fusion mechanism combines the two views, and optional cross-attention to the generation prompt helps judge how well an image matches its text prompt. Across five benchmarks the method reports average SRCC gains of 1.11 percent on quality and 2.35 percent on text-image correspondence, with only 0.8M trainable parameters.

Vision-language models • Image quality assessment • Semantic perception • Multi-scale analysis • CLIP encoder • Patch granularity • Information bottleneck • Cross-scale fusion • Text-image correspondence • Spearman Rank Correlation Coefficient (SRCC)

Authors

Zijie Meng

Abstract

Existing vision-language model (VLM)-based AI-generated image quality assessment (AIGIQA) methods suffer from a fundamental semantic-distortion dimensional conflict: monolithic representations optimized for semantic discrimination inherently entangle compositional understanding with low-level perceptual sensitivity, rendering them blind to fine-grained quality degradations. We introduce MST-CLIPIQA, a multi-scale two-stream framework that achieves hierarchical vision-language alignment through explicit representational decoupling. Our architecture leverages dual CLIP encoders with complementary patch granularities: coarse-grained streams capture global semantic coherence while fine-grained streams preserve textural signatures and artifact patterns. An information bottleneck-inspired gated fusion mechanism performs adaptive cross-scale distillation, with optional cross-attention enabling prompt-anchored correspondence evaluation when generation prompts are available. Extensive experiments across five benchmarks establish new state-of-the-art results, achieving average improvements of 1.11 percent SRCC on quality and 2.35 percent SRCC on text-image correspondence prediction, while maintaining efficiency with only 0.8M trainable parameters. Our project is available at https://github.com/YMlinfeng/MST-CLIPIQA.
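The abstract does not specify the form of the gated fusion, so the following is only an illustrative sketch of one common way to gate between a coarse-scale and a fine-scale feature vector. It is not the authors' implementation. The function names, the element-wise sigmoid gate, and the convex-combination rule are all assumptions made for illustration; in practice the weights would be learned and the inputs would come from the two CLIP encoders.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(
    coarse: Sequence[float],
    fine: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Hypothetical element-wise gated fusion of two feature vectors.

    For each dimension i (assumed form, not taken from the paper):
        gate_i  = sigmoid(weights[i] . [coarse; fine] + bias[i])
        fused_i = gate_i * coarse_i + (1 - gate_i) * fine_i

    `weights` has one row per feature dimension, each of length 2 * d.
    Returns the fused vector and the gate values.
    """
    if len(coarse) != len(fine):
        raise ValueError("coarse and fine features must have the same length")

    joint = list(coarse) + list(fine)  # the gate sees both streams
    fused: List[float] = []
    gates: List[float] = []
    for i, (c, f) in enumerate(zip(coarse, fine)):
        logit = sum(w * x for w, x in zip(weights[i], joint)) + bias[i]
        g = sigmoid(logit)
        gates.append(g)
        fused.append(g * c + (1.0 - g) * f)
    return fused, gates
```

With all-zero weights and biases every gate equals 0.5 and the output is the plain average of the two streams; a strongly positive bias pushes the output toward the coarse (semantic) stream, and a strongly negative one toward the fine (distortion) stream.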
