Distortion-Aware Fusion of Statistical and Vision-Language Features for Blind Image Quality Assessment

2026-06-01 | Computer Vision and Pattern Recognition

AI summary

The authors address blind image quality assessment: judging image quality without the original image for comparison. They combine two kinds of features, natural scene statistics (NSS) and vision-language model (VLM) embeddings, through a gating mechanism that decides how much to trust each one for a given input image. They report strong results on three benchmarks (KonIQ-10k, KADID-10k, and LIVE Challenge) and show that NSS features are most useful for noise and color-shift distortions, while VLM embeddings help more with perceptual changes such as color saturation. The gating mechanism learns these preferences on its own, without fine-tuning the underlying VLM backbones. The work shows that classical and modern features can be combined adaptively, per image, for evaluating image quality.

Blind Image Quality Assessment, Natural Scene Statistics, Vision-Language Models, Distortion-Aware Fusion, Gating Mechanism, Perceived Image Quality, Spearman Rank Correlation, KADID-10k, KonIQ-10k, CLIP
Authors
Bishr Omer Abdelrahman Adam, Xu Li
Abstract
Blind image quality assessment (BIQA) aims to predict perceived image quality without access to a reference image. Classical natural scene statistics (NSS) descriptors and modern vision-language model (VLM) embeddings address this problem from fundamentally different perspectives, yet two questions remain unexplored: whether combining them yields complementary benefits, and how to weight their contributions for each input image. We propose a distortion-aware fusion framework that integrates a 138-dimensional NSS descriptor with two complementary VLM embeddings, SigLIP and CLIP-H, through a multiplicative gating mechanism that learns per-input stream weights conditioned on image content. Unlike static concatenation fusion, the proposed gating network suppresses or amplifies each stream's contribution based on the input, producing weights that correlate positively (Spearman rank correlation ρ = 0.33) with the per-distortion NSS contribution measured by independent ablation on KADID-10k. The framework requires no end-to-end fine-tuning of the VLM backbones and is trained with a hybrid loss combining mean squared error, Pearson linear correlation, and pairwise ranking objectives. We evaluate on three standard benchmarks: KonIQ-10k (SROCC=0.9142, PLCC=0.9279), KADID-10k (SROCC=0.9715, PLCC=0.9733, surpassing recent state-of-the-art methods), and LIVE Challenge in-the-Wild (SROCC=0.8527, PLCC=0.8802 with cross-dataset pretraining and fine-tuning). A per-distortion analysis on KADID-10k reveals that NSS features contribute most on noise and color-shift distortions where pixel statistics are directly affected, and least on perceptual distortions such as color saturation changes. The learned gate values validate these findings, confirming that the model autonomously discovers distortion-stream affinity patterns consistent with the manual per-distortion study.
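
The multiplicative gating and the hybrid loss described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The sigmoid gate with one scalar weight per stream, the shared projection width, the VLM embedding sizes (placeholders here), the hinge form of the ranking term, and the equal loss weights are all assumptions made for illustration; only the 138-dimensional NSS descriptor and the three loss components come from the abstract.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(streams, proj, gate_w, gate_b):
    """Fuse feature streams with per-input multiplicative gates.

    streams: list of K arrays, each (batch, d_k), e.g. NSS, SigLIP, CLIP-H
    proj:    list of K matrices, each (d_k, d), mapping to a shared width d
    gate_w:  (K * d, K) gate weights; gate_b: (K,) gate bias
    Returns the fused features (batch, K * d) and the gates (batch, K).
    """
    z = [s @ p for s, p in zip(streams, proj)]   # K arrays of shape (batch, d)
    h = np.concatenate(z, axis=1)                # (batch, K * d)
    g = sigmoid(h @ gate_w + gate_b)             # one scalar gate per stream (assumed form)
    gated = [g[:, k:k + 1] * z[k] for k in range(len(z))]
    return np.concatenate(gated, axis=1), g


def hybrid_loss(pred, target, w_mse=1.0, w_plcc=1.0, w_rank=1.0, margin=0.0):
    """MSE + (1 - PLCC) + pairwise hinge ranking; weights are hypothetical."""
    mse = np.mean((pred - target) ** 2)

    pc = pred - pred.mean()
    tc = target - target.mean()
    plcc = (pc @ tc) / (np.linalg.norm(pc) * np.linalg.norm(tc) + 1e-8)

    # Penalize every pair whose predicted order disagrees with the target order.
    dp = pred[:, None] - pred[None, :]
    dt = target[:, None] - target[None, :]
    rank = np.mean(np.maximum(0.0, margin - np.sign(dt) * dp))

    return w_mse * mse + w_plcc * (1.0 - plcc) + w_rank * rank


# Toy usage with random features; 16 and 32 are placeholder embedding sizes.
rng = np.random.default_rng(0)
batch, d = 4, 8
dims = [138, 16, 32]
streams = [rng.normal(size=(batch, dk)) for dk in dims]
proj = [rng.normal(scale=0.1, size=(dk, d)) for dk in dims]
gate_w = rng.normal(scale=0.1, size=(len(dims) * d, len(dims)))
gate_b = np.zeros(len(dims))

fused, gates = gated_fusion(streams, proj, gate_w, gate_b)
print(fused.shape, gates.shape)
```

In this sketch each row of `gates` holds the three per-image stream weights, which is the quantity one would compare against per-distortion ablation results. A softmax over the streams would be an equally plausible gate; the abstract does not say which normalization is used.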