The Abstraction Gap in Vision-Language Causal Reasoning

2026-05-27

Computation and Language, Computer Vision and Pattern Recognition
AI summary

The authors studied vision-language models (VLMs) that explain the causes behind what is shown in images. They created a new benchmark called CAGE that checks whether these models truly reason about causes or just produce plausible-sounding explanations. The benchmark uses two probes: one measures language quality alone, and the other requires models to first write out explicit step-by-step causal chains. Seven of the eight models tested scored well on language but poorly on causal chains, and fine-tuning on chain-annotated examples did not close the gap. One model showed almost no gap, which suggests that the ability is achievable with current architectures and depends on pretraining and architectural choices. CAGE offers a way to measure this ability.

vision-language models, causal reasoning, causal chains, CAGE benchmark, linguistic plausibility, abstraction gap, pretraining, fine-tuning, Pearl's causal hierarchy
Authors
Chinh Hoang, Mohammad Rashedul Hasan
Abstract
Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properties. The Text-Only Probe measures linguistic quality. The Chain-Text Probe requires models to first generate explicit causal chains. The Abstraction Gap (AG) metric quantifies the normalized performance difference between the two probes. Evaluating eight VLMs on CAGE (Causal Abstraction Gap Evaluation), a benchmark of 49,500 questions across 5,500 images spanning Pearl's causal hierarchy, we find that seven models exhibit AG exceeding 0.50, with text scores of 6–8 but chain scores below 2.5. Fine-tuning on 45,000 chain-annotated examples fails to close the gap. However, one model achieves near-zero AG, so the capability exists within current VLM architectures and depends on pretraining and architectural choices. CAGE provides a diagnostic tool for assessing faithful causal reasoning in VLMs.
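
The abstract does not give the formula for AG, so the following is only a minimal sketch of one plausible reading: the gap between the text score and the chain score, divided by the text score. The function name `abstraction_gap` and the choice of normalizer are assumptions made for illustration, not the authors' definition.

```python
def abstraction_gap(text_score: float, chain_score: float) -> float:
    """Hypothetical AG: (text - chain) / text.

    ASSUMPTION: the abstract only says AG is a "normalized performance
    difference"; normalizing by the text score is a guess made for this sketch.
    """
    if text_score <= 0:
        raise ValueError("text_score must be positive to normalize by it")
    return (text_score - chain_score) / text_score


# Illustrative inputs only, not results from the paper.
high_gap = abstraction_gap(text_score=8.0, chain_score=2.0)
no_gap = abstraction_gap(text_score=6.0, chain_score=6.0)
```

Under this reading, a model whose chain score matches its text score has an AG of zero, and a model with a strong text score but a weak chain score has an AG approaching one.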