AI summary
The authors created COCOLogic-V2, a dataset for testing how well computer models can reason about objects in real-world images using logic rules. They sorted the samples into positive cases and two types of negative cases: negatives that lie far from the decision boundary and negatives that lie close to it, which allows detailed checks on where models struggle. Their tests show that models separate positives from far-from-boundary negatives well but fail on near-boundary negatives. They also found that perceptual noise and the large search spaces induced by the rules make learning harder when only a few examples are available. Overall, the authors emphasize that visual inductive reasoning remains an open problem and that their dataset offers a concrete foundation for further research in this area.
concept bottleneck models, program synthesis, visual inductive reasoning, first-order logic, object-centric dataset, few-shot learning, near-boundary samples, COCOLogic-V2, model accountability, perceptual noise
Authors
David Steinmann, Antonia Wüst, Kristian Kersting, Wolfgang Stammer
Abstract
While interpretable models such as concept bottleneck models (CBMs) and program synthesis methods enable verification of model decisions, their evaluation is typically limited to simple tasks, leaving complex reasoning on real-world images largely unexplored. We introduce COCOLogic-V2, an object-centric dataset for visual inductive reasoning on real-world images covering a broad subset of first-order logic. By categorizing samples into positive variants, near-boundary (NB), and far-from-boundary (FB) negatives, COCOLogic-V2 enables fine-grained diagnosis of model accountability. Our evaluations show that models tend to separate positive and FB samples well but fail on NB samples, while perceptual noise and large rule-induced search spaces pose additional challenges in few-shot settings. Together, these results highlight that visual inductive reasoning remains an open challenge and COCOLogic-V2 provides a concrete foundation for advancing methods in this direction.
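The fine-grained diagnosis described above amounts to reporting accuracy separately for each sample category instead of as one aggregate number. The following is a minimal sketch of that breakdown, not the authors' evaluation code: the function name `accuracy_by_category`, the record layout, and the category strings are assumptions made for illustration, and the toy records are invented inputs, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return accuracy per sample category.

    `records` is an iterable of (category, true_label, predicted_label)
    tuples. This layout is hypothetical, chosen only for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, y_true, y_pred in records:
        total[category] += 1
        correct[category] += int(y_true == y_pred)
    return {category: correct[category] / total[category] for category in total}


# Invented toy predictions for illustration only (not data from the paper).
# Labels: 1 = sample satisfies the rule, 0 = sample violates it.
toy_records = [
    ("positive", 1, 1),
    ("positive", 1, 1),
    ("NB", 0, 1),  # near-boundary negative misclassified as positive
    ("NB", 0, 0),
    ("FB", 0, 0),
    ("FB", 0, 0),
]

per_category = accuracy_by_category(toy_records)
for category, accuracy in per_category.items():
    print(f"{category}: {accuracy:.2f}")
```

Reporting the three categories side by side makes a failure confined to near-boundary negatives visible, whereas a single overall accuracy would average it away.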