The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

2026-05-11

Artificial Intelligence · Computation and Language · Computer Vision and Pattern Recognition
AI summary

The authors created KnotBench, a large dataset of knot images paired with tasks that test how well vision-language models understand knots. They evaluated Claude Opus 4.7 and GPT-5 on tasks such as recognizing knots and predicting moves, and found that the models often struggled, sometimes performing barely better than random guessing. Even with extra reasoning steps, the models had limited success in interpreting knot diagrams or simulating transformations. The study suggests that while current models can perceive knot features, they cannot yet manipulate or reason about them effectively.

knot diagram · vision-language model · prime knot · crossing number · knot signature · equivalence judgment · move prediction · Regina software · diagram-to-symbol transcription · cross-modal grounding
Authors
Hao Liu, Jicheng Liu
Abstract
A vision-language model can look at a knot diagram and report what it sees, yet fail to act on that structure. KnotBench pairs an 858,318-image corpus built from 1,951 prime-knot prototypes (crossing numbers 3 to 19) with a protocol whose answers are checked against Regina's canonical knot signature. Its 14 tasks span four families: equivalence judgment, move prediction, identification, and cross-modal grounding; an image-versus-symbol split locates failures along the perception-operation gap. We score Claude Opus 4.7 and GPT-5, each with and without thinking, under a 64K output-token budget matched across both vendors. Of the 56 (task, model) cases (14 tasks × 4 model configurations), 15 sit at or below the random baseline, and 8 of 14 tasks have a best score under 1.5× random. On diagram-to-symbol transcription, no model produces a strictly correct string, and permissive Regina decoding recovers the knot in only 0 to 4 of 100 items. Thinking-mode reasoning lifts overall accuracy by 1.65 points for Claude and 9.25 points for GPT-5, narrowing the gap only modestly. Read together, the four families suggest that current vision-language models hold the features of a diagram but lack the apparatus to simulate moves on those features.
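
To make the signature-based scoring concrete, below is a minimal sketch in Python, assuming Regina's Python bindings (regina.Link.fromKnotSig and Link.knotSig; exact method names can differ across Regina versions). The permissive-decoding helper is a hypothetical stand-in for the paper's decoding step, not the authors' code.

import regina

def decode_permissively(raw):
    # Hypothetical stand-in for the paper's "permissive Regina decoding":
    # strip whitespace and tolerate malformed strings instead of failing hard.
    candidate = raw.strip()
    try:
        link = regina.Link.fromKnotSig(candidate)
    except Exception:   # newer Regina versions raise on malformed signatures
        return None
    return link         # older versions may return None instead of raising

def same_knot(predicted_raw, truth_sig):
    # Regina's knot signature is canonical for a diagram (up to the
    # relabelling/reflection/reversal symmetries it quotients out), so
    # comparing signature strings is a valid automatic check against the
    # ground-truth prototype's signature.
    link = decode_permissively(predicted_raw)
    return link is not None and link.knotSig() == truth_sig

Because the signature reduces the comparison to string equality, answers can be graded automatically at corpus scale; whether equality certifies the same diagram or the same knot type depends on how the protocol normalizes diagrams before signing, a detail the abstract leaves to the paper.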