Structural Grid Descriptors Predict Within-Task Solver Success on ARC-AGI

2026-06-08 • Machine Learning

Machine Learning

AI summaryⓘ

The authors studied how certain patterns in intermediate states of a puzzle-solving process predict whether two different AI solvers will succeed on specific tasks. They found that a single measure related to grid complexity reliably distinguishes successful from failed attempts, and this prediction works across both solvers and many tasks. Using this finding, they could stop the solving early without losing much success, saving computing time. They also found that many tasks fail immediately because the puzzle's rules don't allow any moves, not because of solver limits.

ARC (Abstraction and Reasoning Corpus)AGI (Artificial General Intelligence)beam searchStochastic Depth-First Search (SDFS)mutual informationgrid stateearly stoppingsearch budgetDSL (Domain Specific Language)

Authors

Ayan Pendharkar

Abstract

We ask whether structural properties of intermediate grid states predict whether a symbolic ARC-AGI solver will succeed, framed as a test of conditional mutual information I(X;Y|task) > 0. Across 44,800 runs spanning two architecturally distinct solvers (beam search and Stochastic DFS), 400 ARC tasks, 28 configurations per solver, and both training and evaluation splits, hand-crafted grid descriptors measured at 50% trajectory completion discriminate successful from failed runs within the same task (mean within-task best-feature AUC = 0.885, p < 0.001 under within-task label permutation). Most predictive content lies along a single grid-complexity axis. The result generalizes across solver architectures: a feature selected on one solver predicts success on the other with AUC 0.747-0.762 in all four transfer directions (p < 0.001, leakage controlled). On a pre-registered held-out set of 41 reliable tasks, the frozen feature n_components_final achieves AUC = 0.765 (95% CI [0.717, 0.810], p < 0.001), robust under task-clustered bootstrap resampling and cross-solver task collapsing. The signal is not explained by solver capacity (configuration-residualized AUC = 0.927 and 0.896 for beam search and SDFS, p < 0.001) and is only weakly coupled to score trajectories (R^2 approximately 0). Early stopping at 50% completion reduces beam-search compute by 33.6% while retaining 98.9% of solves; degenerate-trajectory detection reduces SDFS compute by 65.3% with no solve loss. Finally, on 229 of 400 evaluation tasks the DSL primitive library produces no valid transition from the input grid. This 0-step collapse is invariant to search budget and universally failed by beam search, indicating a DSL coverage limitation rather than a search-budget effect.

View PDFOpen arXiv