Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

2026-06-15
Computation and Language

AI summary

The authors studied how well large language models (LLMs) can learn hidden rules by asking yes/no questions and checking whether their guesses are correct, similar to solving a puzzle step by step. They tested these models on a controlled task in which the hidden rule is a deterministic finite automaton (DFA). They found that as the DFA grows, the models struggle more, especially with planning which questions to ask, combining the evidence they gather, and constructing a hypothesis. Reasoning models did better but still lag behind classic automata-learning algorithms. Overall, the authors show that while LLMs sometimes succeed, they are not yet as reliable or efficient as traditional methods for this kind of learning task.

Large Language Models (LLMs), Deterministic Finite Automata (DFA), Agentic Automata Learning, Membership Queries, Equivalence Queries, Oracle Interaction, Interactive Discovery, Query Planning, Hypothesis Construction, Classic Automata Learning Algorithms
Authors
Reef Menaged, Gili Lior, Shauli Ravfogel, Roee Aharoni, Gabriel Stanovsky
Abstract
We propose agentic automata learning to evaluate the extent to which tool-calling LLM agents can uncover hidden environments through interaction. In our setup, an agent should uncover a hidden deterministic finite automaton (DFA) by interacting with an oracle through (1) membership queries ("Does this string belong to the target language?") and (2) equivalence queries ("Is this the target DFA?"). This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines (classic automata-learning algorithms). Evaluating state-of-the-art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than non-reasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, our results show that current LLM agents can sometimes perform non-trivial interactive discovery, but remain far less robust and efficient than classic algorithms for the task.
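
To make the two query types concrete, the following is a minimal sketch of a DFA oracle. It is not the authors' implementation: the names `DFA`, `Oracle`, `membership_query`, and `equivalence_query` are hypothetical, and returning a shortest counterexample from a failed equivalence query is an assumption borrowed from the usual query-learning setup, not something the abstract specifies.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DFA:
    """A complete DFA: every (state, symbol) pair has a transition."""
    alphabet: tuple
    transitions: dict  # maps (state, symbol) -> next state
    start: int
    accepting: frozenset

    def accepts(self, word: str) -> bool:
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        return state in self.accepting


class Oracle:
    """Hypothetical oracle that hides a target DFA and counts queries."""

    def __init__(self, target: DFA):
        self._target = target
        self.membership_count = 0
        self.equivalence_count = 0

    def membership_query(self, word: str) -> bool:
        """Does this string belong to the target language?"""
        self.membership_count += 1
        return self._target.accepts(word)

    def equivalence_query(self, hypothesis: DFA):
        """Is this the target DFA? Returns None if the languages match,
        otherwise a shortest string on which the two DFAs disagree
        (assumed behavior; both DFAs must share the same alphabet)."""
        self.equivalence_count += 1
        start = (self._target.start, hypothesis.start)
        seen = {start}
        queue = deque([(start, "")])
        # Breadth-first search over the product automaton.
        while queue:
            (t, h), word = queue.popleft()
            if (t in self._target.accepting) != (h in hypothesis.accepting):
                return word
            for symbol in self._target.alphabet:
                nxt = (
                    self._target.transitions[(t, symbol)],
                    hypothesis.transitions[(h, symbol)],
                )
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + symbol))
        return None


# Illustrative target: strings over {a, b} with an even number of 'a's.
target = DFA(
    alphabet=("a", "b"),
    transitions={(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1},
    start=0,
    accepting=frozenset({0}),
)

# A deliberately wrong first hypothesis: accept every string.
accept_all = DFA(
    alphabet=("a", "b"),
    transitions={(0, "a"): 0, (0, "b"): 0},
    start=0,
    accepting=frozenset({0}),
)
```

In this sketch, an agent would call `membership_query` to probe individual strings and `equivalence_query` to test a candidate DFA. The two counters give one possible way to measure interaction efficiency, which the abstract names as a measurable quantity without defining the metric.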