NeuReasoner: Theory-grounded Mapping of Reasoning Elicitation Boundaries

2026-06-29 | Machine Learning

AI summary

The authors studied whether large language models can show reasoning skills without extra training, using a new method called NeuReasoner that combines a lens inspired by brain function with a lens based on a theory of reasoning steps, all inside one model. They tested it on tasks from math, coding, and cognitive psychology to see when reasoning can be uncovered through prompting alone. They found that, at sufficient model scale, NeuReasoner works well for arithmetic, code generation, Bayesian reasoning, and reward learning, but struggles with risk-taking and decision making under uncertainty. They also found that model size changes how much elicitation helps, widening its advantage on some tasks and erasing it on others. Overall, their work maps where reasoning can and cannot be recovered from language models without extra training.

large language models, elicitation, NeuReasoner, functional specificity, Erotetic Theory of Reasoning, CogBench, arithmetic reasoning, Bayesian reasoning, decision making, model scale
Authors
Aydin Javadov, Shyngys Aitkazinov, Tobias Hoesli, Florian von Wangenheim, Bjoern Schuller, Joseph Ollier
Abstract
A growing body of work suggests that the reasoning capabilities of large language models are largely latent in their base form, with post-training primarily amplifying rather than introducing them. However, this evidence comes mainly from mathematical and coding benchmarks, leaving the boundary conditions of that claim largely unexplored: which cognitive tasks can be recovered through elicitation, and where does that recovery fail? To investigate this, we introduce NeuReasoner, a theory-grounded elicitation instrument. At each step, an orchestrator pairs a Neuro Lens, inspired by functional specificity, with a Cognitive Lens, drawn from the Erotetic Theory of Reasoning, and integrates their outputs through internal modularization of a single model, without external tools. We evaluate NeuReasoner on CogBench, a suite of behavioral tasks from cognitive psychology, alongside standard mathematical and coding benchmarks, measuring both its improvement over vanilla inference and its ability to match a model's post-trained thinking mode. At sufficient scale, NeuReasoner matches or exceeds thinking-mode baselines on arithmetic reasoning, code generation, Bayesian reasoning, and reward learning; these gains persist against self-consistency and iterative-refinement baselines matched to NeuReasoner's per-decision call budget. NeuReasoner also reveals clear boundaries. Risk-taking and decision making under uncertainty remain hard to recover through elicitation alone, and model scale interacts with elicitation in both directions, widening its advantage on some cognitive signatures while erasing it on others. Overall, using NeuReasoner as a modular, interpretable, theory-grounded elicitation instrument, we empirically map where reasoning elicitation succeeds and fails, beyond the mathematical and coding benchmarks on which prior claims have rested.
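The control flow described in the abstract, and the idea of a budget-matched baseline, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt tags, the function names (`lens_elicitation`, `self_consistency`, `CountingModel`), the fixed step count, and the three-calls-per-step layout are all assumptions made for illustration, and the abstract does not specify how the lenses are prompted or how their outputs are integrated. The only points taken from the abstract are that two lenses are paired at each step, that every role is played by the same single model without external tools, and that baselines are compared at an equal number of model calls per decision.

```python
from collections import Counter
from typing import Callable

# Hypothetical model interface: a prompt string in, a text completion out.
Model = Callable[[str], str]


class CountingModel:
    """Wraps a model callable and counts calls, so budgets can be compared."""

    def __init__(self, model: Model) -> None:
        self.model = model
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return self.model(prompt)


def lens_elicitation(model: Model, task: str, steps: int = 2) -> str:
    """Assumed two-lens loop: each role is just a prompt to the same model.

    The bracketed tags are placeholders, not the paper's actual prompts.
    """
    state = task
    for _ in range(steps):
        neuro = model(f"[neuro lens] {state}")
        cognitive = model(f"[cognitive lens] {state}")
        # Integration is also done by the same model (no external tools).
        state = model(
            f"[integrate] task: {task}\nneuro: {neuro}\ncognitive: {cognitive}"
        )
    return state


def self_consistency(model: Model, task: str, budget: int) -> str:
    """Standard self-consistency: majority vote over `budget` samples."""
    answers = [model(task) for _ in range(budget)]
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Stub model for demonstration only; a real run would call an LLM here.
    elicited = CountingModel(lambda prompt: "42")
    lens_elicitation(elicited, "What is 6 * 7?", steps=2)

    # Give the baseline exactly as many calls as the lens loop used.
    baseline = CountingModel(lambda prompt: "42")
    self_consistency(baseline, "What is 6 * 7?", budget=elicited.calls)
    print(elicited.calls, baseline.calls)
```

Counting calls through a wrapper keeps the comparison honest under these assumptions: the baseline's budget is read off the elicitation run rather than hard-coded, so changing the number of steps automatically changes the baseline's sample count.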