HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs

2026-06-22 • Artificial Intelligence

Artificial Intelligence

AI summaryⓘ

The authors created HOLMES, a new test for AI that checks if large language models can understand complicated rules and logic that go beyond simple facts. Unlike past tests, HOLMES focuses on higher-order logic involving reasoning about rules, functions, and constraints, especially in real-world areas like law and finance. Their experiments show that current AI models often struggle with these tasks, doing only about half as well as might be hoped. They also found that some models appear to guess answers correctly without truly understanding the reasoning. This suggests that better handling of higher-order logic is needed for more reliable AI.

Higher-Order LogicLarge Language ModelsSymbolic ReasoningBenchmarkNatural Language ProcessingFormalizationLaw and Finance AICompositional ReasoningVerifiable Reasoning Traces

Authors

Yucheng Wu, Jundong Xu, Mingzhen Ju, Yue Yu, Chenpeng Wang, Haoxuan Li, Liangming Pan

Abstract

Logical reasoning is essential for reliable AI, yet existing benchmarks are largely first-order-logic-centric, focusing on object-level deduction over fixed predicates. This misses many realistic scenarios where models must reason over rules, predicates, functions, constraints, and decision procedures themselves. We introduce HOLMES (Higher-Order Logic Meets real-world Explainable Symbolic reasoning), the first real-world benchmark for higher-order symbolic reasoning in LLMs, containing 1379 instances. Built on higher-order logic, HOLMES pairs natural-language problems with HOL formalizations, ground-truth answers, verifiable reasoning traces, and fine-grained controllable reasoning factors across law and finance. Experiments show that current LLMs still struggle on HOLMES, with an average accuracy of only 50.64% and the best model reaching 59.54%. Our analyses further reveal that high final-answer accuracy can mask shortcut reasoning in conflict-resolution settings, while performance drops sharply under scope-conditioned and compositional reasoning. These findings identify higher-order symbolic reasoning as a key bottleneck for building reliable and verifiable LLMs. The project code and dataset are publicly available at https://github.com/wuyucheng2002/HOLMES.

View PDFOpen arXiv