CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
2026-05-08 • Computation and Language; Artificial Intelligence
AI summary
The authors address the problem of generating correct SQL queries from natural language questions, especially for very hard tasks in the Bird-Bench dataset. They propose CA-SQL, a system that adjusts how many possible query solutions it tries based on how difficult the task seems, uses a special method inspired by evolutionary search to encourage variety in its guesses, and then votes to pick the best query. Their approach works better than previous methods, even those using bigger models, achieving strong results on the challenging parts of the benchmark. Overall, the authors' method improves the exploration and selection of candidate solutions for Text-to-SQL problems.
Text-to-SQL · Bird-Bench (BIRD) · inference-time learning · solution space exploration · evolutionary search · prompt seeding · voting method · execution accuracy · Soft F1 score · GPT-4o-mini
Authors
James Petullo, Nianwen Xue
Abstract
While recent advancements in inference-time learning have improved LLM reasoning on Text-to-SQL tasks, current solutions still struggle to perform well on the most challenging tasks in the Bird-Bench (BIRD) benchmark. This is due to inadequate solution space exploration, which is necessary to uncover promising candidate queries that can be further refined to produce the correct output. To address this challenge, we introduce CA-SQL, a novel Text-to-SQL pipeline that uses the estimated difficulty of a task to dynamically scale the breadth of exploration when generating solution candidates. In addition, we use a custom prompt seeding method, based on principles of evolutionary search, to further elicit exploratory behavior from the base LLM, and a novel voting method to select the best candidate solution at the end of the search. Experiments demonstrate that our solution achieves a state-of-the-art score of 51.72% on the "challenging" tier of BIRD development set problems using only GPT-4o-mini, outperforming other in-context learning approaches, even those that leverage larger models. Overall, our method attains a competitive 61.06% execution accuracy and 68.77% Soft F1 score on the BIRD development dataset.
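The core idea in the abstract (difficulty-scaled candidate generation followed by voting over candidates) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the functions `estimate_difficulty`, `generate_sql`, and `execute` are hypothetical stand-ins for a difficulty classifier, an LLM sampling call, and a SQL engine, and the budget values are assumed for the example. The voting here groups candidates by execution result and keeps the largest cluster, in the spirit of self-consistency voting; the paper's actual voting method may differ.

```python
# Sketch of complexity-aware budget allocation + execution-result voting.
# All names and budget values below are illustrative assumptions.

BUDGETS = {"simple": 4, "moderate": 8, "challenging": 16}  # assumed tiers

def select_query(question, schema, estimate_difficulty, generate_sql, execute):
    # 1. Estimate task difficulty and scale the exploration breadth.
    tier = estimate_difficulty(question, schema)   # e.g. "challenging"
    n = BUDGETS.get(tier, 8)                       # harder task -> more samples

    # 2. Sample n candidate queries (seed varies the prompt/sampling).
    candidates = [generate_sql(question, schema, seed=i) for i in range(n)]

    # 3. Vote: cluster candidates by their execution result and
    #    return a query from the largest agreement cluster.
    clusters = {}
    for sql in candidates:
        key = repr(execute(sql))                   # hashable result signature
        clusters.setdefault(key, []).append(sql)
    winner = max(clusters.values(), key=len)
    return winner[0]
```

In practice the execution step would run each candidate against the target database with a timeout, and ties between equally large clusters would need a tie-breaking rule (e.g. preferring the shorter query).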