SynQL: A Controllable and Scalable Rule-Based Framework for SQL Workload Synthesis for Performance Benchmarking

2026-04-09 • Databases


AI summary

The authors address the shortage of real SQL query data caused by privacy rules by creating SynQL, a tool that generates realistic and diverse SQL queries for training learned database components such as cost models. Instead of relying on probabilistic text generation, which can reference columns that do not exist, SynQL builds each query directly from the actual database structure, so every query is valid by construction. The method gives explicit control over aspects of a query such as how tables join and how selective the filters are. Tests on the TPC-H and IMDb benchmarks showed that SynQL produces a wide variety of join shapes, and that cost models trained on its queries predict costs accurately and quickly on held-out synthetic queries, which makes it a possible source of training data when real query logs are unavailable.

SQL, query optimizer, workload synthesis, foreign-key graph, abstract syntax tree (AST), join topology, predicate selectivity, TPC-H benchmark, cost model, machine learning

Authors

Kahan Mehta, Amit Mankodi

Abstract

Database research and the development of learned query optimisers rely heavily on realistic SQL workloads. Acquiring real-world queries is increasingly difficult, however, due to strict privacy regulations, and publicly released anonymised traces typically strip out executable query text to preserve confidentiality. Existing synthesis tools fail to bridge this training-data gap: traditional benchmarks offer too few fixed templates for statistical generalisation, while Large Language Model (LLM) approaches suffer from schema hallucination (fabricating non-existent columns) and topological collapse (systematically defaulting to simplistic join patterns that fail to stress-test query optimisers). We propose SynQL, a deterministic workload synthesis framework that generates structurally diverse, execution-ready SQL workloads. As a foundational step toward bridging the training-data gap, SynQL targets the core SQL fragment -- multi-table joins with projections, aggregations, and range predicates -- which dominates analytical workloads. SynQL abandons probabilistic text generation in favour of traversing the live database's foreign-key graph to populate an Abstract Syntax Tree (AST), guaranteeing schema and syntactic validity by construction. A configuration vector $\Theta$ provides explicit, parametric control over join topology (Star, Chain, Fork), analytical intensity, and predicate selectivity. Experiments on TPC-H and IMDb show that SynQL produces near-maximally diverse workloads (Topological Entropy $H = 1.53$ bits) and that tree-based cost models trained on the synthetic corpus achieve $R^2 \ge 0.79$ on held-out synthetic test sets with sub-millisecond inference latency, establishing SynQL as an effective foundation for generating training data when production logs are inaccessible.
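
To make two ideas from the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hand-written toy foreign-key graph (`FK_GRAPH`, using TPC-H table and column names) in place of introspecting a live database, builds only a Chain-shaped join as a plain SQL string instead of an AST, and assumes that Topological Entropy is the Shannon entropy of the distribution over topology labels. The function names `chain_join` and `topological_entropy` are hypothetical. Under that entropy assumption, three topology classes give a maximum of $\log_2 3 \approx 1.585$ bits, which is consistent with the reported $H = 1.53$ bits being described as near-maximal.

```python
import math
from collections import Counter

# Hypothetical toy foreign-key graph (an assumption for illustration):
# child table -> list of (fk_column, parent_table, parent_pk_column).
FK_GRAPH = {
    "lineitem": [("l_orderkey", "orders", "o_orderkey")],
    "orders": [("o_custkey", "customer", "c_custkey")],
    "customer": [("c_nationkey", "nation", "n_nationkey")],
    "nation": [],
}


def chain_join(start, hops):
    """Follow up to `hops` foreign-key edges from `start` and emit a chain join.

    Every table and column comes from FK_GRAPH, so the query cannot
    reference a schema element that does not exist.
    """
    tables, conditions = [start], []
    current = start
    for _ in range(hops):
        edges = FK_GRAPH.get(current, [])
        if not edges:
            break  # no outgoing foreign key: the chain ends here
        fk_col, parent, pk_col = edges[0]
        conditions.append(f"{current}.{fk_col} = {parent}.{pk_col}")
        tables.append(parent)
        current = parent
    sql = f"SELECT COUNT(*) FROM {tables[0]}"
    for table, cond in zip(tables[1:], conditions):
        sql += f" JOIN {table} ON {cond}"
    return sql


def topological_entropy(labels):
    """Shannon entropy (in bits) of the topology-label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    print(chain_join("lineitem", 2))
    # A perfectly balanced workload over three topologies reaches log2(3) bits.
    print(topological_entropy(["Star", "Chain", "Fork"] * 10))
```

A workload in which every query uses the same topology would score 0 bits under this definition, which is the "topological collapse" end of the scale; a generator that samples Star, Chain, and Fork shapes roughly evenly approaches the $\log_2 3$ upper bound.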
