SemCEB: A Cardinality Estimation Benchmark for Semantic Operators

2026-06-22

Databases
AI summary

The authors discuss how modern data systems expose multi-modal large language models as semantic operators: SQL filters and joins whose predicates are written as natural-language instructions over text and images. Because each tuple is far more expensive to evaluate than in a traditional query, a poor plan choice can be worse by orders of magnitude, so the optimizer needs estimates of how many results an operator will return that are accurate but also fast and cheap. To address this, the authors introduce SemCEB, a benchmark of 102 hand-curated queries over a real-world dataset of text and images that tests cardinality estimation methods for semantic filters and joins. Their evaluation finds that sampling is robust across predicate categories but costly and does not scale, while their adaptation of Semantic Histograms is limited in applicability and sensitive to the predicate category.

multi-modal large language models, semantic operators, cardinality estimation, query optimization, semantic filters, semantic joins, sampling algorithms, Semantic Histograms, query plan, benchmark
Authors
Andreas Zimmerer, Claudius Kühn, Yang Li, Mihail Stoian, Renata Borovica-Gajic, Andreas Kipf
Abstract
Modern data systems increasingly expose multi-modal large language models as semantic operators: SQL operators, including filters and joins, whose predicates are defined by a natural-language instruction. Query optimization in these systems still rests on the same foundations as in traditional databases (plan enumeration and cost models), yet faces new challenges, e.g., a larger plan space and the lack of efficient cardinality estimates. The elevated per-tuple costs of semantic operators make bad plan choices worse by orders of magnitude. Therefore, precise (but also fast and cheap) cardinality estimates for semantic filters and joins are of high importance for optimizing query plans that include semantic operators. In this paper, we introduce SemCEB, the first benchmark for cardinality estimation over semantic operators, based on a real-world dataset of (semi-)structured text and images with 102 hand-curated, diverse queries spanning a wide range of selectivities, assessing cardinality estimation for semantic filters and joins in isolation. We evaluate sampling-based algorithms and Semantic Histograms, a state-of-the-art cardinality estimation algorithm for semantic operators, with respect to their accuracy, cost, latency, and memory overhead. We show that, while sampling is robust across different predicate categories, it does not scale and comes with high costs. Our adaptation of Semantic Histograms, on the other hand, is limited in its applicability, and its performance appears sensitive to the predicate category.
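
To make the cost trade-off of sampling concrete, the following is a minimal sketch of a generic uniform-sampling estimator for a semantic filter. It is not the benchmark's implementation: the abstract does not specify which sampling algorithms are evaluated, and the names `estimate_filter_cardinality`, `predicate`, and `sample_size` are hypothetical. The `predicate` callable stands in for a model call that decides whether one row satisfies the natural-language instruction.

```python
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def estimate_filter_cardinality(
    rows: Sequence[T],
    predicate: Callable[[T], bool],  # assumption: stand-in for one LLM call per row
    sample_size: int,
    seed: int = 0,
) -> float:
    """Estimate how many rows satisfy `predicate` from a uniform sample."""
    if not rows or sample_size <= 0:
        return 0.0
    n = min(sample_size, len(rows))
    rng = random.Random(seed)
    sample = rng.sample(rows, n)  # uniform sample without replacement
    # Exactly n predicate evaluations: this is where the cost accrues.
    hits = sum(1 for row in sample if predicate(row))
    # Scale the sample hit count up to the full table size.
    return len(rows) * hits / n


if __name__ == "__main__":
    # Toy data; a string comparison replaces the expensive semantic predicate.
    table = ["cat"] * 30 + ["dog"] * 70
    print(estimate_filter_cardinality(table, lambda r: r == "cat", sample_size=20))
```

In this sketch the number of predicate evaluations equals the sample size, so a tighter estimate costs proportionally more model calls, which is one way to read the abstract's remark that sampling comes with high costs.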