Filtered ANN as a Phase Transition: When Selectivity-Estimation Error Causes Plan Regret

2026-06-15 • Machine Learning

Machine LearningDatabases

AI summaryⓘ

The authors study how to choose the best way to run filtered approximate-nearest-neighbor (ANN) searches depending on the selectivity of a filter, which is how many items pass a certain condition. They find that small errors in estimating selectivity mostly cause problems only near specific boundary points where the best method changes. These boundaries are mathematically linked to particular parameters like the number of neighbors requested and graph degree, and the authors confirm these findings with experiments on synthetic and real data. Their work helps understand when mistakes in selectivity estimation affect search quality but does not introduce a new search algorithm. They also provide code and pre-registered analysis details publicly.

approximate nearest neighbor (ANN)filtered queryselectivityphase boundaryplan regretorder statisticssite percolationfinite-size scalingselectivity estimationgraph degree

Authors

Madhulatha Mandarapu, Sandeep Kunkunuru

Abstract

A filtered approximate-nearest-neighbor (ANN) query returns the k nearest vectors among those satisfying an attribute predicate P of selectivity s. The best execution strategy -- pre-filter, post-filter, or in-filter -- changes with s, so a system must estimate s and choose. We model this as an argmax over a landscape with phases (regions where each strategy wins) separated by boundaries, and show that selectivity-estimation error produces plan regret -- recall lost versus the oracle strategy -- only in the critical regions around those boundaries. The regret is a wedge of log-width equal to the multiplicative estimation error epsilon and height equal to the local cliff |V'(s*)| epsilon; the flip-margin 1/|V'(s*)| is the condition number of a sibling cardinality-estimation study reappearing as the local boundary theory. The two phase boundaries follow from independent mathematics: order statistics place the post-filter cliff at s ~ k/K, and site percolation places the in-filter cliff at s_c ~ 0.83/M for graph degree M (corpus-size independent). Criticality exists only under a constrained budget B < sqrt(k n). Under pre-registered decision rules we confirm, on synthetic sweeps and real SIFT1M, that regret concentrates ~290x at the boundary and that the regret curves obey a finite-size scaling collapse onto one universal wedge across two decades of corpus size. A real approximate index does not mis-locate the boundary, but a biased cost model opens a persistent miscalibration band that estimation-error robustness cannot fix. The contribution is a characterization, not a new index. Code and the full pre-registration are public.

View PDFOpen arXiv