Doing well with less! On Sampling Techniques for Empirical Pairwise Loss Estimation/Minimization

2026-06-01Machine Learning

Machine Learning
AI summary

The authors look at machine learning tasks that compare pairs of items but can be very slow because they use every possible pair. They show that by smartly picking only some pairs—not individual items—to examine, you can get nearly the same results while saving time. Their method uses special sampling techniques that prioritize pairs likely to be most useful based on extra information. This approach works well especially for comparing complex data like images or graph features, balancing accuracy with faster computations.

pairwise losssimilarity learningrankingclusteringsurvey samplinghigh-dimensional vectorsembeddingscomputational costauxiliary informationmachine learning optimization
Authors
Louise Davy, Stephan Clémençon, Charlotte Laclau
Abstract
Many machine learning problems, including similarity learning, ranking, and clustering, rely on empirical pairwise loss functions whose quadratic computational cost quickly becomes prohibitive at scale. We demonstrate how a frugal approach that retains only a fraction of the available information on pairs can achieve estimation or optimization performance comparable to that obtained by using all pairs, by leveraging survey sampling techniques. A central finding, supported by both theory and experiments, is that such sampling plans must target pairs directly rather than individual observations. In particular, for pairwise losses between high-dimensional vectors such as embeddings in vision or graph learning, assigning higher inclusion probabilities to informative pairs using suitable auxiliary information yields performance close to full pairwise evaluation, providing a principled and theoretically grounded trade-off between accuracy and computational cost.