Many Needles in a Haystack: Active Hit Discovery for Perturbation Experiments

2026-05-11 • Machine Learning

AI summary

The authors address the challenge of finding as many gene perturbations with a strong effect as possible when only a limited number of experiments can be run. They point out that existing methods either waste budget exploring low-value candidates or focus too heavily on a single top candidate. To improve on this, they propose Probability-of-Hit, a method that ranks perturbations by how likely their effect is to exceed a predefined threshold. They prove that the approach is asymptotically optimal and report better results than baselines on both simulated data and real immunology experiments.

gene perturbation, hit discovery, Bayesian optimization, experimental design, acquisition function, posterior probability, threshold exceedance, immunology datasets

Authors

Andrea Rubbi, Arpit Merchant, Samuel Ogden, Amir Akbarnejad, Pietro Liò, Sattar Vakili, Mo Lotfollahi

Abstract

High-throughput gene perturbation experiments can test several genetic interventions in parallel, yet experimental budgets remain limited. A central goal is hit discovery: identifying as many perturbations as possible whose phenotypic effect exceeds a predefined threshold. Pure exploration strategies are statistically inefficient, wasting budget on low-value regions. Bayesian optimization methods offer a principled alternative but target a single global optimum, over-exploiting dominant modes while neglecting other high-value regions. We formalize hit discovery as a sequential experimental design problem and propose Probability-of-Hit, an acquisition function that directly targets threshold exceedance by ranking candidates according to their posterior probability of being a hit. We prove asymptotic optimality of this approach and demonstrate strong empirical performance on both synthetic benchmarks and real biological immunology datasets, including up to 6.4% improvement over baselines on the Schmidt IL-2 dataset.
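
To make the ranking idea concrete, here is a minimal sketch of a probability-of-hit score. It assumes the posterior over each candidate's effect is Gaussian with a known mean and standard deviation, which the abstract does not state; under that assumption the probability of exceeding the threshold is 1 - Φ((τ - μ) / σ). The names `probability_of_hit` and `select_batch`, and the example numbers, are hypothetical and chosen only for illustration, not taken from the paper.

```python
import math


def probability_of_hit(mean, std, threshold):
    """P(effect > threshold) under an ASSUMED Gaussian posterior N(mean, std^2)."""
    if std <= 0:
        # Degenerate posterior: the effect is treated as known exactly.
        return 1.0 if mean > threshold else 0.0
    z = (threshold - mean) / std
    # 1 - Phi(z), written with erfc for numerical stability in the tails.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def select_batch(means, stds, threshold, batch_size):
    """Rank candidates by probability of hit and return the top indices and all scores."""
    scores = [probability_of_hit(m, s, threshold) for m, s in zip(means, stds)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size], scores


# Illustrative posterior summaries for four hypothetical perturbations.
means = [0.2, 1.5, 0.9, 1.1]
stds = [0.1, 0.3, 0.8, 0.05]
chosen, scores = select_batch(means, stds, threshold=1.0, batch_size=2)
print(chosen)
```

In this toy example the candidate with mean 1.1 and a small standard deviation is ranked above the candidate with the highest mean (1.5), because its posterior places more mass above the threshold. That illustrates how targeting threshold exceedance can differ from chasing a single maximum.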
