Assessing Sample Quality in Conditional Generation under Compositional Shift

2026-06-08 · Machine Learning

AI summary

The authors developed a method for judging how trustworthy generated samples are when a conditional generator is asked for attribute combinations that were rare or absent in its training data. The method assigns each sample a score that combines how realistic it is with how well it matches each requested attribute, and it needs no real examples from the new conditions. The score can be used to rank and filter generated outputs, which improves results in biological imaging and on controlled vision benchmarks, and it can also be applied during generation to abstain before decoding finishes. The approach works on existing pretrained models without extra training.

Conditional generation, Extrapolation, Data manifold, Attribute-wise faithfulness, Generative models, Out-of-distribution detection, Biological imaging, Sample quality evaluation, Controllable generation, Model abstention
Authors
Berker Demirel, Valentino Maiorca, Marco Fumero, Theofanis Karaletsos, Francesco Locatello
Abstract
Conditional generators provide a natural tool for controllable generation, including settings where the desired condition is a new composition of observed attributes or experimental factors. In many applications, especially in scientific domains, such models are attractive for exploring conditions for which real samples are rare, expensive, or not yet observed. However, this creates a circularity for evaluation: standard conditional quality metrics require a reference target distribution, but in the extrapolative regime that distribution is unavailable by definition. We address this problem with a post-hoc, per-sample trust score for assessing conditional samples using only the training distribution. The score combines two estimable quantities: global realism, measuring compatibility with the real data manifold, and attribute-wise faithfulness, measuring whether a sample is closer to the requested attributes than to plausible alternatives. We show that, under a mild coverage condition on the observed attributes, the score can recover meaningful comparisons across extrapolated generations. These comparisons enable effective filtering and ranking of generations, as well as abstention, and can be used directly on off-the-shelf pretrained models. In biological imaging, selected samples preserve real morphological structure better and improve downstream predictive performance, while similar gains are observed on controlled vision benchmarks. Finally, we show how the score can be applied during generation, enabling abstention before full decoding. Code is available at https://github.com/berkerdemirel/faithful-cond-gen.
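
To make the two ingredients of the score concrete, the following is a minimal sketch and not the authors' implementation (their code is at the repository linked above). It assumes samples have already been mapped to feature vectors, and every name and formula in it is hypothetical: realism is approximated by the distance to the k-th nearest training embedding, faithfulness by comparing squared distances to per-attribute centroids, and the two are combined by a plain product. The sketch only needs each requested attribute value to appear somewhere in the training set, even if the requested combination never does.

```python
import numpy as np

def realism(z, train_z, k=5):
    # Assumed proxy for global realism: closer to the training
    # embeddings (small k-th nearest-neighbour distance) scores higher.
    d = np.sort(np.linalg.norm(train_z - z, axis=1))
    return float(np.exp(-d[k - 1]))

def faithfulness(z, train_z, train_attrs, requested):
    # Assumed proxy for attribute-wise faithfulness: for each attribute,
    # compare the distance to the centroid of the requested value with
    # the distance to the closest alternative value; keep the worst case.
    scores = []
    for j, want in enumerate(requested):
        dist2 = {}
        for v in np.unique(train_attrs[:, j]):
            centroid = train_z[train_attrs[:, j] == v].mean(axis=0)
            dist2[int(v)] = float(np.sum((z - centroid) ** 2))
        d_req = dist2[int(want)]
        d_alt = min(d for v, d in dist2.items() if v != int(want))
        margin = d_alt - d_req                     # > 0: requested value is closer
        scores.append(0.5 * (1.0 + np.tanh(0.5 * margin)))  # stable sigmoid
    return float(min(scores))

def trust(z, train_z, train_attrs, requested, k=5):
    # Hypothetical combination rule: product of the two proxies.
    return realism(z, train_z, k) * faithfulness(z, train_z, train_attrs, requested)

# Toy data: two binary attributes, each shifting one coordinate by 4.
# The combination (1, 1) is never observed during "training".
rng = np.random.default_rng(0)
combos = [(0, 0), (0, 1), (1, 0)]
train_z = np.vstack([4.0 * np.array(c) + 0.3 * rng.normal(size=(50, 2)) for c in combos])
train_attrs = np.repeat(np.array(combos), 50, axis=0)

requested = (1, 1)                      # unseen composition
candidates = np.array([
    [4.0, 4.0],                         # matches both requested attributes
    [4.0, 0.0],                         # realistic, but second attribute is wrong
    [10.0, 10.0],                       # far from all training data
])
scores = np.array([trust(z, train_z, train_attrs, requested) for z in candidates])
ranking = np.argsort(-scores)           # best candidate first
keep = scores >= 1e-3                   # hypothetical abstention threshold
```

Here `ranking` orders the candidates for selection and `keep` marks those that pass the threshold; the threshold value, the choice of `k`, and the feature space would all need to be tuned for a real model.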