The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

2026-06-18
Computer Vision and Pattern Recognition

AI summary

The authors studied how reliable the Frechet Inception Distance (FID) score is when evaluating image generation models. They found that retraining a model with a different random seed shifts FID about 3.2x more than drawing new samples from the same trained model. They attribute that gap to three sources of training randomness: random initialisation, data ordering, and the per-step noise of the flow-matching loss. Increasing model size or compute barely reduces the variability, whereas tuning classifier-free guidance per cell roughly halves the spread, although it also changes which seeds come out best. Based on this, they suggest reporting FID with error bars over multiple training runs rather than a single number, and treating FID gaps below roughly 1.3% as inconclusive.

Frechet Inception Distance, image generation, random seed, model retraining, variance, classifier-free guidance, coefficient of variation, ImageNet, flow-matching loss
Authors
Nicolas Dufour, Alexei A. Efros, Patrick Pérez
Abstract
The Frechet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
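To make the recommended reporting concrete, here is a minimal sketch of how a two-axis seed panel might be summarized. It is not the authors' estimator: the abstract measures the training-versus-sampling gap in Inception feature space, whereas this sketch works directly on scalar FID values. The panel layout (`fid[i, j]` for training seed `i` and generation seed `j`), the function names, the choice to normalise a gap by the mean of the two scores, and the numbers are all assumptions made for illustration.

```python
import numpy as np


def summarize_seed_panel(fid):
    """Summarize a panel where fid[i, j] is the FID of training seed i
    evaluated with generation seed j (hypothetical layout)."""
    fid = np.asarray(fid, dtype=float)
    if fid.ndim != 2 or min(fid.shape) < 2:
        raise ValueError("need at least 2 training seeds and 2 generation seeds")

    per_model = fid.mean(axis=1)  # one averaged FID per trained model
    grand_mean = fid.mean()

    # Spread across retrained models (training-seed axis).
    train_std = per_model.std(ddof=1)
    # Pooled spread from resampling a fixed model (generation-seed axis).
    gen_std = np.sqrt(fid.var(axis=1, ddof=1).mean())

    return {
        "mean": grand_mean,
        "train_std": train_std,
        "gen_std": gen_std,
        "cov": train_std / grand_mean,  # coefficient of variation over training seeds
    }


def gap_is_conclusive(fid_a, fid_b, cov):
    """Assumed decision rule: a gap counts only if its relative size exceeds the CoV."""
    rel_gap = abs(fid_a - fid_b) / (0.5 * (fid_a + fid_b))
    return rel_gap > cov


# Made-up numbers for illustration only; these are not results from the paper.
panel = [[10.0, 10.2],
         [11.0, 11.2],
         [12.0, 12.2]]
stats = summarize_seed_panel(panel)
print(f"FID = {stats['mean']:.2f} +/- {stats['train_std']:.2f} "
      f"(over {len(panel)} training seeds), CoV = {stats['cov']:.1%}")
print("Gap 10.0 vs 10.1 conclusive at 1.3% CoV?",
      gap_is_conclusive(10.0, 10.1, cov=0.013))
```

Averaging over generation seeds before taking the standard deviation keeps most sampling noise out of the training-seed spread, though with few generation seeds some of it still leaks in, so the `train_std` here is a naive estimate rather than a clean variance decomposition.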