Why Are DMD Students Lazy? Understanding the Copying Behavior in Few-Step Distillation

2026-06-01 · Machine Learning

AI summary

The authors studied Distribution Matching Distillation (DMD), a way to turn a pretrained diffusion model into a fast few-step generator by matching the student’s noised output distributions to the teacher’s. Because this supervision acts only on distributions, the student is free in principle to map each latent noise sample to a different output than the teacher does, and in low dimensions it does exactly that. In high dimensions, however, the student reproduces the teacher’s original noise-data pairings. This is not caused by adversarial objectives or by teacher memorization; the evidence suggests it emerges from the student’s limited geometric freedom during high-dimensional distillation. The authors call this behavior "copying."

Diffusion models, Model distillation, Distribution matching, High-dimensional data, Latent noise, Adversarial training, Teacher-student models, Geometric constraints
Authors
Shucheng Li, Iolo Jones, Alexander Tong, Michael M. Bronstein
Abstract
Distribution Matching Distillation (DMD) compresses pretrained diffusion models into efficient few-step generators by aligning their noised distributions across all scales. In principle, such distribution-level supervision remains agnostic to specific noise-data pairings of the teacher; this provides the student the freedom to remap latent noise, a behavior consistently observed in low-dimensional settings. Surprisingly, we find that in high-dimensional settings, distilled students spontaneously reproduce the original noise-data pairings of the teacher, a phenomenon we term copying. We demonstrate that copying is neither a byproduct of adversarial objectives nor a result of teacher memorization. Instead, our evidence suggests that copying is an emergent property arising from the limited geometric freedom of the student model during high-dimensional distillation.
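
The claim that distribution-level supervision is agnostic to noise-data pairings can be illustrated with a toy example. The sketch below is not DMD training and is not taken from the paper: it assumes a hypothetical linear "teacher" x = A z acting on 2-D Gaussian noise and compares two hypothetical students, one that copies the teacher map and one that first rotates the noise. Both students produce the same output distribution, because isotropic Gaussian noise is rotation invariant, yet only the first reproduces the teacher's pairings. All names (`A`, `rotation`, `pairing_cosine`) are illustrative assumptions.

```python
import numpy as np


def rotation(theta):
    """2-D rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def pairing_cosine(a, b):
    """Mean cosine similarity between corresponding rows of a and b."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))


rng = np.random.default_rng(0)
z = rng.standard_normal((100_000, 2))   # shared latent noise samples

# Hypothetical teacher: a fixed linear map from noise to data, x = A z.
A = np.array([[2.0, 0.0], [0.0, 0.5]])
teacher = z @ A.T

# Student 1 ("copying"): the same noise-to-data map as the teacher.
copying_student = z @ A.T

# Student 2 ("remapping"): rotate the noise first, x = A R z.
# R z has the same N(0, I) law as z, so the output distribution is unchanged.
R = rotation(np.pi / 2)
remapping_student = (z @ R.T) @ A.T

# Distribution-level check: both students match the teacher's covariance.
cov_teacher = np.cov(teacher, rowvar=False)
cov_copy = np.cov(copying_student, rowvar=False)
cov_remap = np.cov(remapping_student, rowvar=False)

# Pairing-level check: only the copying student agrees sample by sample.
cos_copy = pairing_cosine(teacher, copying_student)
cos_remap = pairing_cosine(teacher, remapping_student)

print("max covariance gap (copying):  ", np.abs(cov_copy - cov_teacher).max())
print("max covariance gap (remapping):", np.abs(cov_remap - cov_teacher).max())
print("pairing cosine (copying):  ", cos_copy)
print("pairing cosine (remapping):", cos_remap)
```

In this toy setting a loss that only compares output distributions cannot tell the two students apart, which is the sense in which the student is free to remap latent noise; the pairing cosine is one simple, assumed way to quantify whether a student copies the teacher.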