ContinuousBench: Can Differentially Private Synthetic Text Improve Capabilities?
2026-06-01 • Machine Learning
Machine Learning, Computation and Language, Cryptography and Security
AI summary
The authors study whether synthetic text generated with differential privacy (DP) actually carries new information from sensitive data. They point out that existing evaluations cannot establish this, because their tasks are nearly solvable without training on the original data. To address this, they create ContinuousBench, a regularly regenerated benchmark whose questions cannot be answered correctly without the original corpus. Their results show that while non-private synthetic data transfers substantial knowledge, current DP methods generally fail to do so, even at a large privacy parameter (ε = 100), where the privacy guarantee is weak. This suggests DP text synthesis may not yet be a reliable substitute for direct data access.
Differential Privacy, Text Synthesis, Machine Learning Benchmarks, Synthetic Data, Capability Evaluation, Continuous Benchmarking, Question Answering, Privacy Parameter (ε), Data Privacy
Authors
Peihan Liu, Lucas Rosenblatt, Weiwei Kong, Natalia Ponomareva, Gautam Kamath, Rachel Cummings, Roxana Geambasu, Yu Gan, Lillian Tsai, Alex Bie
Abstract
Differentially private (DP) text synthesis promises to unlock sensitive corpora for model training, but it remains unclear whether DP synthetic data transmits genuinely new knowledge and capabilities present only in those corpora. This is because existing evaluations rely on tasks that are nearly solvable without training, so strong benchmark performance does not establish that DP synthesis can substitute for access to the original data. We therefore introduce ContinuousBench, a continuously and automatically regenerated benchmark that measures capability gain from DP synthetic text. Each quarter, a new release pairs a never-before-seen training corpus with a derived QA set, constructed to be: (1) unsolvable without the corpus; and (2) learnable under DP, as the tested knowledge is supported by hundreds of independent records. Researchers produce DP synthetic data from the training corpus and run our standardized training and evaluation harness on their synthetic data to measure gains. We instantiate two tracks: Geminon, a procedurally generated dataset about fictional creatures; and News, a stream of newly crawled public news articles. Although standard benchmarks are nearly saturated, on ContinuousBench we find that non-private synthesis transfers substantial knowledge from the original corpus, while state-of-the-art DP synthesis methods generally fail to do so, even at $\varepsilon=100$.
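To make the notion of "capability gain" concrete, the sketch below shows one plausible way such a quantity could be computed: QA accuracy of a model trained on the synthetic data, minus the accuracy of the same model before training. This is an illustration under stated assumptions, not the benchmark's actual harness. The names `QAItem`, `accuracy`, and `capability_gain`, the case-insensitive exact-match scoring rule, and the toy answers are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class QAItem:
    """One question/answer pair from a derived QA set (hypothetical structure)."""
    question: str
    answer: str


def accuracy(predictions, items):
    """Fraction of items answered correctly.

    Assumption: answers are scored by case-insensitive exact match after
    stripping whitespace. The real harness may score differently.
    """
    if not items:
        raise ValueError("QA set must not be empty")
    if len(predictions) != len(items):
        raise ValueError("need exactly one prediction per QA item")
    correct = sum(
        pred.strip().lower() == item.answer.strip().lower()
        for pred, item in zip(predictions, items)
    )
    return correct / len(items)


def capability_gain(base_predictions, trained_predictions, items):
    """Accuracy after training on synthetic data minus accuracy before.

    Assumption: gain is a simple difference of accuracies on the same QA set.
    """
    return accuracy(trained_predictions, items) - accuracy(base_predictions, items)


# Toy, made-up data purely for illustration (not benchmark content).
items = [
    QAItem("q1", "red"),
    QAItem("q2", "blue"),
    QAItem("q3", "green"),
    QAItem("q4", "gold"),
]
base_predictions = ["unknown", "blue", "unknown", "unknown"]
trained_predictions = ["Red", "blue", "green", "silver"]

gain = capability_gain(base_predictions, trained_predictions, items)
print(f"capability gain: {gain:.2f}")
```

Under this framing, a QA set that is unsolvable without the corpus keeps the pre-training accuracy near chance, so any measured gain can be attributed to knowledge carried by the synthetic data.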