SelPE: Progressive Selection for Private Structured Text Synthesis

2026-06-22 · Cryptography and Security

AI summary

The authors address the problem of creating private synthetic structured text data when very few real examples are available, such as in sensitive areas like healthcare or finance. They propose SelPE, a method that carefully selects and evolves candidate text sequences within strict privacy limits instead of training large noisy models. SelPE generates the text in two stages to keep the structure accurate and uses a specialized way to measure similarity that respects different data types. Their experiments show that SelPE produces more realistic and useful synthetic data, especially when the original dataset is very small.

Differential Privacy, Structured Text, Data Synthesis, Privacy Budget, Semantic Abstraction, Schema Realization, Contrastive Expansion, Multi-channel Distance Kernel, Low-data Regimes, Synthetic Data Generation
Authors
Xuancheng Zhu, Guoshun Nan, Han Zhang, Ben Niu, Yang Yue, Zixu Wang, Yilian Liu, Min Lei, Xiaofeng Tao
Abstract
Many data-driven applications rely on structured textual records, such as clinical triage notes and financial transaction logs, for downstream learning and decision-making. In privacy-sensitive domains, access to such records is strictly regulated, often resulting in only a small number of available private examples for model development and analysis. Yet existing differentially private data synthesis methods fall short: tabular techniques cannot faithfully model free-form text, while text-based approaches often break structural constraints. We propose SelPE, a selection-guided progressive evolution framework for small-sample private structured text synthesis. Rather than relying on noisy aggregation or private model training, SelPE concentrates the privacy budget on a sequence of multi-batch top-1 selections, enabling efficient guidance under tight privacy constraints. To support faithful and valid synthesis, SelPE decouples semantic abstraction from schema realization via a two-stage generation pipeline, and evaluates candidates using a multi-channel distance kernel that jointly models textual, categorical, and numeric fields in their native representations. A non-private contrastive expansion mechanism further promotes diversity without incurring additional privacy cost. Extensive experiments demonstrate that SelPE consistently improves structural validity, fidelity, and downstream utility under strict differential privacy budgets, particularly in low-data regimes.
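The abstract does not spell out the selection mechanism or the distance kernel, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes (1) a hypothetical three-channel distance that averages a string-similarity term, a categorical mismatch rate, and a range-normalized numeric gap, (2) a nearest-candidate voting score, and (3) the standard exponential mechanism, sampled with the Gumbel-max trick, as the private top-1 selector. All function names, record fields, and toy records are invented for illustration.

```python
import math
import random
from difflib import SequenceMatcher


def record_distance(a, b, num_ranges, weights=(1.0, 1.0, 1.0)):
    """Hypothetical multi-channel distance in [0, 1] between two records.

    Each record is a dict with "text" (str), "cat" (dict), and "num" (dict).
    Every channel is compared in its native representation.
    """
    w_text, w_cat, w_num = weights
    d_text = 1.0 - SequenceMatcher(None, a["text"], b["text"]).ratio()
    d_cat = sum(a["cat"][k] != b["cat"][k] for k in a["cat"]) / max(len(a["cat"]), 1)
    d_num = sum(
        min(abs(a["num"][k] - b["num"][k]) / num_ranges[k], 1.0) for k in a["num"]
    ) / max(len(a["num"]), 1)
    return (w_text * d_text + w_cat * d_cat + w_num * d_num) / (w_text + w_cat + w_num)


def nearest_candidate_votes(private, candidates, num_ranges):
    """Each private record votes for its closest candidate.

    One record contributes to exactly one count, so changing a single
    record changes any candidate's score by at most 1 (sensitivity 1).
    """
    votes = [0] * len(candidates)
    for rec in private:
        dists = [record_distance(rec, c, num_ranges) for c in candidates]
        votes[dists.index(min(dists))] += 1
    return votes


def private_top1(scores, epsilon, rng, sensitivity=1.0):
    """Exponential mechanism via Gumbel-max: returns the index of one candidate.

    Adding Gumbel(0, 1) noise to epsilon * score / (2 * sensitivity) and taking
    the argmax samples index i with probability proportional to
    exp(epsilon * score_i / (2 * sensitivity)).
    """
    def gumbel():
        u = max(rng.random(), 1e-300)  # guard against log(0)
        return -math.log(-math.log(u))

    noisy = [epsilon * s / (2.0 * sensitivity) + gumbel() for s in scores]
    return noisy.index(max(noisy))


# Toy data, invented for illustration only.
private = [
    {"text": "chest pain, shortness of breath", "cat": {"triage": "urgent"}, "num": {"age": 67}},
    {"text": "chest pain radiating to left arm", "cat": {"triage": "urgent"}, "num": {"age": 71}},
    {"text": "mild headache", "cat": {"triage": "routine"}, "num": {"age": 30}},
]
candidates = [
    {"text": "chest pain and dyspnea", "cat": {"triage": "urgent"}, "num": {"age": 65}},
    {"text": "sprained ankle", "cat": {"triage": "routine"}, "num": {"age": 25}},
]
num_ranges = {"age": 100.0}

votes = nearest_candidate_votes(private, candidates, num_ranges)
winner = private_top1(votes, epsilon=1.0, rng=random.Random(0))
print(votes, winner)
```

In this sketch the only step that touches private data is the single index returned by `private_top1`, which is why a selection-based design can spend its budget on a few bits of guidance rather than on noisy aggregates. Repeating the selection over several batches or rounds would compose the privacy cost across calls; how SelPE allocates its budget across rounds is not stated in the abstract.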