PrivCode++: Latent-Conditioned Differentially Private Code Generation for Comprehensive Guarantees

2026-06-08
Cryptography and Security

AI summary

The authors address the problem of fine-tuning large language models when both the code and the prompts are sensitive, without leaking private information. They introduce PrivCode-Plus, a two-stage differentially private method that fine-tunes models and synthesizes data without direct access to the sensitive prompts or code. This improves the utility and diversity of the generated code while preserving privacy. Their experiments show that PrivCode-Plus achieves higher utility than baseline methods while still protecting data effectively.

Differential Privacy, Large Language Models, Code Generation, Fine-tuning, Privacy Leakage, Sensitive Data, Data Synthesis, Privacy-Free Latent Conditioning
Authors
Zheng Liu, Chen Gong, Terry Yue Zhuo, Zhou Yang, Kecen Li, Wenlong Meng, Xinwen Hou, Yu Liu, Xiaochen Li
Abstract
Large language models fine-tuned on instruction-code pairs may memorize and subsequently leak sensitive training data. Existing differentially private (DP) code generation methods primarily protect code snippets while assuming prompts are public, an assumption that fails in realistic scenarios where prompts may also contain sensitive information. When prompts cannot be explicitly learned or used during generation, code synthesis suffers from severe utility degradation as well as reduced diversity and fidelity. To address these challenges, we propose PrivCode-Plus, the first work to explore DP code generation where both prompts and code snippets are considered sensitive in LLM fine-tuning. PrivCode-Plus introduces a two-stage DP framework with a Privacy-Free Latent Conditioning module, enabling effective DP fine-tuning and data synthesis without direct access to sensitive prompts or code. Extensive experiments show that PrivCode-Plus achieves substantially higher utility than baselines, remains competitive with the method that relaxes the privacy assumptions, and provides stronger privacy guarantees.