Auditing Machine Unlearning: A Systematic Research on Whether Models Truly Forget
2026-06-15 • Machine Learning
AI summary
The authors study how well different methods actually erase specific data from machine learning models, a process called machine unlearning. They propose an auditing framework that checks whether unlearning really worked without retraining from scratch, training many shadow models, or intervening in the original training process. Across six datasets and ten unlearning methods, they find that retraining-based and fine-tuning-based methods remove the data's influence, de-optimization-based methods only degrade model performance, and Fisher/Hessian-based methods fail even when formal certification is provided. The framework is also robust against fake unlearning attempts and generalizes to large language models.
machine unlearning, data privacy, auditing, retraining, shadow models, fine-tuning, de-optimization, Fisher information, Hessian matrix, large language models
Authors
Dayong Ye, Tianqing Zhu, Ruiding Huang, Xinbo Fu, Jiayang Li, Bo Liu, Huan Huo, Wanlei Zhou
Abstract
Machine unlearning has been extensively studied in response to growing privacy concerns and regulatory requirements. However, auditing whether unlearning algorithms have truly erased the influence of specific data remains an open challenge. The lack of reliable and practical auditing mechanisms can lead to critical privacy risks, such as residual information leakage. This paper initiates a systematic investigation into whether existing unlearning algorithms can truly forget the designated data. We propose the first practical and general-purpose auditing framework for machine unlearning, inspired by the concept of proof of ignorance. Our framework addresses the key practicality limitations of existing methods by eliminating the need for retraining-from-scratch baselines, avoiding the training of large numbers of shadow models, and requiring no intrusive intervention in the original training process. To evaluate the effectiveness of our framework, we first conduct validation experiments to verify its soundness and completeness. We then perform comprehensive experiments across six datasets and ten representative unlearning methods. The results demonstrate that our framework reliably distinguishes between successful and failed unlearning. In particular, we observe that retraining-based and fine-tuning-based methods can achieve effective unlearning, even when the target data remain in the original dataset. In contrast, de-optimization-based methods fail to achieve true unlearning and instead degrade the model's performance. Fisher/Hessian-based methods also fail to unlearn the requested data, even when formal certification is provided. Moreover, we show that our framework is robust against fake unlearning attempts and generalizes well to large language models.