Exposing the Illusion of Erasure in Knowledge Editing for LLMs

2026-06-22 • Machine Learning

Machine Learning • Artificial Intelligence • Cryptography and Security

AI summary

The authors studied how Knowledge Editing (KE) techniques update facts in large language models without full retraining. They found that these edits often do not remove the old information but only hide it, so the original facts can still resurface. Their analysis shows that KE updates mainly suppress original facts rather than erase them, and that the edited regions are fragile and can be circumvented with indirect prompts or adversarial attacks. This suggests current KE methods are not fully reliable and can be bypassed.

Knowledge Editing • Large Language Models • Low-rank Updates • Model Representations • Loss Landscape • Adversarial Attacks • Targeted Suppression • Mechanistic Analysis

Authors

Advik Raj Basani, Anshuman Chhabra

Abstract

Knowledge Editing (KE) has emerged as a frontier for updating specific facts in LLMs without costly retraining, but its reliability and underlying mechanisms remain poorly understood. In this work, we examine KE from an adversarial elicitation perspective, revealing that edited knowledge is often not fully erased and continues to surface, with consistent failures observed across diverse model architectures. To explain this behavior, we conduct a mechanistic analysis of popular KE methods. We show that low-rank updates do not overwrite existing knowledge but instead redistribute it within the model's representation space. Furthermore, we find that these methods act as targeted suppression mechanisms that reduce the likelihood of expressing original facts, rather than removing them from the model. Analysis of the loss landscape reveals that edited knowledge lies in narrow, anisotropic regions that are highly sensitive to perturbations, making them highly vulnerable to indirect prompting and adversarial attacks. By exposing these profound architectural vulnerabilities, our work proves that KE algorithms are inherently bypassable and motivates a fundamental reevaluation of how we deploy post-hoc updates in several LLM applications.
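
To make the notion of a low-rank update concrete, the following is a minimal sketch of a generic rank-one edit applied to a toy weight matrix. It is not the authors' method or any specific KE algorithm. The matrix `W`, the key `k`, the value `v_new`, and the helper names are all hypothetical, chosen only for illustration. The sketch shows that such an update changes the layer's output along the edited key direction while leaving every other direction of the original matrix untouched, which is in line with the abstract's point that low-rank edits do not overwrite the existing weights wholesale.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def rank_one_edit(W, k, v_new):
    """Generic rank-one update (illustrative assumption, not a specific KE method):
    W' = W + (v_new - W k) k^T / (k^T k), which guarantees W' k = v_new."""
    residual = [vn - vo for vn, vo in zip(v_new, matvec(W, k))]
    kk = sum(x * x for x in k)
    return [[w + r * kj / kk for w, kj in zip(row, k)]
            for row, r in zip(W, residual)]

random.seed(0)
d = 4
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # toy weights

k = [1.0, 0.0, 0.0, 0.0]       # hypothetical key representing the edited fact
v_new = [1.0, 2.0, 3.0, 4.0]   # hypothetical value representing the new fact

W_edit = rank_one_edit(W, k, v_new)

# On the exact key, the edited layer returns the new value.
on_key = matvec(W_edit, k)

# On an input orthogonal to the key, the output is unchanged.
q = [0.0, 1.0, 0.0, 0.0]
off_key_shift = max(abs(a - b)
                    for a, b in zip(matvec(W_edit, q), matvec(W, q)))

# Count how many weight entries the edit actually modified.
changed = sum(1 for row_e, row_o in zip(W_edit, W)
              for a, b in zip(row_e, row_o) if a != b)

print("output on key:", on_key)
print("max shift on orthogonal input:", off_key_shift)
print("entries changed:", changed, "of", d * d)
```

In this toy setting only the column of `W` aligned with `k` is modified, so everything the matrix encodes along other directions remains in place. Whether and how the original fact remains recoverable in a real LLM is the question the paper studies; this sketch makes no claim about that.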
