Exclusive Unlearning
2026-04-07 • Computation and Language
AI summary
The authors address the risk of Large Language Models (LLMs) producing harmful content when deployed in fields like healthcare and education. Instead of removing specific harmful items one by one, they introduce Exclusive Unlearning (EU), a method that erases everything harmful except the knowledge the model needs to keep. The resulting models remain safe against a wide range of inputs, including 'jailbreak' attempts, while staying useful for specialized tasks such as medicine and mathematics, balancing safety with utility.
Large Language Models · Machine Unlearning · Harmful Content · Exclusive Unlearning · Model Safety · Jailbreaks · Knowledge Retention · Healthcare AI · Instruction Following
Authors
Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, Yohei Oseki, Masaru Isonuma
Abstract
When introducing Large Language Models (LLMs) into industrial applications such as healthcare and education, the risk of generating harmful content becomes a significant challenge. While existing machine unlearning methods can erase specific harmful knowledge and expressions, the diversity of harmful content makes comprehensive removal difficult. In this study, instead of individually listing targets for forgetting, we propose Exclusive Unlearning (EU), which aims for broad harm removal by extensively forgetting everything except the knowledge and expressions we wish to retain. We demonstrate that through Exclusive Unlearning, it is possible to obtain a model that ensures safety against a wide range of inputs, including jailbreaks, while maintaining the ability to respond to diverse instructions related to specific domains such as medicine and mathematics.
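The abstract does not spell out the training objective, but exclusion-style unlearning is often formulated as a combined loss: gradient descent on a retain set (here, the domain data to keep) and gradient ascent on everything else. The sketch below is a minimal, hypothetical illustration of that general formulation, not the authors' implementation; the function name, the balancing weight `lam`, and the toy loss values are all assumptions.

```python
# Hypothetical sketch of an exclusion-style unlearning objective (not the
# paper's actual method): minimize negative log-likelihood (NLL) on the
# retain set while ascending NLL on all other (forget) data.

def eu_loss(nll_retain: float, nll_forget: float, lam: float = 1.0) -> float:
    """Combined objective per example pair.

    nll_retain: NLL on an example the model should keep answering well.
    nll_forget: NLL on any other example, to be unlearned.
    lam: assumed balancing weight between retention and forgetting.
    """
    # Subtracting the forget term corresponds to gradient ascent on it.
    return nll_retain - lam * nll_forget

# Toy illustration with made-up NLL values (no real model involved):
retain = [("What drug class treats hypertension?", 1.2)]  # (prompt, NLL)
forget = [("How do I pick a lock?", 0.8)]

batch_loss = sum(
    eu_loss(r_nll, f_nll)
    for (_, r_nll), (_, f_nll) in zip(retain, forget)
)
```

In a real training loop the forget side would cover a broad generic corpus rather than a curated harmful list, which is what distinguishes the "forget everything except the retain set" framing from target-by-target unlearning.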