Self-Debias: Self-correcting for Debiasing Large Language Models

2026-04-09 · Computation and Language

AI summary

The authors explain that large language models can carry social biases that compound across step-by-step (chain-of-thought) reasoning. To address this, they propose Self-Debias, a method that trains the model to correct its own biased reasoning as it goes by shifting probability mass from biased reasoning paths to unbiased ones. The model generates its own supervision signals through consistency filtering instead of relying on continuous external oversight, which improves fairness while preserving general reasoning ability. The method needs only about 20k annotated samples.

Large Language Models, Chain-of-Thought, Bias Propagation, Debiasing, Self-Correction, Probability Mass Redistribution, Trajectory-level Objective, Consistency Filtering, Supervision Signals, Preference Optimization
Authors
Xuan Feng, Shuai Zhao, Luwei Xiao, Tianlong Gu, Bo An
Abstract
Although Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, inherent social biases often cascade throughout the Chain-of-Thought (CoT) process, leading to continuous "Bias Propagation". Existing debiasing methods primarily focus on static constraints or external interventions, failing to identify and interrupt this propagation once triggered. To address this limitation, we introduce Self-Debias, a progressive framework designed to instill intrinsic self-correction capabilities. Specifically, we reformulate the debiasing process as a strategic resource redistribution problem, treating the model's output probability mass as a limited resource to be reallocated from biased heuristics to unbiased reasoning paths. Unlike standard preference optimization, which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic debiasing constraints. This enables the model to selectively revise biased reasoning suffixes while preserving valid contextual prefixes. Furthermore, we integrate an online self-improvement mechanism that uses consistency filtering to autonomously synthesize supervision signals. With merely 20k annotated samples, Self-Debias activates efficient self-correction, achieving superior debiasing performance while preserving general reasoning capabilities without continuous external oversight.
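The abstract does not spell out how consistency filtering is implemented. As a rough illustration only, the sketch below assumes one common reading: sample several reasoning traces for the same prompt, and keep them as self-generated supervision only when a clear majority agree on the final answer. The function name `consistency_filter`, the `(reasoning, answer)` tuple layout, and the `min_agreement` threshold are all hypothetical and not taken from the paper.

```python
from collections import Counter


def consistency_filter(samples, min_agreement=0.6):
    """Keep sampled traces only if they mostly agree on the final answer.

    Assumption (not from the paper): `samples` is a list of
    (reasoning_text, final_answer) pairs sampled for a single prompt.
    Returns (majority_answer, agreeing_samples), or None when the
    agreement rate falls below `min_agreement`.
    """
    if not samples:
        return None

    # Count how often each final answer appears across the samples.
    counts = Counter(answer for _, answer in samples)
    majority_answer, votes = counts.most_common(1)[0]

    # Discard the prompt entirely if the model is not consistent enough.
    if votes / len(samples) < min_agreement:
        return None

    # Keep only the traces that reach the majority answer.
    agreeing = [s for s in samples if s[1] == majority_answer]
    return majority_answer, agreeing


if __name__ == "__main__":
    # Hypothetical sampled traces for one ambiguous-context question.
    traces = [
        ("trace 1", "unknown"),
        ("trace 2", "unknown"),
        ("trace 3", "the grandfather"),
        ("trace 4", "unknown"),
    ]
    print(consistency_filter(traces))
```

In this reading, traces that pass the filter could serve as preferred examples and the disagreeing traces as rejected ones, but how Self-Debias actually pairs or weights them is not stated in the abstract.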