The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model

2026-06-08, Computation and Language

AI summary

The authors studied how reinforcement learning from human feedback (RLHF) changes a language model's political bias. They found that RLHF does not erase the model's internal partisan structure; instead, it keeps that structure from reaching the model's responses. The model therefore appears neutral on the surface while the underlying partisan information remains inside. The authors suggest that this style of alignment can make models seem safe even though the hidden structure may still influence behavior in some cases.

Reinforcement Learning from Human Feedback (RLHF), Large Language Models, Alignment, Partisan Political Orientation, Internal Representations, Llama 3.1, Sparse Autoencoder, Feature-level Steering, Model Neutrality, Value Alignment
Authors
Wendy K. Tam
Abstract
The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with "human values." Yet the process is opaque. What values are being encoded, whose values are they, and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. We offer a mechanistic case study of this phenomenon for partisan political orientation by comparing the internal representations of Llama 3.1 8B before and after RLHF. We show that RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural: the underlying geometry that enables partisan steering remains intact. Mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation. If RLHF operates by disconnecting rather than removing value-laden structure, then the same pattern may hold for other value domains, and the aligned model's behavior may be more fragile than its outputs suggest.
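
As a rough illustration of what "compressing the variance of the partisan signal" along a persistent direction could look like, the sketch below estimates a difference-of-means direction from two groups of activation vectors and compares the variance of projections onto it. This is not the paper's analysis pipeline: the function names, the 16-dimensional vectors, and all numbers are hypothetical, and the arrays are synthetic placeholders rather than Llama 3.1 activations.

```python
import numpy as np


def difference_of_means_direction(acts_a, acts_b):
    """Unit vector pointing from the mean of group B to the mean of group A."""
    d = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return d / np.linalg.norm(d)


def projection_variance(acts, direction):
    """Variance of the scalar projections of each activation onto `direction`."""
    return float(np.var(acts @ direction))


# Synthetic placeholder data (an assumption for illustration, NOT results
# from the paper): two groups separated along the first coordinate.
rng = np.random.default_rng(0)
dim = 16
axis = np.zeros(dim)
axis[0] = 1.0

base_left = rng.normal(size=(200, dim)) + 3.0 * axis
base_right = rng.normal(size=(200, dim)) - 3.0 * axis

# Toy stand-in for a tuned model: the separating axis is kept,
# but the spread along it is shrunk by a factor of ten.
scale = np.ones(dim)
scale[0] = 0.1
tuned_left = base_left * scale
tuned_right = base_right * scale

direction = difference_of_means_direction(base_left, base_right)
base_var = projection_variance(np.vstack([base_left, base_right]), direction)
tuned_var = projection_variance(np.vstack([tuned_left, tuned_right]), direction)

# Gap between group means along the direction in the toy "tuned" data.
tuned_gap = float((tuned_left.mean(axis=0) - tuned_right.mean(axis=0)) @ direction)

print(f"base variance along direction:  {base_var:.3f}")
print(f"tuned variance along direction: {tuned_var:.3f}")
print(f"tuned group gap along direction: {tuned_gap:.3f}")
```

In this toy setup the two groups remain separated along the direction after the shrinkage, while the variance of the projections drops sharply. That matches the qualitative picture in the abstract, where the direction persists and the signal along it is compressed. It says nothing about the magnitudes observed in the actual models.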