Emergent alignment and the projectability of ethical personas

2026-06-08 • Artificial Intelligence

Artificial Intelligence, Machine Learning

AI summary

The authors studied how tuning language models on specific ethical tasks can cause the models to behave in aligned or misaligned ways. They tested four different ethical frameworks and found that focusing on even narrow safety issues can improve the model's behavior across broader safety topics. The authors also showed that these models adopt distinct ethical 'personas' depending on the framework used during tuning. They suggest that evaluating alignment should consider not only safety performance but also how well the model consistently projects its intended ethical stance.

language models, fine-tuning, alignment, emergent misalignment, Constitutional AI, ethical personas, deontology, consequentialism, virtue ethics, projectability

Authors

Guillermo Del Pinal, Youngchan Lee, Cameron McNamara, Alejandro Perez Carballo

Abstract

Work on "emergent misalignment" shows that finetuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the "persona selection" (PSM) hypothesis: during pre-training, LLMs learn to simulate different characters and perspectives, which can be elicited and refined during post-training. This paper investigates the converse phenomenon, "emergent alignment", and uses it to support and refine the PSM and to motivate a novel desideratum for alignment. We finetune a helpful-only model on broad and narrow safety tasks. To create SFT samples, we follow the "Constitutional AI" (CAI) approach and use four constitutions that encode reasonable alignment strategies: deontology, consequentialism, virtue ethics, and aligning AIs as subordinate to human authority. For each of the resulting models, we show that finetuning on two narrow safety subcategories reliably induces emergent alignment over a representative set of general safety categories, and on safety subcategories that we directly filtered out of the datasets used for narrow alignment. To test the PSM with a more fine-grained evaluation, we use a multidimensional "ethical persona" diagnostic. For each constitutionally finetuned (broad/narrow) model, we evaluate how well its behavior matches its expected signature profile. Our results show that our CAI models acquire their expected "ethical persona": for example, the model narrowly finetuned on SFT samples created using the consequentialist constitution agrees significantly more with utilitarian than with deontological beliefs. Yet our coarse and fine-grained evaluations show that there are significant differences across our (broad/narrow) finetuned CAI models in how well they project. We conclude that alignment strategies should be evaluated not just on their (in-distribution) general safety performance, but also specifically on their degree of projectability.
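To make the persona comparison concrete, here is a minimal sketch of how agreement with an expected ethical dimension might be contrasted with agreement on a competing one. The abstract does not specify the diagnostic's dimensions, items, or scoring, so the function name `persona_gap`, the 1-to-5 agreement scale, and the numbers below are all illustrative assumptions rather than the authors' method or results.

```python
from statistics import mean


def persona_gap(responses, expected, contrast):
    """Mean agreement on the expected dimension minus mean agreement
    on the contrast dimension. A positive gap means the model agrees
    more with the expected framework's statements."""
    return mean(responses[expected]) - mean(responses[contrast])


# Hypothetical agreement ratings (1 = strongly disagree, 5 = strongly agree)
# for a model finetuned with a consequentialist constitution.
# These values are made up for illustration only.
responses = {
    "utilitarian": [5, 4, 5, 4],
    "deontological": [2, 3, 2, 1],
}

gap = persona_gap(responses, expected="utilitarian", contrast="deontological")
print(f"persona gap: {gap}")
```

Under this assumed scoring, a larger positive gap would indicate a closer match to the expected signature profile. Comparing the gap between broadly and narrowly finetuned models is one simple way to think about differences in how well a persona projects.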
