IHDec: Divergence-Steered Contrastive Decoding for Securing Multi-Turn Instruction Hierarchies

2026-06-29

Computation and Language
AI summary

The authors studied how large language models struggle to follow the right instructions when given multiple inputs with different priority levels, often listening to less important instructions instead of the main ones. They created a method called IHDec that detects when the model is ignoring the correct instruction hierarchy and then adjusts the decoding process to fix this problem without needing to retrain the model. Their tests show that IHDec works better than retraining methods for maintaining instruction priorities in conversations and also improves safety against tricky inputs. The method works especially well with bigger models.

Large Language Models, Instruction Hierarchy, Multi-turn Context, Jensen-Shannon Divergence, Contrastive Decoding, Prompt Injection, Role-level Priorities, Adversarial Prompts, Model Scaling
Authors
Nicole Geumheon Liu, Haeun Jang, Yonghyun Jun, Hwanhee Lee
Abstract
Large Language Models (LLMs) often fail to maintain instruction hierarchies (IH) when processing multi-source inputs with varying role-level priorities, paradoxically adhering to lower-priority directives during conflicts. While existing defenses mitigate this issue, they are largely restricted to single-turn scenarios and require expensive fine-tuning. In this paper, we formalize this failure mode in multi-turn contexts via a Jensen-Shannon Divergence (JSD) framework, uncovering a pervasive role-influence inversion phenomenon in which subordinate inputs override superior roles. To rectify this without training, we propose IHDec (Instruction Hierarchy-steered Decoding). IHDec leverages JSD to automatically detect token-level hierarchy violations and dynamically applies contrastive decoding to suppress misaligned subordinate roles. Extensive evaluations demonstrate that IHDec outperforms training-based baselines in multi-turn conflicts while fully preserving general response quality. Furthermore, IHDec strengthens safety against adversarial prompt injections and exhibits a robust scaling synergy with larger models. The code is available at https://github.com/nxcolelxu/IHDec.git
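
The abstract does not spell out the exact formulation, so the following is only a minimal sketch of what JSD-gated contrastive decoding could look like at a single decoding step. It is not the authors' implementation. Everything here is an assumption made for illustration: measuring a role's influence as the JSD between the full-context next-token distribution and the distribution obtained with that role removed, flagging a violation when the lower-priority role's influence exceeds the higher-priority role's, and the hypothetical names `steer_logits` and `alpha`.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in nats: symmetric, bounded by ln 2."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def steer_logits(logits_full, logits_no_high, logits_no_low, alpha=1.5):
    """Hypothetical single-step steering rule (an assumption, not IHDec itself).

    logits_full    -- next-token logits given the whole conversation
    logits_no_high -- logits with the higher-priority role's input removed
    logits_no_low  -- logits with the lower-priority role's input removed
    alpha          -- assumed contrast strength; alpha > 1 extrapolates
                      past the "no low-priority input" distribution

    Returns (logits, violated), where violated reports whether the gate fired.
    """
    p_full = softmax(logits_full)
    influence_high = jsd(p_full, softmax(logits_no_high))
    influence_low = jsd(p_full, softmax(logits_no_low))

    # Assumed gate: an inversion is when the subordinate role moves the
    # next-token distribution more than the superior role does.
    if influence_low <= influence_high:
        return list(logits_full), False

    # Contrastive adjustment: move away from the subordinate role's
    # contribution, i.e. along (logits_no_low - logits_full).
    adjusted = [
        zf + alpha * (zn - zf) for zf, zn in zip(logits_full, logits_no_low)
    ]
    return adjusted, True
```

In this sketch, tokens are left untouched whenever the gate does not fire, which is one simple way to keep ordinary responses unchanged; the actual detection criterion and contrast rule should be taken from the paper and its repository.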