Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts
2026-06-22 • Computation and Language
Computation and Language, Artificial Intelligence
AI summary
The authors study how small language models used by the Swiss Federal Supreme Court sometimes refuse to process texts about violent or sexual crimes because of strict safety rules (called guardrails). They created a test set called TF-RefusalBench, with 5,200 prompts in French, German, Italian, and English, to see when and why models refuse or add disclaimers. Their findings show that refusals depend on the model, the language, and the prompt, and that counting refusals alone does not capture the full effect, because disclaimers also reduce how faithfully the task is carried out. They also tested ways to reduce refusals, finding that prompting can help and that ablating the model's refusal direction (abliteration) eliminates refusals with minimal impact on task performance.
Large Language Models, Swiss Federal Supreme Court, Criminal Law, Model Guardrails, Over-alignment, Translation, Summarization, TF-RefusalBench, On-premises Models, Prompting
Authors
Arthur Wuhrmann, Gaetan Stein, Daniel Brunner, Andrei Kucharavy
Abstract
While the wider applicability of LLMs in the legal field is currently debated due to their reliability and the gravity of any errors, narrow uses with well-understood and mitigated risks have emerged. Notably, the Swiss Federal Supreme Court uses small on-premises models for tentative translations and short-passage summarization across the four official languages. However, such usage is challenging in the context of criminal law. Because the rulings and cases that employees routinely work on can contain detailed descriptions of violent and sexual offenses, their legitimate work is compromised by refusals and disclaimers caused by the activation of model guardrails (over-alignment). To measure this phenomenon, we introduce TF-RefusalBench, a multilingual benchmark for criminal-law translation and summarization derived from public Swiss Supreme Court rulings. TF-RefusalBench contains 5,200 prompts in total across French, German, Italian, and English, corresponding to common task prompts and passages likely to trigger refusal. We then use TF-RefusalBench to show that over-alignment is a multifaceted phenomenon, influenced by the model and by the languages of both the prompt and the text being processed, and that its impact cannot be evaluated solely from an over-refusal perspective, given the impact of disclaimers on task faithfulness. Finally, we evaluate approaches to enable on-premises LLMs for criminal-law tasks, demonstrating that while prompting can be effective, abliteration (refusal direction ablation) eliminates refusal with minimal impact on task performance.
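As a rough illustration of what refusal direction ablation means geometrically, the sketch below estimates a direction as the normalized difference between mean activations on refused and complied prompts, then projects that direction out of a vector. The abstract does not specify the authors' procedure, so the difference-of-means estimate, the function names, and the toy two-dimensional activations are all assumptions made for illustration only.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(refused_acts, complied_acts):
    """Unit vector pointing from the complied mean to the refused mean.

    Assumption: a difference-of-means estimate; the paper may use another.
    """
    refused_mean = mean_vector(refused_acts)
    complied_mean = mean_vector(complied_acts)
    diff = [r - c for r, c in zip(refused_mean, complied_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("means coincide; no direction to ablate")
    return [x / norm for x in diff]


def ablate(vector, direction):
    """Remove the component of `vector` along the unit-length `direction`."""
    dot = sum(v * d for v, d in zip(vector, direction))
    return [v - dot * d for v, d in zip(vector, direction)]


# Toy activations (hypothetical): refusals differ only along the first axis.
refused = [[2.0, 1.0], [4.0, 1.0]]
complied = [[0.0, 1.0], [0.0, 1.0]]

direction = refusal_direction(refused, complied)
cleaned = ablate([5.0, 2.0], direction)
```

In this toy case the estimated direction is the first axis, so ablation zeroes the first coordinate and leaves the second untouched. In a real model the same projection would be applied to hidden states or folded into weight matrices, which is beyond what this sketch shows.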