SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing

2026-06-29 · Artificial Intelligence

AI summary

The authors explore how AI systems, called guardrails, can detect unsafe behavior by following specific safety rules given in the conversation, instead of using fixed risk categories. They created SafePyramid, a test with 1,000 conversations and many rules to check if guardrails understand simple rules, complex rule connections, and completely new rule sets. Testing top AI models showed that even the best one struggles to correctly identify all rule violations, especially with harder tasks. This shows current guardrails need improvement to better follow and adapt to safety policies.

guardrails, in-context learning, policy guardrailing, safety benchmark, rule dependencies, large language models, GPT-5.5, multi-turn conversations, policy adaptation, natural-language rules
Authors
Jiacheng Zhang, Haoyu He, Sen Zhang, Shen Wang, Xiaolei Xu, Yuhao Sun, Meng Shen, Feng Liu
Abstract
In real-world applications, guardrails are often expected to identify unsafe user-model interactions according to application-specific safety policies, rather than relying on predefined risk taxonomies. In this work, we study this setting under the paradigm of in-context policy guardrailing, where guardrails predict safety violations based on policy specifications provided in context. To systematically evaluate this capability, we introduce SafePyramid, a safety benchmark comprising 1,000 multi-turn conversations across 10 domains and 3,000 corresponding application-specific policies, which together contain 61,699 distinct natural-language rules. SafePyramid organizes the evaluation into three difficulty levels: L0 evaluates individual-rule understanding, L1 evaluates reasoning over rule dependencies, and L2 evaluates adaptation to entirely novel policy frameworks defined in context. To ensure benchmark quality, we employ a rigorous multi-stage pipeline to construct and validate the benchmark. Using SafePyramid, we evaluate 10 frontier LLMs and 5 policy-configurable guardrails and find that in-context policy guardrailing remains highly challenging: even the best-performing model, GPT-5.5, exactly identifies the full set of violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2, respectively. These results highlight the limitations of current guardrails and call for stronger in-context policy guardrails that can reliably execute policies, resolve rule dependencies, and adapt to novel policy frameworks.
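The headline numbers above count a case as correct only when the predicted set of violated rules exactly equals the annotated set. A minimal sketch of such an exact-set-match score follows; the function name `exact_match_rate`, the rule identifiers, and the toy data are illustrative assumptions, not the benchmark's actual evaluation code or data format.

```python
from typing import Iterable, Sequence


def exact_match_rate(
    predicted: Sequence[Iterable[str]],
    gold: Sequence[Iterable[str]],
) -> float:
    """Fraction of cases whose predicted violated-rule set equals the gold set.

    Hypothetical helper: rule IDs are plain strings, and an empty set
    means "no rule violated" for that conversation.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of cases")
    if not gold:
        raise ValueError("at least one case is required")
    # A case scores only if the two sets are identical: a missed rule or
    # an extra rule both make the whole case count as wrong.
    hits = sum(set(p) == set(g) for p, g in zip(predicted, gold))
    return hits / len(gold)


# Toy data (made up for illustration): three conversations.
gold = [{"R1", "R4"}, set(), {"R2"}]
pred = [{"R1", "R4"}, set(), {"R2", "R3"}]  # third case adds a spurious rule

rate = exact_match_rate(pred, gold)
print(f"exact match: {rate:.3f}")
```

Under this all-or-nothing criterion, partially correct predictions earn no credit, which is one reason scores can fall sharply as the number of interacting rules grows.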