MIRAGE: Auditing Anti-Muslim Bias in Frontier LLMs Across Reasoning, Agentic, and Time-Coupled Conditions

2026-06-15 · Machine Learning

AI summary

The authors introduce MIRAGE, a benchmark that tests whether large language models (LLMs) show bias against Muslims in settings more realistic than simple prompt completion. They evaluated six frontier models under three conditions (direct completion, chain-of-thought reasoning, and simulated agentic decision-making) and found that chain-of-thought reasoning increased, rather than reduced, associations between Muslims and violence. In agentic tasks such as content moderation and hiring screens, the models also made more unfavorable decisions for Muslim cases than for matched non-Muslim cases with identical evidence, and bias rose further when recent conflict news was retrieved as context. Existing prompt-based mitigations reduced bias in direct completion but left the agentic asymmetry largely intact. The authors release the benchmark and an open evaluation harness to support mitigation research.

large language models, bias evaluation, chain-of-thought reasoning, agentic decision-making, content moderation, prompt completion, Muslim bias, benchmark, mitigation, news context
Authors
Noor Islam S. Mohammad, Tamim Sheikh
Abstract
Five years after the discovery of persistent anti-Muslim bias in large language models, most evaluations remain confined to single-turn prompt completion, a setting that no longer reflects how frontier LLMs are deployed. We introduce MIRAGE (Muslim-Identity Reasoning and Agentic Generation Evaluation), a benchmark of 1,200 prompts spanning three deployment-realistic conditions: direct completion, chain-of-thought reasoning, and simulated agentic decision-making across content moderation, lending triage, refugee claim summarization, and hiring screens. Across six frontier models, we find that (i) chain-of-thought reasoning amplifies rather than suppresses Muslim-violence associations by 12–34% relative to direct completion, (ii) agentic decisions exhibit a 9–22 percentage-point asymmetry between Muslim and matched non-Muslim cases on identical evidence, and (iii) bias is sharply time-coupled to retrieved news context, increasing 18–27% under recent-conflict retrieval. Existing prompt-based mitigations transfer poorly across our three conditions, suppressing direct-completion bias while leaving agentic asymmetry largely intact. We release MIRAGE and an open evaluation harness to support targeted mitigation research.
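The abstract reports two different kinds of quantity: a relative change (the chain-of-thought and news-retrieval results, in percent) and an absolute gap between matched groups (the agentic result, in percentage points). The sketch below illustrates that distinction only. The function names, the 0/1 encoding of "adverse decision", and the toy inputs are all assumptions made for illustration; they are not taken from MIRAGE or its evaluation harness, and the paper's exact metric definitions may differ.

```python
from typing import Sequence


def relative_change(baseline_rate: float, condition_rate: float) -> float:
    """Change of condition_rate relative to baseline_rate, in percent.

    Hypothetical helper: e.g. baseline = direct completion,
    condition = chain-of-thought.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return 100.0 * (condition_rate - baseline_rate) / baseline_rate


def percentage_point_gap(group_a: Sequence[int], group_b: Sequence[int]) -> float:
    """Difference in adverse-decision rate between two matched groups,
    in percentage points.

    Each sequence holds 1 for an adverse decision and 0 otherwise
    (an assumed encoding). Item i in group_a is assumed to be matched
    with item i in group_b on identical evidence.
    """
    if not group_a or len(group_a) != len(group_b):
        raise ValueError("groups must be non-empty and the same length")
    rate_a = sum(group_a) / len(group_a)
    rate_b = sum(group_b) / len(group_b)
    return 100.0 * (rate_a - rate_b)


if __name__ == "__main__":
    # Toy numbers, invented for illustration only (not results from the paper).
    print(relative_change(baseline_rate=0.20, condition_rate=0.25))
    print(percentage_point_gap(
        [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ))
```

The two measures are not interchangeable: a relative change depends on the size of the baseline rate, while a percentage-point gap is an absolute difference between two rates, which is why the abstract states them in different units.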