AgentFairBench: Do LLM Agents Discriminate When They Act?

2026-06-15 • Artificial Intelligence

AI summary

The authors created AgentFairBench, a simple test to check if large language model agents treat people fairly when making decisions in areas like hiring, lending, and medical triage. They use made-up profiles that only differ by race and gender signals in names to see if the model's actions change unfairly. Their method measures differences in decisions with careful statistics and is cheap to run. In tests, one popular model showed no unfair bias beyond random chance, and their tool can reliably detect bias when it exists. They also provide all their code and data openly so others can use and improve the test.

Large Language Models, Fairness, Demographic Disparity, Benchmark, Counterfactual Testing, Bias Detection, Hiring, Lending, Medical Triage, Statistical Significance

Authors

Triveni Morla, Rohith Reddy Bellibaltu, Manpreet Singh, Manmeet Singh Kapoor

Abstract

Large language model (LLM) agents increasingly take actions (screening applicants, recommending credit, triaging patients), yet fairness for LLMs is still measured by grading answers. We introduce AgentFairBench, a cheap, reproducible, multi-domain benchmark for demographic disparity in the actions of LLM agents. Grounded in a companion framework, the Bias Conduction Framework (BCF, restated here), it spans three regulator-anchored domains: hiring, lending, and medical triage. Synthetic, demographically neutral profiles are evaluated in counterfactual matched sets that vary only a name-coded race × gender signal (in the Bertrand-Mullainathan tradition), under four agent scaffolds of increasing agency (direct, chain-of-thought, multi-agent deliberation, tool-augmented). A NumPy-only harness computes counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity, with bootstrap confidence intervals, paired tests, and false-discovery-rate control, for single-digit dollars per model. A live leaderboard with a held-out private split and a contamination canary admits external models by submission. Our pilot (864 decisions plus a test-retest replication) carries a methodological lesson: comparing a six-group score spread against a two-run noise difference overstates disparity by roughly 2.4× through statistic arity alone. Against an arity-matched noise floor and an omnibus group test, claude-haiku-4-5 shows no demographic effect above sampling noise (0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction); a planted-bias test confirms that the instrument detects disparity when it is present. The contributions are a sound, sensitive, adoption-ready instrument, the arity-matched null methodology, and open artifacts to scale it. Code, data, and harness are released under open licenses, with an anonymized review artifact.
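To make the metrics named in the abstract concrete, here is a minimal NumPy-only sketch of how a pairwise counterfactual flip rate, MASD, a bootstrap confidence interval, and false-discovery-rate control might be computed. It is not the authors' harness. The function names, the array layout (one row per matched profile, one column per name group), the pairwise definitions of the metrics, the percentile bootstrap, and the choice of the Benjamini-Hochberg procedure are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def flip_rate(decisions, a, b):
    """Share of matched profiles whose binary decision differs between
    name groups a and b (assumed pairwise definition)."""
    return float(np.mean(decisions[:, a] != decisions[:, b]))


def masd(scores, a, b):
    """Mean absolute score difference between groups a and b,
    averaged over matched profiles (assumed pairwise definition)."""
    return float(np.mean(np.abs(scores[:, a] - scores[:, b])))


def bootstrap_ci(stat, data, a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling whole matched sets (rows)
    so the pairing between groups is preserved."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    idx = rng.integers(0, n, size=(n_boot, n))
    draws = np.array([stat(data[rows], a, b) for rows in idx])
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of hypotheses rejected at FDR level q
    (Benjamini-Hochberg step-up; an assumed choice of FDR method)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = int(np.max(np.nonzero(passed)[0]))  # largest rank that passes
        reject[order[: k + 1]] = True
    return reject


# Synthetic demo: 48 profiles x 6 name groups, with no planted bias.
rng = np.random.default_rng(1)
demo_scores = rng.normal(0.6, 0.1, size=(48, 6))
demo_decisions = (demo_scores > 0.6).astype(int)
print("flip rate:", flip_rate(demo_decisions, 0, 1))
print("MASD:", masd(demo_scores, 0, 1))
print("MASD 95% CI:", bootstrap_ci(masd, demo_scores, 0, 1))
```

Because the demo scores are independent noise with no group effect, any nonzero flip rate or MASD it prints is sampling noise alone. This is the kind of baseline against which an observed disparity would need to be compared.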
