AI-Assisted Systematization for Evaluating GenAI Systems

2026-05-25 • Computation and Language

Computation and Language • Artificial Intelligence • Computers and Society

AI summary

The authors argue that generative AI systems are hard to evaluate because target concepts such as "reasoning" or "fairness" are broad and contested, which leaves it unclear what should be measured. They identify a missing step called systematization: turning a broad concept into an explicit, structured account in measurable terms. Because this step is demanding and resource-intensive, they explore whether AI can assist with it, introducing a structured representation of a systematized concept (a concept spec), a validation worksheet, and two AI-assisted systematizers, one zero-shot and one multi-agent. They apply these systematizers to two concepts, hate-based rhetoric and digital empathy, and evaluate the resulting concept specs on content validity and information recoverability.

generative AI, evaluation, systematization, concept specification, zero-shot, multi-agent systems, content validity, information recoverability, hate-based rhetoric, digital empathy

Authors

Dhruv Agarwal, Emily Sheng, Chad Atalla, Jean Garcia-Gathright, Hussein Mozannar, Hannah Washington, Alexandra Chouldechova, Solon Barocas, Hanna Wallach

Abstract

Evaluating generative AI (GenAI) systems is challenging because many targets of evaluation are broad, contested concepts, such as "reasoning," "fairness," or "creativity." When these concepts are left underspecified, it becomes unclear what should be measured or how evaluation results should be interpreted. This problem reflects a missing step: systematization, that is, moving from a broad background concept to an explicit, structured account of the concept in measurable terms. Because systematization is cognitively demanding and resource-intensive, we investigate whether AI assistance can support this process. To enable AI-assisted systematization and assess its quality, we introduce a structured representation of a systematized concept (a concept spec) and a validation worksheet. We then develop two AI-assisted systematizers: a direct, zero-shot approach and a multi-agent approach that more closely mirrors manual systematization approaches from existing literature. We use these systematizers to produce concept specs for two concepts, hate-based rhetoric and digital empathy, and evaluate the resulting concept specs on content validity and information recoverability.
