Litmus: Zero-Label, Code-Driven Metric Specification for Evaluating AI Systems
2026-06-22 • Artificial Intelligence
AI summary
The authors created Litmus, a system that automatically figures out what to measure when evaluating AI pipelines by looking at the code and asking targeted questions, without needing labeled data. Instead of just applying fixed metrics, Litmus identifies what's important to check and builds a set of evaluation metrics tailored to each stage of the AI process. They tested Litmus on three real AI applications and found it covered more concerns, had less overlap in metrics, and matched quality labels better than other methods. This work suggests that evaluation should start by defining what matters, not just by choosing metrics blindly.
Agentic LLM systems, Evaluation metrics, AI pipelines, Zero-label evaluation, Metric specification, Automatic metric design, Spearman correlation, Scientific QA, Financial account grouping, Risk assessment
Authors
Prajjwal Gupta, Prasang Gupta, Vishal Bhutani, Apoorva Sharma, Sumanth Chundru, Waqar Sarguroh, Kevin Paul
Abstract
As agentic LLM systems move from prototypes to deployment across increasingly diverse domains, evaluating them has become both more important and more difficult. The challenge is not only that individual metrics may be unreliable, but that evaluation goals are often left implicit. Without a clear account of what a system is expected to do, how it can fail, and which failures matter, metric choices become difficult to justify, interpret, or validate. We present Litmus, a zero-label system that designs evaluation and monitoring metrics for AI pipelines by eliciting evaluation intent from source code and targeted interrogation. Instead of assuming that the evaluation target is already known, Litmus first identifies what must be measured and why, then converts those answers into constraints for constructing a justified, per-stage metric portfolio. We evaluate Litmus on three real, code-defined AI pipelines (financial account grouping, scientific QA, and inherent risk assessment) against AutoMetrics and three DynamicRubric baselines. Litmus achieves the broadest or tied-broadest concern coverage, spans more pipeline stages, produces a near-zero-redundancy portfolio, and ranks first in validity against per-row quality labels on all three pipelines: decisively on scientific QA (Spearman $\rho = 0.72$ vs. less than $0.47$ for every baseline), and within overlapping confidence intervals in relation to two components of the audit framework despite using no labels during metric design. Our results support a shift from automatic metric implementation to automatic metric specification: before asking which metric to compute, evaluation systems should ask what must be measured and why.
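The validity figures above are Spearman rank correlations between a metric's scores and per-row quality labels. As a rough illustration of that statistic only, the following sketch computes Spearman's $\rho$ as the Pearson correlation of tie-averaged ranks. The function names and the sample data are hypothetical and are not taken from the paper or from the Litmus implementation.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical example: one metric's score per row, and a human quality
# label per row (illustrative values, not results from the paper).
metric_scores = [0.91, 0.40, 0.75, 0.62, 0.15]
quality_labels = [5, 2, 4, 4, 1]

rho = spearman_rho(metric_scores, quality_labels)
print(f"Spearman rho = {rho:.3f}")
```

Because the statistic depends only on rank order, a metric can score well here without being calibrated to the label scale. It only has to order rows the way the labels do.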