Agent trajectories as programs: fingerprinting and programming coding-agent behavior

2026-06-15 · Software Engineering

Software Engineering, Machine Learning
AI summary

The authors introduce a way to compare AI coding agents by their characteristic behavior patterns, called fingerprints, rather than by overall scores alone. They tested ten agents, and a probe over these fingerprints attributed an unseen trajectory to the correct agent 85.7% of the time, with controls for leakage across tasks. Their method uses an emergent vocabulary induction technique to represent how agents solve problems in a form that is compact yet still reveals each model's quirks. They also introduce ProcGrep, a library for auditing how agents work step by step from their traces, which can help developers who manage AI coding assistants. The authors suggest this approach gives a deeper view of AI behavior than success rates alone.

AI agents, behavioral fingerprints, procedural representations, emergent vocabulary induction, trajectory analysis, Jensen-Shannon divergence, agent evaluation, distilled models, SWE-Bench dataset, ProcGrep
Authors
Hamidah Oderinwale
Abstract
Benchmark scores tell you what an agent got right; they do not tell you how it got there. In this work, we introduce methods for comparing agents procedurally across contexts in which the model, tasks, and approaches vary. We compare ten agents and find that they are identifiable by their behavioral habits, which we call fingerprints: a probe over these procedural signatures attributes an unseen trajectory to the correct agent with 85.7% accuracy, controlling for leakage across tasks. We build procedural representations of agent problem-solving with an emergent vocabulary induction technique designed to be maximally compressive, so as to abstract away surface-level variation, while remaining expressive enough to reveal the quirks of each model's patterns. We apply our framework to the software engineering evaluation dataset SWE-Bench to study the structural distinctness of agent trajectories and find that behavior is most similar between models from similar release periods and between models related by distillation (e.g., a distilled student model and its teacher have a Jensen-Shannon divergence of 0.25, about half the distance between other model pairs). As more models saturate evaluations, we believe it will be important to probe model behavior along more holistic dimensions than success rates alone. We introduce ProcGrep, a library for top-down auditing and evaluation of how agents approach tasks at a procedural level, given their traces. We believe this work has a range of applications that help developers work with and program coding agents, such as task-aware model routing, agent monitoring, and finer-grained cost analysis.
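
To make the divergence comparison concrete, the following is a minimal sketch of how a Jensen-Shannon divergence between two agents' behavior distributions could be computed, and how an unseen trajectory might be attributed to the nearest fingerprint. It is not the authors' method or the ProcGrep API: the four-token vocabulary, the agent names, the example trajectories, the base-2 logarithm, and the nearest-fingerprint rule are all assumptions made for illustration. The paper's representations come from an induced vocabulary and a trained probe, neither of which is reproduced here.

```python
import math
from collections import Counter

# Hypothetical step vocabulary; the paper induces its own vocabulary from traces.
VOCAB = ["read", "search", "edit", "test"]


def distribution(tokens, vocab=VOCAB):
    """Return the normalized frequency of each vocabulary token in a trajectory."""
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocab)
    if total == 0:
        raise ValueError("trajectory contains no tokens from the vocabulary")
    return [counts[t] / total for t in vocab]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (assumed base 2, so it lies in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def attribute(trajectory, fingerprints, vocab=VOCAB):
    """Return the agent whose fingerprint is closest to the trajectory.

    A nearest-fingerprint rule standing in for the paper's probe (an assumption).
    """
    p = distribution(trajectory, vocab)
    return min(fingerprints, key=lambda name: js_divergence(p, fingerprints[name]))


# Made-up trajectories for two hypothetical agents.
fingerprints = {
    "agent_a": distribution(["read", "read", "edit", "test"]),
    "agent_b": distribution(["search", "search", "search", "edit"]),
}

unseen = ["read", "edit", "test", "read", "test"]
predicted = attribute(unseen, fingerprints)
pairwise = js_divergence(fingerprints["agent_a"], fingerprints["agent_b"])
```

Note that some libraries report the Jensen-Shannon distance, which is the square root of the divergence, so a value such as 0.25 should be compared only against numbers computed under the same convention and logarithm base.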