An Agentic Approach Towards Replication Package Quality Evaluation

2026-06-01 | Software Engineering

AI summary

The authors explore using automated agents to check the quality of research materials shared alongside software engineering studies. They derived explicit criteria from open science guidelines and built agents that evaluate these replication packages against them. Their preliminary tests showed that the agents give consistent and mostly accurate assessments, especially for structural checks such as code, environment setup, and artifact availability, but have trouble with qualitative or mixed-method studies. Surveyed researchers found the tool useful but noted the mental effort required in the planning step, where a human works with the agent. Overall, the study suggests that automated tools can take over selected routine reproducibility checks for authors and reviewers.

Reproducibility, Artifact evaluation, Replication package, Open science guidelines, Automated assessment, Multi-agent system, Software engineering research, Qualitative studies, Machine-verifiable criteria
Authors
Maximilian Alexander Amougou Mbida, Florian Angermeir
Abstract
Reproducibility in empirical software engineering relies on complete, accessible, and reusable research artifacts, yet artifact evaluation remains largely manual and difficult to scale. This emerging results paper explores an agentic approach for assessing replication package quality by translating open-science guidelines into machine-verifiable criteria. We consolidate 380 requirements from 34 sources into 51 reproducibility criteria, of which 31 are operationalized for automated artifact-based evaluation. Based on these criteria, we implement a multi-agent prototype that automatically inspects replication packages and produces evidence-grounded improvement reports. A preliminary evaluation on five replication packages shows high inter-run consistency of 91.4% and a correctness of 75.4%, measured as micro-averaged agreement with a manual baseline. The agent performs best on structural criteria such as code, environment, and artifact availability, but struggles with qualitative or mixed-method studies. A pilot survey with seven software engineering researchers indicates positive perceived usefulness and adoption potential, while revealing cognitive load in the human-in-the-loop planning step. Overall, these emerging results indicate that agentic research artifact evaluation has the potential to support authors and reviewers by automating selected routine checks.
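The two evaluation figures can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the definitions here are assumptions: micro-averaged agreement is taken as the share of all pooled (package, criterion) verdicts that match the manual baseline, and inter-run consistency as the share of verdicts that are identical across repeated runs. The names `micro_agreement` and `inter_run_consistency`, the verdict labels, and the sample data are hypothetical.

```python
from typing import Dict, List

# Hypothetical layout: package name -> {criterion id -> verdict label}.
Verdicts = Dict[str, Dict[str, str]]


def micro_agreement(agent: Verdicts, manual: Verdicts) -> float:
    """Pool every (package, criterion) pair and return the matching fraction.

    Micro-averaging counts each verdict once, so packages with more
    applicable criteria weigh more than packages with fewer.
    """
    matches = total = 0
    for package, baseline in manual.items():
        for criterion, expected in baseline.items():
            total += 1
            if agent.get(package, {}).get(criterion) == expected:
                matches += 1
    return matches / total if total else 0.0


def inter_run_consistency(runs: List[Verdicts]) -> float:
    """Fraction of (package, criterion) pairs with the same verdict in all runs.

    Assumption: the first run defines which pairs are considered.
    """
    if not runs:
        return 0.0
    stable = total = 0
    for package, verdicts in runs[0].items():
        for criterion in verdicts:
            total += 1
            labels = {run.get(package, {}).get(criterion) for run in runs}
            if len(labels) == 1:
                stable += 1
    return stable / total if total else 0.0


# Made-up illustration data, not taken from the paper.
manual = {"pkg_a": {"C1": "met", "C2": "unmet", "C3": "met"}, "pkg_b": {"C1": "met"}}
run_1 = {"pkg_a": {"C1": "met", "C2": "met", "C3": "met"}, "pkg_b": {"C1": "met"}}
run_2 = {"pkg_a": {"C1": "met", "C2": "unmet", "C3": "met"}, "pkg_b": {"C1": "met"}}

print(micro_agreement(run_1, manual))         # 3 of 4 verdicts match
print(inter_run_consistency([run_1, run_2]))  # 3 of 4 verdicts are stable
```

Under these assumptions, a macro-averaged variant would instead compute agreement per package and then average the five package scores, which would give small packages the same weight as large ones.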