Can AI agents assess the reproducibility of research findings? Our #ACL2025 paper shows they fall short. We introduce REPRO-Bench, a new benchmark that evaluates agents on real-world social science reproducibility tasks built from 112 papers, each with the full PDF, code, and data. Our highest-performing agent scores <40%! 1/6
Links here and thread below: Paper: Code: Substack: Medium: 2/6
REPRO-Bench consists of 112 task instances, each built from real-world reproducibility efforts sourced from mass reproduction projects, I4R, Retraction Watch, and reproducibility attempts posted on Twitter/X. Each task includes a paper PDF, reproduction code & data, and a list of major findings. 3/6
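For intuition, here is a minimal Python sketch of what one task instance might look like, assuming only the three components named above (paper PDF, reproduction code & data, list of major findings); the field names and paths are illustrative, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ReproTaskInstance:
    # Hypothetical structure for a single REPRO-Bench task instance.
    paper_pdf: str                  # path to the paper's full PDF
    reproduction_package: str       # path to the released reproduction code & data
    major_findings: list[str] = field(default_factory=list)  # findings the agent must assess

# Illustrative example (paths and finding text are made up):
task = ReproTaskInstance(
    paper_pdf="papers/example_study.pdf",
    reproduction_package="packages/example_study/",
    major_findings=["Treatment X increases outcome Y by ~5 percentage points."],
)
```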
We evaluate 3 popular agents on REPRO-Bench. The best-performing agent, CORE-Agent, achieves only 21.4% accuracy, which is lower than random guessing (25%). 4/6
We introduce REPRO-Agent, an improved agent based on our failure analysis. With just four additional instructions, REPRO-Agent boosts accuracy to 36.6%, a 71% relative improvement over CORE-Agent, at comparable cost. 5/6
This work is joint with @ChuxuanHu, Austin Peters, and others. 6/6