Can AI agents assess the reproducibility of research findings? Our #ACL2025 paper shows they fall short. We introduce REPRO-Bench, a new benchmark that evaluates agents on real-world social science reproducibility tasks built from 112 papers, each with the full PDF, code, and data. Our highest-performing agent scores <40%! 1/6
Links here and thread below: Paper: Code: Substack: Medium: 2/6
REPRO-Bench consists of 112 task instances, each built from real-world reproducibility efforts sourced from mass reproduction projects, I4R, Retraction Watch, and reproducibility attempts posted on Twitter/X. Each task includes a paper PDF, reproduction code & data, and a list of major findings. 3/6
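For intuition, here is a minimal Python sketch of what one task instance might look like, assuming only the three components named above (paper PDF, reproduction code & data, list of major findings); the field names and paths are illustrative, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ReproTaskInstance:
    # Hypothetical structure for a single REPRO-Bench task instance.
    paper_pdf: str                  # path to the paper's full PDF
    reproduction_package: str       # path to the released reproduction code & data
    major_findings: list[str] = field(default_factory=list)  # findings the agent must assess

# Illustrative example (paths and finding text are made up):
task = ReproTaskInstance(
    paper_pdf="papers/example_study.pdf",
    reproduction_package="packages/example_study/",
    major_findings=["Treatment X increases outcome Y by ~5 percentage points."],
)
```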
We evaluate 3 popular agents on REPRO-Bench. The best-performing agent, CORE-Agent, achieves only 21.4% accuracy, which is lower than random guessing (25%). 4/6
We introduce REPRO-Agent, an improved agent based on our failure analysis. With just four additional instructions, REPRO-Agent boosts accuracy to 36.6%, a 71% relative improvement over CORE-Agent, at comparable cost. 5/6
This work is joint with @ChuxuanHu, Austin Peters, and others. 6/6