CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

arXiv - CS - Multiagent Systems. Pub Date: 2024-09-17. DOI: arxiv-2409.11363

Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan



Abstract

AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.
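The task counts and the parallel evaluation described in the abstract can be illustrated with a minimal sketch. It assumes that each of the 90 papers yields one task per difficulty level (90 x 3 = 270), which the abstract implies but does not state outright. The names `Task`, `LEVELS`, `build_tasks`, `evaluate`, and the level labels are hypothetical and are not taken from the CORE-Bench code; the agent is passed in as a plain callable rather than a real agent harness.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical difficulty labels; the abstract only says there are three levels.
LEVELS = ("easy", "medium", "hard")


@dataclass(frozen=True)
class Task:
    paper_id: int
    level: str


def build_tasks(num_papers=90):
    """Assumption: one task per (paper, difficulty level) pair."""
    return [Task(p, lvl) for p in range(num_papers) for lvl in LEVELS]


def evaluate(tasks, run_agent, max_workers=8):
    """Run `run_agent` on every task concurrently and return accuracy per level.

    `run_agent` is any callable that takes a Task and returns True on success.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so outcomes line up with tasks.
        outcomes = list(pool.map(run_agent, tasks))

    counts = {}
    for task, ok in zip(tasks, outcomes):
        passed, total = counts.get(task.level, (0, 0))
        counts[task.level] = (passed + int(ok), total + 1)
    return {lvl: passed / total for lvl, (passed, total) in counts.items()}


if __name__ == "__main__":
    tasks = build_tasks()
    print(len(tasks))  # 270

    # Placeholder agent for illustration only; it does not model any real result.
    def placeholder_agent(task):
        return task.paper_id % 2 == 0

    print(evaluate(tasks, placeholder_agent))
```

Because the tasks are independent of one another, they can be dispatched to a pool of workers instead of being run one after another; this is the general idea behind a parallelizable evaluation, not a description of the authors' actual infrastructure.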


