SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories

Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, Tushar Khot

arXiv - CS - Software Engineering · 2024-09-11 · https://doi.org/arxiv-2409.07440
Abstract
Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for larger-scale development. We introduce various evaluation measures to assess both task success and progress, utilizing gold solutions when available or approximations otherwise. We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios. This illustrates the challenge of the task and suggests that SUPER can serve as a valuable resource for the community to make and measure progress.
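The abstract distinguishes between full task success and partial progress toward a solution. The sketch below is a rough, hypothetical illustration of how such scores could be aggregated over per-problem results; ProblemResult, landmarks_hit, and both scoring functions are assumed names for illustration, not the paper's actual evaluation harness.

    from dataclasses import dataclass

    @dataclass
    class ProblemResult:
        """Hypothetical record of an agent's run on one benchmark problem."""
        solved: bool             # did the final output match the gold solution or its approximation?
        landmarks_hit: int = 0   # intermediate checkpoints reached (e.g., dependencies installed, trainer configured)
        landmarks_total: int = 1 # total checkpoints annotated for this problem

    def success_rate(results: list[ProblemResult]) -> float:
        """Fraction of problems fully solved (task success)."""
        return sum(r.solved for r in results) / len(results)

    def mean_progress(results: list[ProblemResult]) -> float:
        """Average fraction of intermediate landmarks reached (partial progress)."""
        return sum(r.landmarks_hit / r.landmarks_total for r in results) / len(results)

    if __name__ == "__main__":
        # Toy example with three made-up problem outcomes.
        expert_set = [
            ProblemResult(solved=True, landmarks_hit=3, landmarks_total=3),
            ProblemResult(solved=False, landmarks_hit=1, landmarks_total=4),
            ProblemResult(solved=False, landmarks_hit=0, landmarks_total=2),
        ]
        print(f"success: {success_rate(expert_set):.1%}, progress: {mean_progress(expert_set):.1%}")

Under this toy scoring, an agent can register meaningful progress (landmarks reached) on problems it ultimately fails to solve, which mirrors the gap the abstract reports between end-to-end success and scenario-level results.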