SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories

Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, Tushar Khot
{"title":"SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories","authors":"Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, Tushar Khot","doi":"arxiv-2409.07440","DOIUrl":null,"url":null,"abstract":"Given that Large Language Models (LLMs) have made significant progress in\nwriting code, can they now be used to autonomously reproduce results from\nresearch repositories? Such a capability would be a boon to the research\ncommunity, helping researchers validate, understand, and extend prior work. To\nadvance towards this goal, we introduce SUPER, the first benchmark designed to\nevaluate the capability of LLMs in setting up and executing tasks from research\nrepositories. SUPERaims to capture the realistic challenges faced by\nresearchers working with Machine Learning (ML) and Natural Language Processing\n(NLP) research repositories. Our benchmark comprises three distinct problem\nsets: 45 end-to-end problems with annotated expert solutions, 152 sub problems\nderived from the expert set that focus on specific challenges (e.g.,\nconfiguring a trainer), and 602 automatically generated problems for\nlarger-scale development. We introduce various evaluation measures to assess\nboth task success and progress, utilizing gold solutions when available or\napproximations otherwise. We show that state-of-the-art approaches struggle to\nsolve these problems with the best model (GPT-4o) solving only 16.3% of the\nend-to-end set, and 46.1% of the scenarios. This illustrates the challenge of\nthis task, and suggests that SUPER can serve as a valuable resource for the\ncommunity to make and measure progress.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"60 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07440","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for larger-scale development. We introduce various evaluation measures to assess both task success and progress, utilizing gold solutions when available or approximations otherwise. We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios. This illustrates the challenge of this task and suggests that SUPER can serve as a valuable resource for the community to make and measure progress.
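To make the benchmark structure described above more concrete, the sketch below shows one way a SUPER-style problem instance and its scoring could be represented: a task tied to a repository, an optional gold answer from the expert solutions, and a landmark-based approximation of progress when no exact check is available. The class name, field names, and scoring rule here are illustrative assumptions, not the benchmark's actual schema or API.

```python
# Hypothetical sketch of a SUPER-style problem instance and its scoring.
# All names and the scoring rule are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RepoTask:
    """One benchmark problem: set up a research repository and execute a task in it."""
    repo_url: str                          # repository the agent must clone and configure
    goal: str                              # natural-language task, e.g. "train the model on dataset X"
    problem_set: str                       # "expert" (45), "sub" (152), or "auto" (602)
    gold_answer: Optional[str] = None      # exact expected output, when an expert solution exists
    landmarks: List[str] = field(default_factory=list)  # intermediate artifacts used to approximate progress


def score(task: RepoTask, agent_output: str, agent_log: str) -> dict:
    """Score one agent run: exact success against the gold answer (if any),
    plus a landmark-based progress approximation over the execution log."""
    solved = task.gold_answer is not None and agent_output.strip() == task.gold_answer.strip()
    hits = sum(1 for mark in task.landmarks if mark in agent_log)
    progress = hits / len(task.landmarks) if task.landmarks else float(solved)
    return {"solved": solved, "progress": progress}
```

Under this reading, the expert set would be scored primarily by the exact-match check, while the automatically generated problems would rely on the progress approximation; the paper itself should be consulted for the actual evaluation protocol.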