DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?

Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu
arXiv - CS - Artificial Intelligence, published 2024-09-12. DOI: arxiv-2409.07703 (https://doi.org/arxiv-2409.07703)

Abstract

Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short when compared to real-world data science applications due to their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning with large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further advancements in developing more practical, intelligent, and autonomous data science agents.
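
The abstract reports two headline numbers, a task success rate and a Relative Performance Gap (RPG), but does not define either. The sketch below shows one plausible way such metrics could be computed. It is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, and the RPG formula used here (the mean over tasks of the agent's improvement over a baseline, normalized by the gap between the best known score and that baseline, and clipped at zero) is an assumption that the abstract does not confirm.

```python
from typing import Sequence


def task_accuracy(correct: Sequence[bool]) -> float:
    """Percentage of tasks answered correctly."""
    if not correct:
        raise ValueError("need at least one task")
    return 100.0 * sum(correct) / len(correct)


def relative_performance_gap(
    agent: Sequence[float],
    baseline: Sequence[float],
    best: Sequence[float],
) -> float:
    """Assumed RPG, in percent (hypothetical definition, see lead-in).

    For each task: max((agent - baseline) / (best - baseline), 0).
    The result is the mean of these per-task values times 100.
    """
    if not agent or not (len(agent) == len(baseline) == len(best)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, b, g in zip(agent, baseline, best):
        if g == b:
            raise ValueError("best and baseline scores must differ")
        # Clip at zero so an agent that underperforms the baseline
        # contributes nothing rather than a negative value.
        total += max((a - b) / (g - b), 0.0)
    return 100.0 * total / len(agent)


if __name__ == "__main__":
    # If the reported rate is taken over all 466 analysis tasks,
    # 159 correct answers would round to the reported 34.12%.
    outcomes = [True] * 159 + [False] * 307
    print(round(task_accuracy(outcomes), 2))

    # Toy scores for two modeling tasks (made-up numbers).
    print(relative_performance_gap([0.75, 0.5], [0.5, 0.5], [1.0, 1.0]))
```

Under this reading, the two figures answer different questions: the success rate counts exactly solved analysis tasks, while RPG measures how much of the distance between a baseline and the best known result an agent closes on modeling tasks.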