Segev Shlomov, Ben Wiesel, Aviad Sela, Ido Levy, Liane Galanti, Roy Abitbol
arXiv - CS - Multiagent Systems, 2024-09-03. DOI: arxiv-2409.01927 (https://doi.org/arxiv-2409.01927)
From Grounding to Planning: Benchmarking Bottlenecks in Web Agents
General web-based agents are increasingly essential for interacting with
complex web environments, yet their performance in real-world web applications
remains poor, yielding extremely low accuracy even with state-of-the-art
frontier models. We observe that these agents can be decomposed into two
primary components: Planning and Grounding. Yet most existing research treats
these agents as black boxes, focusing on end-to-end evaluations, which hinders
meaningful improvement. We sharpen the distinction between the planning and
grounding components and conduct a novel analysis by refining experiments on
the Mind2Web dataset. Our work proposes a new benchmark for each of the
components separately, identifying the bottlenecks and pain points that limit
agent performance. Contrary to prevalent assumptions, our findings suggest that
grounding is not a significant bottleneck and can be effectively addressed with
current techniques. Instead, the primary challenge lies in the planning
component, which is the main source of performance degradation. Through this
analysis, we offer new insights and demonstrate practical suggestions for
improving the capabilities of web agents, paving the way for more reliable
agents.
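
The separation of planning from grounding described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the `Step` schema, the function names, the toy planner and grounder, and the two example steps are all hypothetical, and none of them reflect the authors' benchmark or the Mind2Web data format. The idea shown is that grounding can be scored in isolation by feeding it a reference intent, while an end-to-end score mixes planner and grounder errors together.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass
class Step:
    """One annotated step of a web task (hypothetical schema)."""
    task: str                   # natural-language goal
    candidates: Dict[str, str]  # element id -> visible text on the page
    gold_intent: str            # reference description of the next action
    gold_element: str           # id of the element that should be acted on


# Assumed interfaces: a planner decides WHAT to do, a grounder decides WHERE.
Planner = Callable[[str, Dict[str, str]], str]   # (task, page) -> intent text
Grounder = Callable[[str, Dict[str, str]], str]  # (intent, page) -> element id


def grounding_accuracy(steps: Sequence[Step], grounder: Grounder) -> float:
    """Score the grounder alone by giving it the reference intent."""
    hits = sum(grounder(s.gold_intent, s.candidates) == s.gold_element for s in steps)
    return hits / len(steps)


def end_to_end_accuracy(steps: Sequence[Step], planner: Planner, grounder: Grounder) -> float:
    """Score planner and grounder together; errors from both are mixed."""
    hits = sum(
        grounder(planner(s.task, s.candidates), s.candidates) == s.gold_element
        for s in steps
    )
    return hits / len(steps)


def overlap_grounder(intent: str, candidates: Dict[str, str]) -> str:
    """Toy grounder: pick the element sharing the most words with the intent."""
    words = set(intent.lower().split())
    return max(candidates, key=lambda eid: len(words & set(candidates[eid].lower().split())))


def echo_planner(task: str, candidates: Dict[str, str]) -> str:
    """Deliberately naive toy planner: repeats the task instead of planning a step."""
    return task


# Contrived toy data, for illustration only.
toy_steps = [
    Step("Book a flight to Paris",
         {"e1": "Sign in", "e2": "Search flights", "e3": "Hotels"},
         "click search flights", "e2"),
    Step("Book a flight to Paris",
         {"e4": "Departure date", "e5": "Destination", "e6": "Submit"},
         "type into destination", "e5"),
]

print("grounding only:", grounding_accuracy(toy_steps, overlap_grounder))
print("end to end:", end_to_end_accuracy(toy_steps, echo_planner, overlap_grounder))
```

Because the toy planner is intentionally weak, any gap between the two scores here is by construction and says nothing about real agents. The point is only that scoring each component against its own reference makes it visible which one a failure comes from.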