From Grounding to Planning: Benchmarking Bottlenecks in Web Agents

arXiv - CS - Multiagent Systems, Pub Date: 2024-09-03, DOI: arxiv-2409.01927

Segev Shlomov, Ben Wiesel, Aviad Sela, Ido Levy, Liane Galanti, Roy Abitbol



Abstract

General web-based agents are increasingly essential for interacting with complex web environments, yet their performance in real-world web applications remains poor, yielding extremely low accuracy even with state-of-the-art frontier models. We observe that these agents can be decomposed into two primary components: Planning and Grounding. Yet, most existing research treats these agents as black boxes, focusing on end-to-end evaluations which hinder meaningful improvements. We sharpen the distinction between the planning and grounding components and conduct a novel analysis by refining experiments on the Mind2Web dataset. Our work proposes a new benchmark for each of the components separately, identifying the bottlenecks and pain points that limit agent performance. Contrary to prevalent assumptions, our findings suggest that grounding is not a significant bottleneck and can be effectively addressed with current techniques. Instead, the primary challenge lies in the planning component, which is the main source of performance degradation. Through this analysis, we offer new insights and demonstrate practical suggestions for improving the capabilities of web agents, paving the way for more reliable agents.
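The planning/grounding split described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's benchmark or code: the `Step` record, the `toy_planner` and `toy_grounder` functions, the exact-match scoring, and the two made-up steps in `STEPS` are all hypothetical. The idea is only that each component can be scored in isolation (planning against a reference action description, grounding when handed that reference description) instead of looking solely at the end-to-end number.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class Step:
    """One hypothetical evaluation step (not the Mind2Web schema)."""
    task: str                     # natural-language goal
    candidates: Mapping[str, str]  # element id -> visible text
    gold_plan: str                # reference description of the next action
    gold_element: str             # id of the element that action targets


# A planner maps (task, page) to a textual action description.
Planner = Callable[[str, Mapping[str, str]], str]
# A grounder maps (action description, page) to a concrete element id.
Grounder = Callable[[str, Mapping[str, str]], str]


def planning_accuracy(steps: Sequence[Step], planner: Planner) -> float:
    """Score the planner alone: does its description match the reference?"""
    hits = sum(planner(s.task, s.candidates) == s.gold_plan for s in steps)
    return hits / len(steps)


def grounding_accuracy(steps: Sequence[Step], grounder: Grounder) -> float:
    """Score the grounder alone by feeding it the reference description."""
    hits = sum(grounder(s.gold_plan, s.candidates) == s.gold_element for s in steps)
    return hits / len(steps)


def end_to_end_accuracy(steps: Sequence[Step], planner: Planner, grounder: Grounder) -> float:
    """Score the full pipeline: plan first, then ground the produced plan."""
    hits = sum(
        grounder(planner(s.task, s.candidates), s.candidates) == s.gold_element
        for s in steps
    )
    return hits / len(steps)


def toy_planner(task: str, candidates: Mapping[str, str]) -> str:
    """Deliberately weak planner: always clicks the first element on the page."""
    first_text = next(iter(candidates.values()))
    return f"click {first_text}"


def toy_grounder(plan: str, candidates: Mapping[str, str]) -> str:
    """Keyword grounder: pick the first element whose text occurs in the plan."""
    for element_id, text in candidates.items():
        if text.lower() in plan.lower():
            return element_id
    return next(iter(candidates))  # fall back to the first element


# Made-up data for illustration only; it does not reproduce any reported result.
PAGE = {"e1": "Search", "e2": "Login"}
STEPS = [
    Step("search for flights", PAGE, "click Search", "e1"),
    Step("sign in", PAGE, "click Login", "e2"),
]
```

On this toy data the end-to-end score alone cannot say which component failed, while the two per-component scores can. Exact string matching on plans is a simplification chosen to keep the sketch short; a real evaluation would need a more forgiving comparison.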


