Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

arXiv - CS - Artificial Intelligence | Pub Date: 2024-09-12 | DOI: arxiv-2409.08264

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui



Abstract

Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge because (i) most benchmarks are limited to specific modalities or domains (e.g., text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on the order of days) given the multi-step, sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS), where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% for an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena.

Webpage: https://microsoft.github.io/WindowsAgentArena
Code: https://github.com/microsoft/WindowsAgentArena


