Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui
{"title":"Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale","authors":"Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui","doi":"arxiv-2409.08264","DOIUrl":null,"url":null,"abstract":"Large language models (LLMs) show remarkable potential to act as computer\nagents, enhancing human productivity and software accessibility in multi-modal\ntasks that require planning and reasoning. However, measuring agent performance\nin realistic environments remains a challenge since: (i) most benchmarks are\nlimited to specific modalities or domains (e.g. text-only, web navigation, Q&A,\ncoding) and (ii) full benchmark evaluations are slow (on order of magnitude of\ndays) given the multi-step sequential nature of tasks. To address these\nchallenges, we introduce the Windows Agent Arena: a reproducible, general\nenvironment focusing exclusively on the Windows operating system (OS) where\nagents can operate freely within a real Windows OS and use the same wide range\nof applications, tools, and web browsers available to human users when solving\ntasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse\nWindows tasks across representative domains that require agent abilities in\nplanning, screen understanding, and tool usage. Our benchmark is scalable and\ncan be seamlessly parallelized in Azure for a full benchmark evaluation in as\nlittle as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we\nalso introduce a new multi-modal agent, Navi. Our agent achieves a success rate\nof 19.5% in the Windows domain, compared to 74.5% performance of an unassisted\nhuman. Navi also demonstrates strong performance on another popular web-based\nbenchmark, Mind2Web. We offer extensive quantitative and qualitative analysis\nof Navi's performance, and provide insights into the opportunities for future\nresearch in agent development and data generation using Windows Agent Arena. Webpage: https://microsoft.github.io/WindowsAgentArena Code: https://github.com/microsoft/WindowsAgentArena","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"12 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08264","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Large language models (LLMs) show remarkable potential to act as computer
agents, enhancing human productivity and software accessibility in multi-modal
tasks that require planning and reasoning. However, measuring agent performance
in realistic environments remains a challenge since: (i) most benchmarks are
limited to specific modalities or domains (e.g. text-only, web navigation, Q&A,
coding) and (ii) full benchmark evaluations are slow (on order of magnitude of
days) given the multi-step sequential nature of tasks. To address these
challenges, we introduce the Windows Agent Arena: a reproducible, general
environment focusing exclusively on the Windows operating system (OS) where
agents can operate freely within a real Windows OS and use the same wide range
of applications, tools, and web browsers available to human users when solving
tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse
Windows tasks across representative domains that require agent abilities in
planning, screen understanding, and tool usage. Our benchmark is scalable and
can be seamlessly parallelized in Azure for a full benchmark evaluation in as
little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we
also introduce a new multi-modal agent, Navi. Our agent achieves a success rate
of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted
human. Navi also demonstrates strong performance on another popular web-based
benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis
of Navi's performance, and provide insights into the opportunities for future
research in agent development and data generation using Windows Agent Arena. Webpage: https://microsoft.github.io/WindowsAgentArena Code: https://github.com/microsoft/WindowsAgentArena