Windows Agent Arena

Evaluating Multi-Modal OS Agents at Scale

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida
Microsoft

Arthur Bucker, Lawrence Jang
Carnegie Mellon University

Zack Hui
Columbia University

We built a scalable, open-source framework for testing and developing AI agents that can reason, plan, and act on a PC using language models

Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since:

  • (i) most benchmarks are limited to specific modalities or domains (e.g., text-only, web navigation, Q&A, coding) and
  • (ii) full benchmark evaluations are slow (on the order of days) given the multi-step sequential nature of tasks.

To address these challenges, we introduce the WindowsAgentArena (WAA): a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is also scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes.

To demonstrate WAA's capabilities, we also introduce a new multi-modal agent, Navi, showing it can achieve a success rate of 19.5%, compared to 74.5% for human performance. In addition, we show Navi's strong performance on another popular web-based benchmark, Mind2Web.

We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the challenges and opportunities for future research in agent development and data generation using WAA.

Windows Agent Arena tasks:

Our initial release consists of 154 diverse tasks spanning applications that represent typical user workloads in the Windows OS: editing documents and spreadsheets (LibreOffice Calc/Writer), browsing the internet (Microsoft Edge, Google Chrome), Windows system tasks (File Explorer, Settings), coding (Visual Studio Code), watching videos (VLC Player), and utility functions (Notepad, Clock, Paint):

[Figure: examples of Windows Agent Arena tasks across supported applications]

Task evaluation is deterministic, and we use custom scripts to generate a reward at the end of each episode:

[Figure: deterministic task evaluation with custom reward scripts]
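For illustration, a task definition of this kind pairs a natural-language instruction with a scripted check of the final VM state that produces the reward. The sketch below is hedged: the field names, paths, and helper function are illustrative stand-ins inspired by the OSWorld-style configuration format that WAA adapts, not the actual schema.

# Hedged sketch of a deterministic task definition plus its reward script.
# Field names, paths, and the helper below are illustrative, not the actual
# WindowsAgentArena schema.
import json
from pathlib import Path

task = {
    "id": "notepad_save_greeting",                 # illustrative task id
    "instruction": "Open Notepad and save a file greeting.txt on the Desktop containing 'hello'.",
    "config": [                                    # setup steps run inside the VM before the episode
        {"type": "launch", "command": "notepad.exe"},
    ],
    "evaluator": {                                 # deterministic check run after the episode
        "func": "check_file_contains",
        "args": {"path": "C:/Users/Desktop/greeting.txt", "needle": "hello"},
    },
}

def check_file_contains(path: str, needle: str) -> float:
    """Reward script: 1.0 if the file exists and contains `needle`, else 0.0."""
    p = Path(path)
    if not p.is_file():
        return 0.0
    return 1.0 if needle in p.read_text(encoding="utf-8", errors="ignore") else 0.0

print(json.dumps(task, indent=2))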

Azure agent parallelization: results in minutes, not days:

We designed the infrastructure behind Windows Agent Arena to support flexible, local execution during the prototyping phase as well as scalable and secure cloud parallelization in Azure. The core of our system is a Docker container that hosts the Windows 11 VM. Within the container, we deploy a client process for task scheduling and configuration as well as the agent and the evaluation scripts. The VM is our main simulation environment: a Python Flask server acts as the bridge between the container and the VM by receiving commands from the client processes and executing them within the VM; it also sends observations and files back to the client.
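As a rough picture of that bridge, the minimal Flask sketch below accepts a shell command from the client process, executes it inside the VM, and serves screenshots back as observations. Route names and payload fields are assumptions for illustration, not the actual WAA server API.

# Minimal sketch of the Flask bridge server running inside the Windows 11 VM.
# Route names and payload fields are assumptions, not the real WAA API.
import subprocess
from flask import Flask, jsonify, request, send_file
from PIL import ImageGrab    # assumes Pillow is installed inside the VM

app = Flask(__name__)

@app.route("/execute", methods=["POST"])
def execute():
    """Run a shell command sent by the client process and return its output."""
    cmd = request.json.get("command", "")
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
    return jsonify({"returncode": result.returncode,
                    "stdout": result.stdout,
                    "stderr": result.stderr})

@app.route("/screenshot", methods=["GET"])
def screenshot():
    """Capture the current screen and return it as a PNG observation."""
    path = "C:/tmp/obs.png"
    ImageGrab.grab().save(path)
    return send_file(path, mimetype="image/png")

if __name__ == "__main__":
    # Listen on all interfaces so the Docker-side client can reach the VM.
    app.run(host="0.0.0.0", port=5000)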

We use Azure Machine Learning jobs to parallelize the benchmark evaluation using compute instances. The process is similar to the local setup, but the VMs are instantiated and terminated with each experiment submission. We use Azure Blob Storage to manage the Windows 11 snapshot and output logs, while the code is pre-configured in the Docker image. Tasks are distributed evenly among the workers, and the results are aggregated at the end of the run.
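The even split of tasks across workers can be pictured in a few lines; the function below is an illustrative round-robin sketch, not the actual WAA scheduler.

# Hedged sketch of distributing tasks evenly across parallel Azure ML workers;
# illustrative only, not the actual WindowsAgentArena scheduler.
from typing import List

def split_tasks(task_ids: List[str], num_workers: int) -> List[List[str]]:
    """Assign task ids to workers as evenly as possible (round-robin)."""
    buckets: List[List[str]] = [[] for _ in range(num_workers)]
    for i, task_id in enumerate(task_ids):
        buckets[i % num_workers].append(task_id)
    return buckets

if __name__ == "__main__":
    tasks = [f"task_{i:03d}" for i in range(154)]   # 154 tasks in the initial release
    for worker_id, bucket in enumerate(split_tasks(tasks, 40)):   # e.g., 40 VMs
        print(f"worker {worker_id:02d}: {len(bucket)} tasks")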

[Figure: local and Azure deployment infrastructure]

Navi, an agent for Windows navigation:

We use chain-of-thought prompting to instruct our agent, Navi, to reason about the current state of the computer and its own past actions, and to decide on the most appropriate next action. Our agent receives as input the title of the current foreground window, the titles of all other open windows or browser tabs, and a representation of the current screen. We consider several methods to process the screen representation and create Set-of-Marks (SoMs), as sketched after the list below:

  • UIA tree parsing: extracts the visible elements from the Windows UI Automation tree
  • DOM tree parsing: extracts the visible elements from the DOM tree (browser only)
  • OCR: proprietary and open models (Tesseract)
  • Icon and image detection: proprietary and open models (Grounding DINO)
  • OmniParser: proprietary model that detects text, icons, and images and provides icon captioning
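
To make the SoM idea concrete, the sketch below overlays numbered boxes on a screenshot and produces matching text lines for the prompt. The element format and helper are illustrative assumptions, not Navi's internal representation.

# Hedged sketch of a Set-of-Marks overlay: draw a numbered box over each
# detected UI element so the model can refer to elements by id in its action.
# The element format and function below are illustrative, not Navi's code.
from typing import Dict, List
from PIL import Image, ImageDraw

def draw_set_of_marks(screenshot_path: str, elements: List[Dict], out_path: str) -> List[str]:
    """Annotate the screenshot with numbered boxes; return one prompt line per mark."""
    image = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    prompt_lines = []
    for idx, el in enumerate(elements):
        x0, y0, x1, y1 = el["bbox"]   # pixel box from UIA/DOM parsing, OCR, or icon detection
        draw.rectangle([x0, y0, x1, y1], outline="red", width=2)
        draw.text((x0 + 2, y0 + 2), str(idx), fill="red")
        prompt_lines.append(f"[{idx}] {el.get('role', 'element')}: {el.get('name', '')}")
    image.save(out_path)
    return prompt_lines

# Example elements as they might come from UIA-tree parsing (illustrative):
elements = [
    {"bbox": (10, 10, 120, 40), "role": "button", "name": "File"},
    {"bbox": (130, 10, 240, 40), "role": "button", "name": "Edit"},
]
# prompt_lines = draw_set_of_marks("screen.png", elements, "screen_som.png")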

Below you can see a step-by-step example of Navi's reasoning process and screen parsing during a task:

[Figure: step-by-step example of Navi's reasoning and screen parsing during a task]

Results:

We benchmark several state-of-the-art visual-language model agent configurations. We find that all existing models achieve low success rates compared to human performance, with large variance across domains.

The quality of the Set-of-Marks plays a crucial role in the agent's performance. Agents that rely only on pixel-based OCR and icon detection achieve lower performance than those that also use the UIA tree. We also find that OmniParser's icon captioning capability boosts performance.

[Figure: benchmark results across agent configurations]

BibTeX citation

@article{bonatti2024arena,
  author      = {Bonatti, Rogerio and Zhao, Dan and Bonacci, Francesco and Dupont, Dillon and Abdali, Sara and Li, Yinheng and Lu, Yadong and Wagle, Justin and Koishida, Kazuhito and Bucker, Arthur and Jang, Lawrence and Hui, Zack},
  title       = {Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale},
  institution = {Microsoft},
  year        = {2024},
  month       = {September},
}