Benchmark Overview

UFO² is benchmarked on two publicly available live‑task suites:

Benchmark                    Scope
Windows Agent Arena (WAA)    154 real Windows tasks across 15 applications (Office, Edge, File Explorer, VS Code, …)
OSWorld (Windows)            49 cross‑application tasks that mix Office 365, browser and system utilities

The integration of these benchmarks with UFO² is maintained in separate repositories. Please refer to each benchmark's documentation for more details.

Note

We have revised the verification scripts for some cases to ensure the correctness of the results.