| Model | Success Rate (%) | Average Score (%) | Average Steps | Max Steps |
|---|---|---|---|---|
| Human Baseline | 80 | 90 | - | - |
| GPT-5.5 (Medium) | 63 | 78 | 13.70 | 30 |
| API Baseline (Claude-4.5-Opus) | 62 | 81 | - | - |
| Claude-Opus-4.8 (High) | 62 | 78 | 14.13 | 30 |
| Claude-4.5-Opus | 45 | 57 | 20.96 | 30 |
| Claude-4-Sonnet | 42 | 53 | 12.31 | 30 |
| Computer-Use-Preview | 38 | 49 | 21.68 | 30 |
| OpenCUA-32B | 28 | 42 | 16.27 | 30 |
| OpenCUA-7B | 24 | 36 | 19.22 | 30 |
| Qwen3-VL-8B | 16 | 27 | 22.58 | 30 |
| Qwen3-VL-32B | 14 | 23 | 23.24 | 30 |