Computer-Using Agents (CUAs) increasingly rely on external tools to carry out complex, realistic tasks. Before a CUA can invoke fine-grained tools such as APIs, it must first decide which application to use; this application-selection step is foundational for initializing the correct environment, avoiding orchestration confusion, and narrowing the agent's attention to the relevant context. Existing benchmarks, however, primarily test fine-grained API selection, leaving open the question of whether models can reason across applications and choose between them. To address this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench combines a novel user-task generation pipeline, which produces realistic, diverse, and semantically grounded user intents at scale, with unified evaluation protocols spanning random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. It covers one hundred widely used desktop applications and includes more than one hundred thousand realistic tasks. Extensive experiments on both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning and show that even the most capable models still struggle to make consistent application choices. These findings establish AppSelectBench as a foundation for studying and advancing application-level reasoning, an essential yet underexplored capability of intelligent CUAs.
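The evaluation settings named above (random, heuristic, zero-shot, few-shot, retrieval-augmented) all reduce to the same core task: given a user intent and a catalog of applications, route the task to one application. The sketch below illustrates what a zero-shot evaluation loop for this task might look like. It is a minimal sketch, not the AppSelectBench implementation: the generic llm callable, the Task, select_application, and selection_accuracy names, the prompt wording, and the fallback handling are all assumptions made for illustration.

# Minimal zero-shot application-selection sketch (hypothetical names, not the
# AppSelectBench implementation). `llm` is any callable mapping a prompt string
# to the model's text response.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Task:
    intent: str    # natural-language user intent, e.g. "trim the last 10 seconds off a video"
    gold_app: str  # the application the task should be routed to


def select_application(llm: Callable[[str], str],
                       intent: str,
                       candidate_apps: Sequence[str]) -> str:
    """Ask the model to pick exactly one application for the given intent."""
    prompt = (
        "You are a computer-using agent. Choose the single best application "
        "for the user's task.\n"
        f"Applications: {', '.join(candidate_apps)}\n"
        f"Task: {intent}\n"
        "Answer with the application name only."
    )
    answer = llm(prompt).strip()
    # Answers outside the catalog are mapped to a sentinel and counted as wrong.
    return answer if answer in candidate_apps else "<invalid>"


def selection_accuracy(llm: Callable[[str], str],
                       tasks: Sequence[Task],
                       candidate_apps: Sequence[str]) -> float:
    """Fraction of tasks routed to the correct application."""
    correct = sum(
        select_application(llm, t.intent, candidate_apps) == t.gold_app
        for t in tasks
    )
    return correct / len(tasks)

A few-shot or retrieval-augmented variant would change only how the prompt is constructed (for example, by adding worked examples or retrieved application descriptions); the accuracy computation stays the same.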
@article{chen2025appselectbench,
  title={AppSelectBench: Application-Level Tool Selection Benchmark},
  author={Chen, Tianyi and Solodko, Michael and Wang, Sen and Ko, Jongwoo and Hao, Junheng and Banbury, Colby and Abdali, Sara and Amizadeh, Saeed and Xiao, Qing and Li, Yinheng and Ding, Tianyu and Dizaji, Kamran Ghasedi and Zheng, Suzhen and Fan, Hao and Wagle, Justin and Cameron, Pashmina and Koishida, Kazuhito},
  journal={arXiv preprint arXiv:2507.23779},
  year={2025}
}