AppSelectBench: Application-Level Tool Selection Benchmark

Tianyi Chen*, Michael Solodko*, Sen Wang, Jongwoo Ko, Junheng Hao, Colby Banbury, Sara Abdali, Saeed Amizadeh, Qing Xiao, Yinheng Li, Tianyu Ding, Kamran Ghasedi Dizaji, Suzhen Zheng, Hao Fan, Justin Wagle, Pashmina Cameron, Kazuhito Koishida
* Equal Contribution
Microsoft
November 2025

Abstract

Computer-Using Agents (CUAs) increasingly rely on external tools to execute complex, realistic tasks. For a CUA to operate effectively, application selection, that is, deciding which application to use before invoking fine-grained tools such as APIs, is foundational to initializing the correct environment, avoiding orchestration confusion, and focusing on the relevant context. However, existing benchmarks primarily test fine-grained API selection, leaving open the question of whether models can reason across and choose between different applications. To address this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench combines a novel user-task generation pipeline, which produces realistic, diverse, and semantically grounded user intents at scale, with unified evaluation protocols spanning random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. The benchmark covers one hundred widely used desktop applications and includes more than one hundred thousand realistic tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning and show that even the most capable models still struggle to make consistent application choices. These findings establish AppSelectBench as a foundation for studying and advancing application-level reasoning, an essential yet underexplored capability of intelligent CUAs.

Application Coverage

To ensure fairness and representativeness in evaluation, we systematically curate a diverse set of applications that span the spectrum of everyday computer use. Applications are grouped into twelve high-level categories that reflect distinct modes of human–computer interaction.

App domain table
App domain wheel

User Task Generation

Our user-task generation pipeline proceeds in four stages. First, we curate an application-specific atomic task database with explicit argument schemas. Second, a composition engine samples and composes these primitives into higher-level workflows. Third, an argument generator instantiates realistic values to concretize the abstract workflows. Finally, an instruction narrator synthesizes natural-language user-task instructions by integrating step-wise and atomic-task descriptions, yielding fluent and realistic prompts.

User task generation overview
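
To make these four stages concrete, the Python sketch below walks through a miniature version of the pipeline: a tiny atomic-task database with explicit argument schemas, random composition into a two-step workflow, placeholder argument instantiation, and a template-based narrator. The class and function names, the example schemas, and the template narration are illustrative assumptions; the actual pipeline relies on LLM-based generation over a far larger task database.

# Minimal sketch of the four-stage pipeline; names and schemas are illustrative assumptions.
import random
from dataclasses import dataclass


@dataclass
class AtomicTask:
    # Stage 1: an application-specific primitive with an explicit argument schema.
    app: str
    name: str
    arg_schema: dict   # argument name -> short description of the expected value
    description: str


ATOMIC_TASK_DB = [
    AtomicTask("Outlook", "send_email",
               {"to": "recipient address", "subject": "short string"},
               "Send an email"),
    AtomicTask("Excel", "create_sheet",
               {"filename": "workbook name"},
               "Create a new spreadsheet"),
    AtomicTask("Excel", "insert_chart",
               {"range": "cell range", "chart_type": "bar, line, or pie"},
               "Insert a chart"),
]


def compose_workflow(db, n_steps=2):
    # Stage 2: sample and compose atomic primitives into a higher-level workflow.
    return random.sample(db, k=min(n_steps, len(db)))


def generate_arguments(task):
    # Stage 3: instantiate argument values (placeholders here; realistic values in practice).
    return {arg: f"<{desc}>" for arg, desc in task.arg_schema.items()}


def narrate(workflow, argument_sets):
    # Stage 4: synthesize an instruction from step-wise descriptions (template-based here).
    steps = []
    for task, args in zip(workflow, argument_sets):
        arg_str = ", ".join(f"{k}={v}" for k, v in args.items())
        steps.append(f"{task.description.lower()} ({arg_str})")
    return "Please " + ", then ".join(steps) + "."


if __name__ == "__main__":
    workflow = compose_workflow(ATOMIC_TASK_DB)
    argument_sets = [generate_arguments(t) for t in workflow]
    print(narrate(workflow, argument_sets))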

An LLM rephrasing module refines the resulting text, paraphrasing it into fluent, contextually coherent, and user-friendly language. This two-stage realism enhancement, i.e., step-wise drop-out followed by LLM rephrasing, bridges the gap between compositional task generation and realistic natural-language instructions.

Instruction narrator
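
A minimal sketch of this two-stage enhancement is shown below, assuming a generic llm_call callable that maps a prompt string to a completion string. The drop probability, the prompt wording, and the rephrase_with_llm helper are hypothetical and only illustrate the drop-out-then-rephrase idea, not the paper's exact procedure.

# Sketch of step-wise drop-out followed by LLM rephrasing; all parameters are assumptions.
import random


def stepwise_dropout(step_descriptions, drop_prob=0.3):
    # Randomly omit intermediate steps so the instruction reads less like a literal script.
    kept = [s for s in step_descriptions if random.random() > drop_prob]
    return kept or step_descriptions[:1]   # always keep at least one step


def rephrase_with_llm(draft_instruction, llm_call):
    # Ask an LLM to paraphrase the draft into fluent, user-friendly language.
    prompt = (
        "Rewrite the following task description as a natural request a real user "
        f"might type, keeping its intent unchanged:\n\n{draft_instruction}"
    )
    return llm_call(prompt)


# Example usage with an identity stub standing in for a real LLM client.
steps = ["create a new spreadsheet", "insert a bar chart", "email it to the manager"]
draft = "Please " + ", then ".join(stepwise_dropout(steps)) + "."
print(rephrase_with_llm(draft, llm_call=lambda p: draft))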

Experimental Results

We present a comprehensive evaluation of AppSelectBench, encompassing both the process and quality of synthesized user tasks and the performance of different evaluation protocols on application-level tool selection. Our experiments provide an integrated view of data quality and model behavior, establishing a reliable foundation for future research on application-level reasoning and its role in advancing computer-using agents.

Accuracy results across all application categories.

Overall application-selection accuracy for baselines and LLMs under zero-shot, few-shot, and retrieval-augmented selection.
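
For intuition about how these protocols differ, the sketch below frames zero-shot, few-shot, and retrieval-augmented selection as interchangeable prompt builders feeding one accuracy computation. The prompt formats, the select_app callable, and the retrieve interface are assumptions made for illustration, not the benchmark's actual prompts or scoring code.

# Sketch of the three LLM selection protocols as prompt builders; formats are assumptions.
def zero_shot_prompt(task, app_list):
    return f"Task: {task}\nChoose the single best application from: {', '.join(app_list)}."


def few_shot_prompt(task, app_list, exemplars):
    # `exemplars` is a list of (task, application) pairs shown before the query.
    shots = "\n".join(f"Task: {t}\nApplication: {a}" for t, a in exemplars)
    return shots + "\n" + zero_shot_prompt(task, app_list)


def retrieval_augmented_prompt(task, app_list, retrieve):
    # `retrieve(task, k)` returns the k application descriptions most similar to the task.
    context = "\n".join(retrieve(task, k=5))
    return "Relevant application descriptions:\n" + context + "\n\n" + zero_shot_prompt(task, app_list)


def accuracy(tasks, gold_apps, app_list, select_app, build_prompt):
    # `select_app` maps a prompt string to a predicted application name (e.g., an LLM call).
    correct = sum(select_app(build_prompt(t, app_list)) == g for t, g in zip(tasks, gold_apps))
    return correct / max(len(tasks), 1)


# Example: scoring a trivial baseline that always predicts the first application.
apps = ["Outlook", "Excel", "Chrome"]
tasks = ["email the quarterly report to my manager", "build a pivot table of Q3 sales"]
gold = ["Outlook", "Excel"]
print(accuracy(tasks, gold, apps, select_app=lambda p: "Outlook", build_prompt=zero_shot_prompt))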

BibTeX


@article{chen2025appselectbench,
  title={AppSelectBench: Application-Level Tool Selection Benchmark},
  author={Chen, Tianyi and Solodko, Michael and Wang, Sen and Ko, Jongwoo and Hao, Junheng and Banbury, Colby and Abdali, Sara and Amizadeh, Saeed and Xiao, Qing and Li, Yinheng and Ding, Tianyu and Dizaji, Kamran Ghasedi and Zheng, Suzhen and Fan, Hao and Wagle, Justin and Cameron, Pashmina and Koishida, Kazuhito},
  journal={arXiv preprint arXiv:2507.23779},
  year={2025}
}