With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a crucial step for CUAs to perform concrete actions, as it determines the pixel coordinates for clicks and other interactions. Current end-to-end grounding models still achieve less than 80% accuracy on challenging benchmarks such as ScreenSpot-Pro and UI-Vision, indicating that they are far from ready for deployment, since a single misclick can have unacceptable consequences. In this work, we conduct an empirical study on the training of grounding models, examining every detail from data collection to model training. Ultimately, we develop the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results, scoring 43.2 on ScreenSpot-Pro and 27.2 on UI-Vision. We believe that the details discussed in this paper, along with our successes and failures, can generalize to other perception tasks.
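To make the grounding task concrete, below is a minimal sketch of its input/output contract: given a screenshot and a natural-language instruction, the model returns the coordinates of the target element. All names here (ClickTarget, ground) are hypothetical placeholders for illustration, not the actual Phi-Ground API, and the stub returns a fixed point rather than predicting one from the image.

# Minimal sketch of the GUI grounding interface. ClickTarget and ground are
# hypothetical placeholders, not the Phi-Ground API.
from dataclasses import dataclass

@dataclass
class ClickTarget:
    x: int  # pixel column of the predicted click point
    y: int  # pixel row of the predicted click point

def ground(instruction: str, screenshot_size: tuple[int, int]) -> ClickTarget:
    """Map a natural-language instruction plus a screenshot to a click point.

    A real grounding model consumes the screenshot pixels; this stub only
    illustrates the interface, so it simply returns the screen center.
    """
    width, height = screenshot_size
    return ClickTarget(x=width // 2, y=height // 2)

target = ground("Click the 'Save' button", screenshot_size=(1920, 1080))
print(f"Predicted click at ({target.x}, {target.y})")  # (960, 540)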
@article{zhang2025phi,
  title={Phi-Ground Tech Report: Advancing Perception in GUI Grounding},
  author={Zhang, Miaosen and Xu, Ziqiang and Zhu, Jialiang and Dai, Qi and Qiu, Kai and Yang, Yifan and Luo, Chong and Chen, Tianyi and Wagle, Justin and Franklin, Tim and others},
  journal={arXiv preprint arXiv:2507.23779},
  year={2025}
}