Phi-Ground Tech Report: Advancing Perception in GUI Grounding

Microsoft
August 2025

Left: Comparison of our grounding model's results across five GUI grounding benchmarks. Our model, trained specifically for the agent setting, achieves SOTA results on all benchmarks in that setting. Even in the general end-to-end model setting, it attains SOTA results on three of the benchmarks. Right: The relationship between model performance and computational cost on ScreenSpot-pro shows that our model lies on the Pareto frontier, indicating its efficiency. Most GUI research compares models only by parameter count N, but our experiments show that test-time computational cost, such as the number of image tokens, also significantly impacts performance. The x-axis of the right figure is N·D, where D is the number of image tokens. Training and inference latency correlate more linearly with N·D than with N alone. A plot with latency on the x-axis closely resembles the right figure, but latency is heavily influenced by hardware and acceleration libraries such as vLLM, so we did not use it as the x-axis.
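
To make the N·D proxy concrete, here is a minimal sketch (not the report's evaluation code; the model configurations and numbers below are hypothetical placeholders) that computes the proxy and illustrates why comparing by parameter count alone can be misleading:

# Minimal sketch of the N*D compute proxy described in the caption above.
# N = parameter count, D = number of image tokens fed to the model.
# The configurations below are hypothetical, not results from the report.

def compute_proxy(params_billion: float, image_tokens: int) -> float:
    """Return the N*D proxy (billions of parameters times image tokens)."""
    return params_billion * image_tokens

configs = {
    "small model, high-res screenshot": (4.0, 4096),   # 4B params, 4096 image tokens
    "large model, low-res screenshot": (16.0, 1024),   # 16B params, 1024 image tokens
}

for name, (n, d) in configs.items():
    print(f"{name}: N*D = {compute_proxy(n, d):.0f}")

# Both configurations give the same N*D (16384), so ranking models by
# parameter count alone would hide that they cost roughly the same at test time.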

Abstract

With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a crucial step for CUAs to perform concrete actions, as it determines the coordinates for clicks and other interactions. Current end-to-end grounding models still achieve less than 80% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from ready for deployment, as a single misclick can have unacceptable consequences. In this work, we conduct an empirical study on the training of grounding models, examining every detail from data collection to model training. Ultimately, we develop the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results, with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, can generalize to other perception tasks.

Key results

BibTeX


@article{zhang2025phi,
  title={Phi-Ground Tech Report: Advancing Perception in GUI Grounding},
  author={Zhang, Miaosen and Xu, Ziqiang and Zhu, Jialiang and Dai, Qi and Qiu, Kai and Yang, Yifan and Luo, Chong and Chen, Tianyi and Wagle, Justin and Franklin, Tim and others},
  journal={arXiv preprint arXiv:2507.23779},
  year={2025}
}