UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

🔥 Overview

UI-AGILE is a framework that improves Graphical User Interface (GUI) agents at both the training and inference stages. It targets three common problems for agents built on Multimodal Large Language Models (MLLMs): the trade-off in reasoning design, ineffective reward signals, and visual noise.

Key Features

  • Training Enhancements:
    • Continuous Reward Function: Incentivizes high-precision grounding.
    • "Simple Thinking" Reward: Balances planning depth with execution speed and grounding accuracy.
    • Cropping-based Resampling: Mitigates the sparse reward problem and improves learning on complex tasks.
  • Inference Enhancements:
    • Decomposed Grounding with Selection: Improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts and selecting among the results.
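
Two of the ideas above can be illustrated with a minimal sketch. The reward shape, the tile size and overlap, and the `ground_fn` / `score_fn` callables are all assumptions made for illustration; none of them is taken from the UI-AGILE code or paper. The sketch shows (1) a reward that is zero outside the target box and grows as the predicted point approaches the box center, and (2) splitting a large screenshot into overlapping tiles, grounding on each, and mapping the best-scoring candidate back to full-image coordinates.

```python
import math
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]


def continuous_reward(pred: Point, box: Box) -> float:
    """Illustrative reward: 0 outside the box, 1 at its center, lower near the edges."""
    left, top, right, bottom = box
    x, y = pred
    if not (left <= x <= right and top <= y <= bottom):
        return 0.0
    cx, cy = (left + right) / 2, (top + bottom) / 2
    half_w, half_h = (right - left) / 2, (bottom - top) / 2
    # Normalized Chebyshev distance: 0 at the center, 1 on the box border.
    dx = abs(x - cx) / half_w if half_w > 0 else 0.0
    dy = abs(y - cy) / half_h if half_h > 0 else 0.0
    d = max(dx, dy)
    return math.exp(-d * d)  # assumed decay shape, chosen only for illustration


def split_into_tiles(width: int, height: int, tile: int, overlap: int) -> List[Tuple[int, int, int, int]]:
    """Return overlapping tile boxes that together cover a width x height image."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    step = tile - overlap

    def starts(size: int) -> List[int]:
        if size <= tile:
            return [0]
        offsets = list(range(0, size - tile, step))
        offsets.append(size - tile)  # last tile sits flush with the far edge
        return offsets

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


def decomposed_grounding(
    width: int,
    height: int,
    ground_fn: Callable[[Tuple[int, int, int, int]], Optional[Point]],  # hypothetical per-tile grounder
    score_fn: Callable[[Tuple[int, int, int, int], Point], float],      # hypothetical candidate scorer
    tile: int = 1000,
    overlap: int = 200,
) -> Optional[Point]:
    """Ground on every tile and return the best candidate in full-image coordinates."""
    best: Optional[Point] = None
    best_score = -math.inf
    for box in split_into_tiles(width, height, tile, overlap):
        local = ground_fn(box)  # point in tile-local coordinates, or None
        if local is None:
            continue
        score = score_fn(box, local)
        if score > best_score:
            best_score = score
            best = (box[0] + local[0], box[1] + local[1])  # shift back by the tile origin
    return best
```

In practice `ground_fn` would crop the screenshot to the tile and query the model, and `score_fn` would rate how well each candidate matches the instruction; both are left abstract here because the card does not describe them.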

UI-AGILE-7B achieves state-of-the-art grounding performance on benchmarks like ScreenSpot-Pro and ScreenSpot-v2 while maintaining strong general agent capabilities.

⭐️ Citation

If you find this project useful, please cite:

@misc{lian2025uiagileadvancingguiagents,
      title={UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding}, 
      author={Shuquan Lian and Yuhang Wu and Jia Ma and Zihan Song and Bingqi Chen and Xiawu Zheng and Hui Li},
      year={2025},
      eprint={2507.22025},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2507.22025}, 
}

Model size: 4B params · Tensor type: BF16 · Format: Safetensors