yappertar4 committed · Commit 7579fdd · verified · 1 Parent(s): a02e1c8

Update README.md

Files changed (1): README.md +0 -143
README.md CHANGED
@@ -18,146 +18,3 @@ tags:
- multimodal
---

<p align="center">
  <img src="images/logo.png"/>
</p>

<p align="center">
  <a href="https://huggingface.co/tencent/POINTS-GUI-G">
    <img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace">
  </a>
  <a href="https://github.com/Tencent/POINTS-GUI">
    <img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="GitHub Code">
  </a>
  <a href="https://huggingface.co/papers/2602.06391">
    <img src="https://img.shields.io/badge/Paper-POINTS--GUI--G-d4333f?logo=arxiv&logoColor=white&colorA=cccccc&colorB=d4333f&style=flat" alt="Paper">
  </a>
  <a href="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views">
    <img src="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views" alt="Views">
  </a>
</p>

## News

- 🔜 <b>Upcoming:</b> The <b>End-to-End GUI Agent Model</b> is under active development and will be released in a subsequent update. Stay tuned!
- 🚀 2026.02.06: We are happy to present <b>POINTS-GUI-G</b>, our specialized GUI grounding model. To facilitate reproducible evaluation, we provide comprehensive scripts and guidelines in our <a href="https://github.com/Tencent/POINTS-GUI/tree/main/evaluation">GitHub repository</a>.

## Introduction

POINTS-GUI-G-8B is a specialized GUI grounding model introduced in the paper [POINTS-GUI-G: GUI-Grounding Journey](https://huggingface.co/papers/2602.06391).

1. **State-of-the-Art Performance**: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks: 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.

2. **Full-Stack Mastery**: Unlike many current GUI agents that build on models already possessing strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up on POINTS-1.5. We have mastered the complete technical pipeline, demonstrating that a GUI specialist can be built from a general-purpose base model through targeted optimization.

3. **Refined Data Engineering**: We build a unified data pipeline that (1) standardizes all coordinates to a [0, 1] range and reformats heterogeneous tasks into a single "locate UI element" formulation (see the sketch after this list), (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases.

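Concretely, the coordinate standardization in step (1) amounts to dividing pixel positions by the image dimensions and rounding to three decimals, matching the output format the model is prompted with below. A minimal sketch (the `normalize_point` helper is our own illustration, not part of the released pipeline):

```python
# Minimal sketch of the [0, 1] coordinate convention described in step (1).
# `normalize_point` is a hypothetical helper for illustration only; it is not
# part of the released pipeline.
def normalize_point(x_px: float, y_px: float, width: int, height: int) -> tuple[float, float]:
    """Map a pixel coordinate to the [0, 1] range, rounded to three decimals."""
    return (round(x_px / width, 3), round(y_px / height, 3))

# Example: the center of a button at pixel (640, 360) on a 1920x1080 screenshot.
print(normalize_point(640, 360, 1920, 1080))  # -> (0.333, 0.333)
```
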
## Results

We evaluate POINTS-GUI-G-8B on four widely used GUI grounding benchmarks: ScreenSpot-v2, ScreenSpot-Pro, OSWorld-G, and UI-Vision. The figure below summarizes our results compared with existing open-source and proprietary baselines.

![Benchmark results](images/results.png)

## Getting Started

### Run with Transformers

Please first install [WePOINTS](https://github.com/WePOINTS/WePOINTS) using the following commands:

```sh
git clone https://github.com/WePOINTS/WePOINTS.git
cd ./WePOINTS
pip install -e .
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor
import torch

# System prompt for point-style grounding: the model returns the center of the
# target element as a normalized (x, y) pair.
system_prompt_point = (
    'You are a GUI agent. Based on the UI screenshot provided, please locate '
    'the exact position of the element that matches the instruction given by '
    'the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the point (x, y) representing the center of the target element\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x, y) without any additional text\n'
)
# System prompt for box-style grounding: the model returns the bounding box of
# the target element as a normalized (x0, y0, x1, y1) tuple.
system_prompt_bbox = (
    'You are a GUI agent. Based on the UI screenshot provided, please output '
    'the bounding box of the element that matches the instruction given by '
    'the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
)

system_prompt = system_prompt_point  # or system_prompt_bbox
user_prompt = "Click the 'Login' button"  # replace with your instruction
image_path = '/path/to/your/local/image'
model_path = 'tencent/POINTS-GUI-G'

# trust_remote_code is required because the model ships its own modeling and
# chat code.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             dtype=torch.bfloat16,
                                             device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
image_processor = Qwen2VLImageProcessor.from_pretrained(model_path)

# Build a multimodal conversation: the screenshot plus the text instruction.
content = [
    dict(type='image', image=image_path),
    dict(type='text', text=user_prompt)
]
messages = [
    {
        'role': 'system',
        'content': [dict(type='text', text=system_prompt)]
    },
    {
        'role': 'user',
        'content': content
    }
]
# Greedy decoding keeps the coordinate output deterministic.
generation_config = {
    'max_new_tokens': 2048,
    'do_sample': False
}
response = model.chat(
    messages,
    tokenizer,
    image_processor,
    generation_config
)
print(response)
```
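
With the point prompt, the model replies with a bare string such as `(0.512, 0.734)`. A minimal sketch for mapping that reply back to pixel coordinates on the original screenshot (the `parse_point` helper and its regular expression are our own illustration, not part of the model's API; the bounding-box prompt returns four numbers and needs a correspondingly extended pattern):

```python
import re

def parse_point(response: str, width: int, height: int) -> tuple[int, int]:
    """Parse a normalized '(x, y)' reply into pixel coordinates.

    Hypothetical helper for illustration; not part of the released code.
    """
    match = re.fullmatch(r'\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)', response.strip())
    if match is None:
        raise ValueError(f'Unexpected response format: {response!r}')
    x, y = float(match.group(1)), float(match.group(2))
    return (round(x * width), round(y * height))

# Example: map a normalized reply back onto a 1920x1080 screenshot.
print(parse_point('(0.512, 0.734)', 1920, 1080))  # -> (983, 793)
```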

## Citation

If you use this model in your work, please cite the following papers:

```bibtex
@article{zhao2026pointsguigguigroundingjourney,
  title   = {POINTS-GUI-G: GUI-Grounding Journey},
  author  = {Zhao, Zhongyin and Liu, Yuan and Liu, Yikun and Wang, Haicheng and Tian, Le and Zhou, Xiao and You, Yangxiu and Yu, Zilin and Yu, Yang and Zhou, Jie},
  journal = {arXiv preprint arXiv:2602.06391},
  year    = {2026}
}

@inproceedings{liu2025points,
  title     = {POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion},
  author    = {Liu, Yuan and Zhao, Zhongyin and Tian, Le and Wang, Haicheng and Ye, Xubing and You, Yangxiu and Yu, Zilin and Wu, Chuhan and Xiao, Zhou and Yu, Yang and others},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages     = {1576--1601},
  year      = {2025}
}
```