Improve model card metadata and add paper/code links

#2
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +25 -23
README.md CHANGED
@@ -1,7 +1,9 @@
 ---
-license: mit
 base_model:
 - microsoft/Phi-3.5-vision-instruct
+license: mit
+pipeline_tag: image-text-to-text
+library_name: transformers
 tags:
 - GUI
 - Agent
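The added `pipeline_tag` and `library_name` keys are what the Hub uses to render the inference widget and library badge. As a post-merge sanity check, the front matter can be read back with `huggingface_hub` (a minimal sketch; the repo id is assumed from the README's model link):

```python
from huggingface_hub import ModelCard

# Repo id assumed from the README's model link; adjust if the card lives elsewhere.
card = ModelCard.load("microsoft/Phi-Ground-Any")

# With this PR applied, both keys appear in the card's YAML front matter.
print(card.data.pipeline_tag)  # image-text-to-text
print(card.data.library_name)  # transformers
```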
@@ -12,22 +14,23 @@ tags:
 # Microsoft Phi-Ground-Any-4B
 
 <p align="center">
-<a href="https://microsoft.github.io/Phi-Ground/" target="_blank">🤖 HomePage</a> | <a href="https://arxiv.org/abs/2605.12501" target="_blank">📄 Paper </a> | <a href="https://arxiv.org/abs/2605.12501" target="_blank">📄 Arxiv </a> | <a href="https://huggingface.co/microsoft/Phi-Ground-Any" target="_blank"> 😊 Model </a> | <a href="https://github.com/microsoft/Phi-Ground/tree/main/benchmark/" target="_blank"> 😊 Eval data </a>
+<a href="https://microsoft.github.io/Phi-Ground/" target="_blank">🤖 HomePage</a> | <a href="https://arxiv.org/abs/2605.12501" target="_blank">📄 Paper </a> | <a href="https://github.com/microsoft/Phi-Ground" target="_blank"> 💻 Code </a> | <a href="https://huggingface.co/microsoft/Phi-Ground-Any" target="_blank"> 😊 Model </a> | <a href="https://github.com/microsoft/Phi-Ground/tree/main/benchmark/" target="_blank"> 😊 Eval data </a>
 </p>
 
-![overview](docs/images/intro.png)
+**Phi-Ground-Any-4B** is a foundational grounding model for Computer Use Agents (CUAs), introduced in the paper "[Covering Human Action Space for Computer Use: Data Synthesis and Benchmark](https://arxiv.org/abs/2605.12501)". It is fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1680x1008.
 
-**Phi-Ground-Any-4B** is one of the Phi-Ground model family, finetuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with fixed input resolution 1680x1008.
+The model excels at complex interactions across five modalities: GUI, text, table, canvas, and natural image, supporting a variety of actions including clicking, dragging, and drawing.
 
 ### Main results
 
 ![overview](docs/images/r1.png)
 
 ### Usage
+
 The current `transformers` version can be verified with: `pip list | grep transformers`.
 
 Examples of required packages:
-```
+```bash
 flash_attn==2.5.8
 numpy==1.24.4
 Pillow==10.3.0
@@ -38,17 +41,15 @@ transformers==4.43.0
 accelerate==0.30.0
 ```
 
-
 ### Input Formats
 
-The model require strict input format including fixed image resolution, instruction-first order and system prompt.
+The model requires a strict input format including fixed image resolution, instruction-first order, and a specific system prompt.
 
-Input preprocessing
+#### Input Preprocessing
 
 ```python
 from PIL import Image
 
-
 def process_image(img):
     # Phi-Ground-Anything uses a larger 5x3-tile canvas (1680 x 1008).
     target_width, target_height = 336 * 5, 336 * 3
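Between this hunk and the next, the diff collapses the unchanged middle of `process_image`; only the 1680x1008 canvas constants above and the `(new_img, reshape_ratio)` return below are visible. A minimal sketch of an aspect-preserving resize-and-pad step consistent with those visible pieces (the top-left anchoring, rounding, and white fill here are assumptions, not the checkpoint's verified preprocessing):

```python
from PIL import Image

def process_image_sketch(img: Image.Image):
    # Same fixed 5x3-tile canvas as the visible lines of process_image.
    target_width, target_height = 336 * 5, 336 * 3

    # Scale so the screenshot fits the canvas without distorting aspect ratio.
    reshape_ratio = min(target_width / img.width, target_height / img.height)
    resized = img.resize(
        (round(img.width * reshape_ratio), round(img.height * reshape_ratio))
    )

    # Anchor top-left and pad the remainder (anchor and fill color are
    # assumptions); top-left anchoring is what lets "divide by reshape_ratio"
    # recover original-image coordinates in to_original_pixel below.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    new_img.paste(resized, (0, 0))
    return new_img, reshape_ratio
```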
@@ -71,13 +72,9 @@ def process_image(img):
     return new_img, reshape_ratio
 
 
-# Phi-Ground-Anything takes the user instruction directly (no "describe the
-# element" wrapper) and is trained to emit the click point as
+# Phi-Ground-Anything takes the user instruction directly and is trained to emit the click point as
 # <x>VALUE</x><y>VALUE</y>
-# where VALUE is a relative coordinate in [0, 10000] over the padded canvas
-# (i.e., divide by 10000 and multiply by target_width / target_height to get
-# pixel coords in the padded image, then divide by reshape_ratio to recover
-# coords in the ORIGINAL image).
+# where VALUE is a relative coordinate in [0, 10000] over the padded canvas.
 instruction = "<your instruction>"
 prompt = """<|user|>
 {instruction}<|image_1|>
@@ -87,11 +84,11 @@ prompt = """<|user|>
 image_path = "<your image path>"
 original_image = Image.open(image_path).convert("RGB")
 image, reshape_ratio = process_image(original_image)
+```
 
+#### Output Parsing and Coordinate Recovery
 
-# ---------------------------------------------------------------------------
-# Example: parse the model output and recover original-image coordinates.
-# ---------------------------------------------------------------------------
+```python
 import re
 
 target_width, target_height = 336 * 5, 336 * 3
@@ -100,24 +97,29 @@ SCALE = 10000.0
 x_pattern = re.compile(r"<x>\s*(-?\d+(?:\.\d+)?)\s*</x>")
 y_pattern = re.compile(r"<y>\s*(-?\d+(?:\.\d+)?)\s*</y>")
 
-
 def parse_xy(model_output: str):
     xs = [float(v) for v in x_pattern.findall(model_output)]
     ys = [float(v) for v in y_pattern.findall(model_output)]
     return list(zip(xs, ys))
 
-
 def to_original_pixel(rel_xy, reshape_ratio: float):
     x_rel, y_rel = rel_xy
     px = (x_rel / SCALE) * target_width / reshape_ratio
     py = (y_rel / SCALE) * target_height / reshape_ratio
     return px, py
 
-
+# Example:
 # model_output = "<x>4823</x><y>3120</y>"
 # point_orig = to_original_pixel(parse_xy(model_output)[0], reshape_ratio)
-
 ```
 
-
-Then you can use huggingface model or [vllm](https://github.com/vllm-project/vllm) to inference.
+## Citation
+
+```bibtex
+@article{zhang2025phi,
+  title={Covering Human Action Space for Computer Use: Data Synthesis and Benchmark},
+  author={Zhang, Miaosen and Zhao, Xiaohan and Tan, Zhihong and Huoshen, Zhou and Fan, Yijia and Yang, Yifan and Qiu, Kai and Liu, Bei and Wagle, Justin and Yin, Chenzhong and others},
+  journal={arXiv preprint arXiv:2605.12501},
+  year={2025}
+}
+```
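A worked example of the recovery step, reusing `parse_xy` and `to_original_pixel` from the snippet above and assuming a 1920x1080 screenshot (the numbers are illustrative):

```python
# For a 1920x1080 screenshot: min(1680 / 1920, 1008 / 1080) = 0.875
reshape_ratio = min(1680 / 1920, 1008 / 1080)

# Model emits relative coordinates in [0, 10000] over the 1680x1008 canvas.
point = to_original_pixel(parse_xy("<x>4823</x><y>3120</y>")[0], reshape_ratio)
print(point)  # (926.016, 359.424): 0.4823 * 1680 / 0.875, 0.3120 * 1008 / 0.875
```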
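The line removed in the last hunk pointed readers at plain `transformers` or [vLLM](https://github.com/vllm-project/vllm) for inference. A minimal `transformers` generation sketch, assuming the checkpoint loads the way its base model [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) does (custom processor and model classes via `trust_remote_code`); the repo id, dtype, and generation settings are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-Ground-Any"  # assumed from the README's model link

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",  # matches the pinned flash_attn
).to("cuda")

# `instruction`, `prompt`, and `image` come from the Input Preprocessing
# snippet; the elided prompt lines presumably substitute the instruction.
inputs = processor(
    prompt.format(instruction=instruction), [image], return_tensors="pt"
).to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
decoded = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(parse_xy(decoded))  # e.g. [(4823.0, 3120.0)]
```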