Improve model card metadata and add paper/code links

#2
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +25 -23
README.md CHANGED
@@ -1,7 +1,9 @@
 ---
-license: mit
 base_model:
 - microsoft/Phi-3.5-vision-instruct
+license: mit
+pipeline_tag: image-text-to-text
+library_name: transformers
 tags:
 - GUI
 - Agent
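The added `pipeline_tag` and `library_name` keys are what the Hub uses to render the inference widget and library badge. As a post-merge sanity check, the front matter can be read back with `huggingface_hub` (a minimal sketch; the repo id is assumed from the README's model link):

```python
from huggingface_hub import ModelCard

# Repo id assumed from the README's model link; adjust if the card lives elsewhere.
card = ModelCard.load("microsoft/Phi-Ground-Any")

# With this PR applied, both keys appear in the card's YAML front matter.
print(card.data.pipeline_tag)  # image-text-to-text
print(card.data.library_name)  # transformers
```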
@@ -12,22 +14,23 @@ tags:
 # Microsoft Phi-Ground-Any-4B
 
 <p align="center">
-<a href="https://microsoft.github.io/Phi-Ground/" target="_blank">🤖 HomePage</a> | <a href="https://arxiv.org/abs/2605.12501" target="_blank">📄 Paper </a> | <a href="https://arxiv.org/abs/2605.12501" target="_blank">📄 Arxiv </a> | <a href="https://huggingface.co/microsoft/Phi-Ground-Any" target="_blank"> 😊 Model </a> | <a href="https://github.com/microsoft/Phi-Ground/tree/main/benchmark/" target="_blank"> 😊 Eval data </a>
+<a href="https://microsoft.github.io/Phi-Ground/" target="_blank">🤖 HomePage</a> | <a href="https://arxiv.org/abs/2605.12501" target="_blank">📄 Paper </a> | <a href="https://github.com/microsoft/Phi-Ground" target="_blank"> 💻 Code </a> | <a href="https://huggingface.co/microsoft/Phi-Ground-Any" target="_blank"> 😊 Model </a> | <a href="https://github.com/microsoft/Phi-Ground/tree/main/benchmark/" target="_blank"> 😊 Eval data </a>
 </p>
 
-![overview](docs/images/intro.png)
+**Phi-Ground-Any-4B** is a foundational grounding model for Computer Use Agents (CUAs), introduced in the paper "[Covering Human Action Space for Computer Use: Data Synthesis and Benchmark](https://arxiv.org/abs/2605.12501)". It is fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1680x1008.
 
-**Phi-Ground-Any-4B** is one of the Phi-Ground model family, finetuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with fixed input resolution 1680x1008.
+The model excels at complex interactions across five modalities: GUI, text, table, canvas, and natural image, supporting a variety of actions including clicking, dragging, and drawing.
 
 ### Main results
 
 ![overview](docs/images/r1.png)
 
 ### Usage
+
 The current `transformers` version can be verified with: `pip list | grep transformers`.
 
 Examples of required packages:
-```
+```bash
 flash_attn==2.5.8
 numpy==1.24.4
 Pillow==10.3.0
@@ -38,17 +41,15 @@ transformers==4.43.0
 accelerate==0.30.0
 ```
 
-
 ### Input Formats
 
-The model require strict input format including fixed image resolution, instruction-first order and system prompt.
+The model requires a strict input format including fixed image resolution, instruction-first order, and a specific system prompt.
 
-Input preprocessing
+#### Input Preprocessing
 
 ```python
 from PIL import Image
 
-
 def process_image(img):
     # Phi-Ground-Anything uses a larger 5x3-tile canvas (1680 x 1008).
     target_width, target_height = 336 * 5, 336 * 3
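Between this hunk and the next, the diff collapses the unchanged middle of `process_image`; only the 1680x1008 canvas constants above and the `(new_img, reshape_ratio)` return below are visible. A minimal sketch of an aspect-preserving resize-and-pad step consistent with those visible pieces (the top-left anchoring, rounding, and white fill here are assumptions, not the checkpoint's verified preprocessing):

```python
from PIL import Image

def process_image_sketch(img: Image.Image):
    # Same fixed 5x3-tile canvas as the visible lines of process_image.
    target_width, target_height = 336 * 5, 336 * 3

    # Scale so the screenshot fits the canvas without distorting aspect ratio.
    reshape_ratio = min(target_width / img.width, target_height / img.height)
    resized = img.resize(
        (round(img.width * reshape_ratio), round(img.height * reshape_ratio))
    )

    # Anchor top-left and pad the remainder (anchor and fill color are
    # assumptions); top-left anchoring is what lets "divide by reshape_ratio"
    # recover original-image coordinates in to_original_pixel below.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    new_img.paste(resized, (0, 0))
    return new_img, reshape_ratio
```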
@@ -71,13 +72,9 @@ def process_image(img):
     return new_img, reshape_ratio
 
 
-# Phi-Ground-Anything takes the user instruction directly (no "describe the
-# element" wrapper) and is trained to emit the click point as
+# Phi-Ground-Anything takes the user instruction directly and is trained to emit the click point as
 # <x>VALUE</x><y>VALUE</y>
-# where VALUE is a relative coordinate in [0, 10000] over the padded canvas
-# (i.e., divide by 10000 and multiply by target_width / target_height to get
-# pixel coords in the padded image, then divide by reshape_ratio to recover
-# coords in the ORIGINAL image).
+# where VALUE is a relative coordinate in [0, 10000] over the padded canvas.
 instruction = "<your instruction>"
 prompt = """<|user|>
 {instruction}<|image_1|>
@@ -87,11 +84,11 @@ prompt = """<|user|>
 image_path = "<your image path>"
 original_image = Image.open(image_path).convert("RGB")
 image, reshape_ratio = process_image(original_image)
+```
 
+#### Output Parsing and Coordinate Recovery
 
-# ---------------------------------------------------------------------------
-# Example: parse the model output and recover original-image coordinates.
-# ---------------------------------------------------------------------------
+```python
 import re
 
 target_width, target_height = 336 * 5, 336 * 3
@@ -100,24 +97,29 @@ SCALE = 10000.0
 x_pattern = re.compile(r"<x>\s*(-?\d+(?:\.\d+)?)\s*</x>")
 y_pattern = re.compile(r"<y>\s*(-?\d+(?:\.\d+)?)\s*</y>")
 
-
 def parse_xy(model_output: str):
     xs = [float(v) for v in x_pattern.findall(model_output)]
     ys = [float(v) for v in y_pattern.findall(model_output)]
     return list(zip(xs, ys))
 
-
 def to_original_pixel(rel_xy, reshape_ratio: float):
     x_rel, y_rel = rel_xy
     px = (x_rel / SCALE) * target_width / reshape_ratio
     py = (y_rel / SCALE) * target_height / reshape_ratio
     return px, py
 
-
+# Example:
 # model_output = "<x>4823</x><y>3120</y>"
 # point_orig = to_original_pixel(parse_xy(model_output)[0], reshape_ratio)
-
 ```
 
-
-Then you can use huggingface model or [vllm](https://github.com/vllm-project/vllm) to inference.
+## Citation
+
+```bibtex
+@article{zhang2025phi,
+  title={Covering Human Action Space for Computer Use: Data Synthesis and Benchmark},
+  author={Zhang, Miaosen and Zhao, Xiaohan and Tan, Zhihong and Huoshen, Zhou and Fan, Yijia and Yang, Yifan and Qiu, Kai and Liu, Bei and Wagle, Justin and Yin, Chenzhong and others},
+  journal={arXiv preprint arXiv:2605.12501},
+  year={2025}
+}
+```
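A worked example of the recovery step, reusing `parse_xy` and `to_original_pixel` from the snippet above and assuming a 1920x1080 screenshot (the numbers are illustrative):

```python
# For a 1920x1080 screenshot: min(1680 / 1920, 1008 / 1080) = 0.875
reshape_ratio = min(1680 / 1920, 1008 / 1080)

# Model emits relative coordinates in [0, 10000] over the 1680x1008 canvas.
point = to_original_pixel(parse_xy("<x>4823</x><y>3120</y>")[0], reshape_ratio)
print(point)  # (926.016, 359.424): 0.4823 * 1680 / 0.875, 0.3120 * 1008 / 0.875
```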
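The line removed in the last hunk pointed readers at plain `transformers` or [vLLM](https://github.com/vllm-project/vllm) for inference. A minimal `transformers` generation sketch, assuming the checkpoint loads the way its base model [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) does (custom processor and model classes via `trust_remote_code`); the repo id, dtype, and generation settings are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-Ground-Any"  # assumed from the README's model link

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",  # matches the pinned flash_attn
).to("cuda")

# `instruction`, `prompt`, and `image` come from the Input Preprocessing
# snippet; the elided prompt lines presumably substitute the instruction.
inputs = processor(
    prompt.format(instruction=instruction), [image], return_tensors="pt"
).to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
decoded = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(parse_xy(decoded))  # e.g. [(4823.0, 3120.0)]
```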