---
base_model:
- microsoft/Phi-3.5-vision-instruct
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- GUI
- Agent
- Grounding
- CUA
---

# Microsoft Phi-Ground-Any-4B

<p align="center">
   <a href="https://microsoft.github.io/Phi-Ground/" target="_blank">🤖 HomePage</a> | <a href="https://arxiv.org/abs/2605.12501" target="_blank">📄 Paper</a> | <a href="https://github.com/microsoft/Phi-Ground" target="_blank">💻 Code</a> | <a href="https://huggingface.co/microsoft/Phi-Ground-Any" target="_blank">😊 Model</a> | <a href="https://github.com/microsoft/Phi-Ground/tree/main/benchmark/" target="_blank">😊 Eval data</a>
</p>

**Phi-Ground-Any-4B** is a foundational grounding model for Computer Use Agents (CUAs), introduced in the paper "[Covering Human Action Space for Computer Use: Data Synthesis and Benchmark](https://arxiv.org/abs/2605.12501)". It is fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1680x1008. 

The model excels at complex interactions across five modalities: GUI, text, table, canvas, and natural image, supporting a variety of actions including clicking, dragging, and drawing.

### Main results

![overview](docs/images/r1.png)

### Usage

You can verify the installed `transformers` version with `pip list | grep transformers`.

Examples of required packages:
```bash
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
```
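
Loading is not covered above, so here is a minimal sketch. It assumes the checkpoint follows the same `transformers` loading pattern as the base [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) model (remote code plus `AutoProcessor`); the dtype, device map, and attention implementation are illustrative and should be adjusted to your environment.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumption: the checkpoint loads like the base Phi-3.5-vision-instruct model.
model_id = "microsoft/Phi-Ground-Any"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",  # or "eager" if flash_attn is unavailable
)
# Processor arguments (e.g. num_crops) may need to match the official Phi-Ground
# configuration; see the repository linked above.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```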

### Input Formats

The model requires a strict input format: a fixed image resolution, the instruction placed before the image token, and the specific prompt template shown below.

#### Input Preprocessing

```python
from PIL import Image

def process_image(img):
    # Phi-Ground-Anything uses a fixed 5x3-tile canvas of 1680 x 1008 pixels.
    target_width, target_height = 336 * 5, 336 * 3

    img_ratio = img.width / img.height
    target_ratio = target_width / target_height

    # Scale the image to fit inside the canvas while preserving its aspect ratio.
    if img_ratio > target_ratio:
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        new_height = target_height
        new_width = int(new_height * img_ratio)
    # Single scale factor from original to resized pixels (aspect ratio is preserved).
    reshape_ratio = new_width / img.width

    img = img.resize((new_width, new_height), Image.LANCZOS)
    # Paste at the top-left corner; the unused canvas area remains white padding.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    paste_position = (0, 0)
    new_img.paste(img, paste_position)
    return new_img, reshape_ratio


# Phi-Ground-Anything takes the user instruction directly and is trained to emit the click point as
#   <x>VALUE</x><y>VALUE</y>
# where VALUE is a relative coordinate in [0, 10000] over the padded canvas.
instruction = "<your instruction>"
prompt = """<|user|> 
{instruction}<|image_1|> 
<|end|> 
<|assistant|>""".format(instruction=instruction)

image_path = "<your image path>"
original_image = Image.open(image_path).convert("RGB")
image, reshape_ratio = process_image(original_image)
```
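
With `prompt` and the padded `image` prepared as above, a minimal generation sketch follows. It assumes the standard Phi-3.5-vision processor and `generate` interface from the loading sketch earlier; greedy decoding is used since the grounding output is short and deterministic.

```python
# Assumes `model` and `processor` were loaded as in the Usage section above.
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

generate_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,  # greedy decoding for a short, deterministic click point
    eos_token_id=processor.tokenizer.eos_token_id,
)
# Drop the prompt tokens and decode only the newly generated text.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
model_output = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(model_output)  # expected shape: "<x>...</x><y>...</y>"
```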

#### Output Parsing and Coordinate Recovery

```python
import re

target_width, target_height = 336 * 5, 336 * 3
SCALE = 10000.0

x_pattern = re.compile(r"<x>\s*(-?\d+(?:\.\d+)?)\s*</x>")
y_pattern = re.compile(r"<y>\s*(-?\d+(?:\.\d+)?)\s*</y>")

def parse_xy(model_output: str):
    # Extract all (x, y) pairs emitted as <x>...</x><y>...</y> tags.
    xs = [float(v) for v in x_pattern.findall(model_output)]
    ys = [float(v) for v in y_pattern.findall(model_output)]
    return list(zip(xs, ys))

def to_original_pixel(rel_xy, reshape_ratio: float):
    # Map a relative coordinate in [0, 10000] on the padded canvas back to
    # pixel coordinates in the original (unresized, unpadded) image.
    x_rel, y_rel = rel_xy
    px = (x_rel / SCALE) * target_width / reshape_ratio
    py = (y_rel / SCALE) * target_height / reshape_ratio
    return px, py

# Example:
# model_output = "<x>4823</x><y>3120</y>"
# point_orig = to_original_pixel(parse_xy(model_output)[0], reshape_ratio)
```
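
As a sanity check on the arithmetic (hypothetical numbers): for a 1920x1080 screenshot, `process_image` keeps the full width, so `reshape_ratio = 1680 / 1920 = 0.875`; the example output `<x>4823</x><y>3120</y>` then maps back to `(0.4823 * 1680 / 0.875, 0.3120 * 1008 / 0.875) ≈ (926, 359)` pixels in the original screenshot.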

## Citation

```bibtex
@article{zhang2025phi,
  title={Covering Human Action Space for Computer Use: Data Synthesis and Benchmark},
  author={Zhang, Miaosen and Zhao, Xiaohan and Tan, Zhihong and Huoshen, Zhou and Fan, Yijia and Yang, Yifan and Qiu, Kai and Liu, Bei and Wagle, Justin and Yin, Chenzhong and others},
  journal={arXiv preprint arXiv:2605.12501},
  year={2025}
}
```