---
base_model:
- microsoft/Phi-3.5-vision-instruct
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- GUI
- Agent
- Grounding
- CUA
---
# Microsoft Phi-Ground-Any-4B
<p align="center">
<a href="https://microsoft.github.io/Phi-Ground/" target="_blank">🤗 HomePage</a> | <a href="https://arxiv.org/abs/2605.12501" target="_blank">📄 Paper</a> | <a href="https://github.com/microsoft/Phi-Ground" target="_blank">💻 Code</a> | <a href="https://huggingface.co/microsoft/Phi-Ground-Any" target="_blank">🤗 Model</a> | <a href="https://github.com/microsoft/Phi-Ground/tree/main/benchmark/" target="_blank">📊 Eval data</a>
</p>
**Phi-Ground-Any-4B** is a foundational grounding model for Computer Use Agents (CUAs), introduced in the paper "[Covering Human Action Space for Computer Use: Data Synthesis and Benchmark](https://arxiv.org/abs/2605.12501)". It is fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1680x1008.
The model excels at complex interactions across five modalities (GUI, text, table, canvas, and natural images) and supports a variety of actions, including clicking, dragging, and drawing.
### Main results
Detailed benchmark results are reported in the [paper](https://arxiv.org/abs/2605.12501) and on the [project page](https://microsoft.github.io/Phi-Ground/).
### Usage
The current `transformers` version can be verified with: `pip list | grep transformers`.
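Equivalently, you can check it from Python:
```python
import transformers
print(transformers.__version__)  # the pins below assume 4.43.0
```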
Examples of required packages:
```bash
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
```
### Input Formats
The model requires a strict input format: a fixed image resolution, instruction-first ordering, and the specific prompt template shown below.
#### Input Preprocessing
```python
from PIL import Image

def process_image(img):
    # Phi-Ground-Any uses a larger 5x3-tile canvas (1680 x 1008).
    target_width, target_height = 336 * 5, 336 * 3
    img_ratio = img.width / img.height
    target_ratio = target_width / target_height
    if img_ratio > target_ratio:
        # Image is wider than the canvas: fit to width.
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        # Image is taller than the canvas: fit to height.
        new_height = target_height
        new_width = int(new_height * img_ratio)
    reshape_ratio = new_width / img.width
    img = img.resize((new_width, new_height), Image.LANCZOS)
    # Paste the resized image onto a white canvas, anchored at the top-left corner.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    new_img.paste(img, (0, 0))
    return new_img, reshape_ratio

# Phi-Ground-Any takes the user instruction directly and is trained to emit the click point as
#   <x>VALUE</x><y>VALUE</y>
# where VALUE is a relative coordinate in [0, 10000] over the padded canvas.
instruction = "<your instruction>"
prompt = """<|user|>
{instruction}<|image_1|>
<|end|>
<|assistant|>""".format(instruction=instruction)

image_path = "<your image path>"
original_image = Image.open(image_path).convert("RGB")
image, reshape_ratio = process_image(original_image)
```
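#### Model Inference
The snippet below is a minimal inference sketch following the usage pattern of the base Phi-3.5-vision-instruct model. The model id `microsoft/Phi-Ground-Any` is taken from the model link above, while the dtype, attention implementation, and generation settings are illustrative assumptions, not verified against the released checkpoint.
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed model id, taken from the model link above.
model_id = "microsoft/Phi-Ground-Any"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",  # requires flash_attn; fall back to "eager" otherwise
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# `prompt` and `image` come from the preprocessing snippet above.
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")
generate_ids = model.generate(
    **inputs,
    max_new_tokens=64,  # assumed budget; the coordinate string is short
    eos_token_id=processor.tokenizer.eos_token_id,
)
# Strip the prompt tokens and decode only the newly generated text.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
model_output = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
```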
#### Output Parsing and Coordinate Recovery
```python
import re

target_width, target_height = 336 * 5, 336 * 3
SCALE = 10000.0

x_pattern = re.compile(r"<x>\s*(-?\d+(?:\.\d+)?)\s*</x>")
y_pattern = re.compile(r"<y>\s*(-?\d+(?:\.\d+)?)\s*</y>")

def parse_xy(model_output: str):
    # Extract every (x, y) pair emitted by the model, in order.
    xs = [float(v) for v in x_pattern.findall(model_output)]
    ys = [float(v) for v in y_pattern.findall(model_output)]
    return list(zip(xs, ys))

def to_original_pixel(rel_xy, reshape_ratio: float):
    # Map a relative [0, 10000] coordinate on the padded canvas back to a pixel
    # in the original screenshot by undoing the resize (division by reshape_ratio).
    x_rel, y_rel = rel_xy
    px = (x_rel / SCALE) * target_width / reshape_ratio
    py = (y_rel / SCALE) * target_height / reshape_ratio
    return px, py

# Example:
# model_output = "<x>4823</x><y>3120</y>"
# point_orig = to_original_pixel(parse_xy(model_output)[0], reshape_ratio)
```
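As a concrete illustration: for a 1920x1080 screenshot, `process_image` fits to width, giving `reshape_ratio = 1680 / 1920 = 0.875`. A model output of `<x>4823</x><y>3120</y>` then recovers to `px = 0.4823 * 1680 / 0.875 ≈ 926.0` and `py = 0.3120 * 1008 / 0.875 ≈ 359.4` in original-image pixels. Because `parse_xy` returns every `<x>…</x><y>…</y>` pair it finds, outputs for multi-point actions such as drags can be recovered the same way, pair by pair.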
## Citation
```bibtex
@article{zhang2025phi,
  title={Covering Human Action Space for Computer Use: Data Synthesis and Benchmark},
  author={Zhang, Miaosen and Zhao, Xiaohan and Tan, Zhihong and Huoshen, Zhou and Fan, Yijia and Yang, Yifan and Qiu, Kai and Liu, Bei and Wagle, Justin and Yin, Chenzhong and others},
  journal={arXiv preprint arXiv:2605.12501},
  year={2025}
}
```