---
license: mit
base_model:
- microsoft/Phi-3.5-vision-instruct
tags:
- GUI
- Agent
- Grounding
- CUA
---
# Microsoft Phi-Ground-Any-4B
<p align="center">
<a href="https://microsoft.github.io/Phi-Ground/" target="_blank">HomePage</a> | <a href="https://huggingface.co/papers/2507.23779" target="_blank">Paper</a> | <a href="https://arxiv.org/abs/2507.23779" target="_blank">Arxiv</a> | <a href="https://huggingface.co/microsoft/Phi-Ground" target="_blank">Model</a> | <a href="https://github.com/microsoft/Phi-Ground/tree/main/benchmark/" target="_blank">Eval data</a>
</p>

**Phi-Ground-Any-4B** is a member of the Phi-Ground model family, fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1680x1008.
### Main results

### Usage
The currently installed `transformers` version can be verified with `pip list | grep transformers`.
Examples of required packages:
```
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
```
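As an alternative to `pip list`, here is a minimal in-Python sanity check against the pinned versions above (a sketch; it assumes the packages are already installed):
```python
# Quick environment sanity check for the pinned versions listed above.
import torch
import transformers
import accelerate
import PIL

print("torch:", torch.__version__)                 # expected: 2.3.0
print("transformers:", transformers.__version__)   # expected: 4.43.0
print("accelerate:", accelerate.__version__)       # expected: 0.30.0
print("Pillow:", PIL.__version__)                  # expected: 10.3.0
print("CUDA available:", torch.cuda.is_available())
```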
### Input Formats
The model requires a strict input format, including a fixed image resolution, instruction-first ordering, and the system prompt.
Input preprocessing:
```python
from PIL import Image
def process_image(img):
    # Phi-Ground-Anything uses a larger 5x3-tile canvas (1680 x 1008).
    target_width, target_height = 336 * 5, 336 * 3
    img_ratio = img.width / img.height
    target_ratio = target_width / target_height
    if img_ratio > target_ratio:
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        new_height = target_height
        new_width = int(new_height * img_ratio)
    reshape_ratio = new_width / img.width
    img = img.resize((new_width, new_height), Image.LANCZOS)
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    paste_position = (0, 0)
    new_img.paste(img, paste_position)
    return new_img, reshape_ratio
# Phi-Ground-Anything takes the user instruction directly (no "describe the
# element" wrapper) and is trained to emit the click point as
# <x>VALUE</x><y>VALUE</y>
# where VALUE is a relative coordinate in [0, 10000] over the padded canvas
# (i.e., divide by 10000 and multiply by target_width / target_height to get
# pixel coords in the padded image, then divide by reshape_ratio to recover
# coords in the ORIGINAL image).
instruction = "<your instruction>"
prompt = """<|user|>
{instruction}<|image_1|>
<|end|>
<|assistant|>""".format(instruction=instruction)
image_path = "<your image path>"
original_image = Image.open(image_path).convert("RGB")
image, reshape_ratio = process_image(original_image)
# ---------------------------------------------------------------------------
# Example: parse the model output and recover original-image coordinates.
# ---------------------------------------------------------------------------
import re
target_width, target_height = 336 * 5, 336 * 3
SCALE = 10000.0
x_pattern = re.compile(r"<x>\s*(-?\d+(?:\.\d+)?)\s*</x>")
y_pattern = re.compile(r"<y>\s*(-?\d+(?:\.\d+)?)\s*</y>")
def parse_xy(model_output: str):
    xs = [float(v) for v in x_pattern.findall(model_output)]
    ys = [float(v) for v in y_pattern.findall(model_output)]
    return list(zip(xs, ys))
def to_original_pixel(rel_xy, reshape_ratio: float):
    x_rel, y_rel = rel_xy
    px = (x_rel / SCALE) * target_width / reshape_ratio
    py = (y_rel / SCALE) * target_height / reshape_ratio
    return px, py
# model_output = "<x>4823</x><y>3120</y>"
# point_orig = to_original_pixel(parse_xy(model_output)[0], reshape_ratio)
```
You can then run inference with the Hugging Face `transformers` library or [vllm](https://github.com/vllm-project/vllm).
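Below is a minimal `transformers` inference sketch that reuses `prompt`, `image`, `reshape_ratio`, `parse_xy`, and `to_original_pixel` from the preprocessing snippet above. The model id is a placeholder and the loading arguments mirror the base `microsoft/Phi-3.5-vision-instruct` model card; adjust them to your checkpoint and environment.
```python
# A hedged sketch: loading arguments follow the base Phi-3.5-vision-instruct
# card; the model id below is a placeholder, not a confirmed repo name.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "<Phi-Ground-Any-4B repo id or local path>"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",  # use "eager" if flash_attn is unavailable
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# `prompt` and `image` come from the preprocessing snippet above.
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    eos_token_id=processor.tokenizer.eos_token_id,
)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
model_output = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]

# Map the predicted relative point back to pixel coordinates in the original image.
point_orig = to_original_pixel(parse_xy(model_output)[0], reshape_ratio)
print(model_output, point_orig)
```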