---
base_model:
- microsoft/Phi-3.5-vision-instruct
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- GUI
- Agent
- Grounding
- CUA
---

# Microsoft Phi-Ground-Any-4B

<p align="center">
<a href="https://microsoft.github.io/Phi-Ground/" target="_blank">HomePage</a> | <a href="https://arxiv.org/abs/2605.12501" target="_blank">Paper</a> | <a href="https://github.com/microsoft/Phi-Ground" target="_blank">Code</a> | <a href="https://huggingface.co/microsoft/Phi-Ground-Any" target="_blank">Model</a> | <a href="https://github.com/microsoft/Phi-Ground/tree/main/benchmark/" target="_blank">Eval data</a>
</p>

**Phi-Ground-Any-4B** is a foundational grounding model for Computer Use Agents (CUAs), introduced in the paper "[Covering Human Action Space for Computer Use: Data Synthesis and Benchmark](https://arxiv.org/abs/2605.12501)". It is fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1680x1008.

The model excels at complex interactions across five modalities (GUI, text, table, canvas, and natural image) and supports a variety of actions, including clicking, dragging, and drawing.

### Main results

![overview](https://huggingface.co/microsoft/Phi-Ground-Any/resolve/main/docs/images/main_result.png)

### Usage

The installed `transformers` version can be checked with `pip list | grep transformers`.

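The same check can be run from Python for several packages at once. The snippet below is a minimal sketch using the standard-library `importlib.metadata`; the helper name `check_versions` is hypothetical, and the pinned versions are copied from this card's package list. It assumes the distribution names resolve as written in your environment.

```python
from importlib import metadata

# Versions copied from this card's package list (example pins, not hard requirements).
PINNED = {
    "flash_attn": "2.5.8",
    "numpy": "1.24.4",
    "Pillow": "10.3.0",
    "Requests": "2.31.0",
    "torch": "2.3.0",
    "torchvision": "0.18.0",
    "transformers": "4.43.0",
    "accelerate": "0.30.0",
}


def check_versions(pinned):
    """Return {name: (wanted, found, matches)}; found is None if not installed."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        report[name] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_versions(PINNED).items():
        status = "ok" if ok else "differs"
        print(f"{name}: wanted {wanted}, found {found} ({status})")
```

A version that differs from the pin is not necessarily a problem; the list only records one combination that is given as an example.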
Examples of required packages:
```bash
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
```

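This card does not show the model call itself. The following is a hedged sketch, assuming the checkpoint loads and generates the same way as the base `microsoft/Phi-3.5-vision-instruct` model (remote code, `AutoProcessor`, and `generate`). The helper names `load_model` and `generate_text` are hypothetical, the repo id is taken from the Model link above, and `max_new_tokens=256` is an arbitrary choice. The `prompt` and `image` arguments are the ones built in the Input Preprocessing section.

```python
MODEL_ID = "microsoft/Phi-Ground-Any"  # assumption: repo id from the Model link


def load_model(model_id=MODEL_ID, device="cuda"):
    """Load model and processor the way the base Phi-3.5-vision card does."""
    # Imported lazily so this file can be imported without the GPU stack.
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map=device,
        trust_remote_code=True,
        torch_dtype="auto",
        # Follows the base model card; use "eager" if flash_attn is unavailable.
        _attn_implementation="flash_attention_2",
    )
    # Assumption: the processor config shipped with the checkpoint already
    # carries the tiling settings matching the 1680 x 1008 canvas.
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor


def generate_text(model, processor, prompt, image, max_new_tokens=256):
    """Run greedy decoding and return only the newly generated text."""
    inputs = processor(prompt, [image], return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    # Drop the prompt tokens so only the model's answer is decoded.
    new_ids = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(
        new_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
```

If the decoded text comes back without `<x>`/`<y>` tags, one thing to try is `skip_special_tokens=False`; whether the tags are registered as special tokens is not stated in this card.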
### Input Formats

The model requires a strict input format: a fixed image resolution, instruction-first ordering (the instruction precedes the image placeholder), and a specific prompt template.

#### Input Preprocessing

```python
from PIL import Image

def process_image(img):
    # Phi-Ground-Anything uses a larger 5x3-tile canvas (1680 x 1008).
    target_width, target_height = 336 * 5, 336 * 3

    img_ratio = img.width / img.height
    target_ratio = target_width / target_height

    # Scale to fit inside the canvas while preserving the aspect ratio.
    if img_ratio > target_ratio:
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        new_height = target_height
        new_width = int(new_height * img_ratio)
    # Resized size / original size; needed later to map predictions back.
    reshape_ratio = new_width / img.width

    img = img.resize((new_width, new_height), Image.LANCZOS)
    # Pad with white and anchor the resized image at the top-left corner.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    paste_position = (0, 0)
    new_img.paste(img, paste_position)
    return new_img, reshape_ratio


# Phi-Ground-Anything takes the user instruction directly and is trained to emit the click point as
#   <x>VALUE</x><y>VALUE</y>
# where VALUE is a relative coordinate in [0, 10000] over the padded canvas.
instruction = "<your instruction>"
prompt = """<|user|>
{instruction}<|image_1|>
<|end|>
<|assistant|>""".format(instruction=instruction)

image_path = "<your image path>"
original_image = Image.open(image_path).convert("RGB")
image, reshape_ratio = process_image(original_image)
```
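
To make the resize-and-pad geometry concrete, here is a small worked example that mirrors the arithmetic of `process_image` without touching any pixels. The helper name `letterbox_geometry` and the two screenshot sizes are illustrative assumptions, not part of the released code.

```python
CANVAS_W, CANVAS_H = 336 * 5, 336 * 3  # 1680 x 1008, as in process_image


def letterbox_geometry(width, height):
    """Return the resized size, reshape_ratio, and white padding on each side."""
    img_ratio = width / height
    if img_ratio > CANVAS_W / CANVAS_H:
        # Wider than the canvas: width fills the canvas, padding goes below.
        new_w = CANVAS_W
        new_h = int(new_w / img_ratio)
    else:
        # Taller than (or equal to) the canvas: height fills it, padding goes right.
        new_h = CANVAS_H
        new_w = int(new_h * img_ratio)
    return {
        "resized": (new_w, new_h),
        "reshape_ratio": new_w / width,
        "pad_right": CANVAS_W - new_w,
        "pad_bottom": CANVAS_H - new_h,
    }


wide = letterbox_geometry(2000, 1000)  # resized to 1680 x 840, 168 px padding below
tall = letterbox_geometry(1000, 2000)  # resized to 504 x 1008, 1176 px padding right
print(wide)
print(tall)
```

Because the image is pasted at `(0, 0)`, padding only ever appears on the right or at the bottom, which is why a single `reshape_ratio` is enough to map canvas coordinates back to the original image.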

#### Output Parsing and Coordinate Recovery

```python
import re

target_width, target_height = 336 * 5, 336 * 3
SCALE = 10000.0

x_pattern = re.compile(r"<x>\s*(-?\d+(?:\.\d+)?)\s*</x>")
y_pattern = re.compile(r"<y>\s*(-?\d+(?:\.\d+)?)\s*</y>")

def parse_xy(model_output: str):
    # Collect every <x> and <y> value and pair them up in order of appearance.
    xs = [float(v) for v in x_pattern.findall(model_output)]
    ys = [float(v) for v in y_pattern.findall(model_output)]
    return list(zip(xs, ys))

def to_original_pixel(rel_xy, reshape_ratio: float):
    # Relative [0, 10000] -> padded-canvas pixels -> original-image pixels.
    x_rel, y_rel = rel_xy
    px = (x_rel / SCALE) * target_width / reshape_ratio
    py = (y_rel / SCALE) * target_height / reshape_ratio
    return px, py

# Example:
#   model_output = "<x>4823</x><y>3120</y>"
#   point_orig = to_original_pixel(parse_xy(model_output)[0], reshape_ratio)
```

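Since coordinates are relative to the padded canvas, a prediction can in principle fall in the white padding rather than on the screenshot. The sketch below works through the recovery numerically and flags such points. It is a stricter variant of the parsing above, not the released code: it matches `<x>…</x><y>…</y>` as a pair, and it assumes multiple points appear as consecutive pairs. The name `recover_points` and the sample output string are made up for illustration.

```python
import re

CANVAS_W, CANVAS_H = 336 * 5, 336 * 3  # 1680 x 1008
SCALE = 10000.0

# Matches one <x>..</x><y>..</y> pair at a time.
POINT = re.compile(
    r"<x>\s*(-?\d+(?:\.\d+)?)\s*</x>\s*<y>\s*(-?\d+(?:\.\d+)?)\s*</y>"
)


def recover_points(model_output, reshape_ratio, orig_size):
    """Return (px, py, in_image) for every point found in the model output."""
    width, height = orig_size
    points = []
    for x_rel, y_rel in POINT.findall(model_output):
        px = float(x_rel) / SCALE * CANVAS_W / reshape_ratio
        py = float(y_rel) / SCALE * CANVAS_H / reshape_ratio
        # False means the point landed in the padded region of the canvas.
        in_image = 0 <= px < width and 0 <= py < height
        points.append((px, py, in_image))
    return points


# A 2000 x 1000 screenshot is resized to 1680 x 840, so reshape_ratio = 0.84.
sample = "<x>4823</x><y>3120</y><x>5000</x><y>9000</y>"
pts = recover_points(sample, 0.84, (2000, 1000))
for px, py, in_image in pts:
    print(round(px, 1), round(py, 1), in_image)
```

For this input, the first point maps to roughly (964.6, 374.4) inside the screenshot, while the second maps to a y of about 1080, below the 1000-pixel-high original, so it is flagged as lying in the padding.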
## Citation

```bibtex
@article{zhang2025phi,
  title={Covering Human Action Space for Computer Use: Data Synthesis and Benchmark},
  author={Zhang, Miaosen and Zhao, Xiaohan and Tan, Zhihong and Huoshen, Zhou and Fan, Yijia and Yang, Yifan and Qiu, Kai and Liu, Bei and Wagle, Justin and Yin, Chenzhong and others},
  journal={arXiv preprint arXiv:2605.12501},
  year={2025}
}
```