---
base_model:
  - microsoft/Phi-3.5-vision-instruct
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - GUI
  - Agent
  - Grounding
  - CUA
---

# Microsoft Phi-Ground-Any-4B

🤖 HomePage | 📄 Paper | 💻 Code | 😊 Model | 😊 Eval data

Phi-Ground-Any-4B is a foundational grounding model for Computer Use Agents (CUAs), introduced in the paper "Covering Human Action Space for Computer Use: Data Synthesis and Benchmark". It is fine-tuned from microsoft/Phi-3.5-vision-instruct with a fixed input resolution of 1680x1008.

The model excels at complex interactions across five modalities: GUI, text, table, canvas, and natural image, supporting a variety of actions including clicking, dragging, and drawing.

## Main results

*(Figure: results overview.)*

## Usage

The currently installed `transformers` version can be verified with `pip list | grep transformers`.

Examples of required packages:

```
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
```
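
A minimal loading sketch, assuming the weights are published under the repository id `microsoft/Phi-Ground-Any-4B` (check the Model link above for the exact id). It follows the standard Phi-3.5-vision loading pattern with `trust_remote_code`:

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed repository id -- replace with the actual id from the Model link above.
model_id = "microsoft/Phi-Ground-Any-4B"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    trust_remote_code=True,                    # Phi-3.5-vision checkpoints ship custom code
    _attn_implementation="flash_attention_2",  # needs flash_attn; use "eager" if it is unavailable
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```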

## Input Formats

The model expects a strict input format: a fixed image resolution (1680x1008), instruction-first ordering, and the specific prompt template shown below.

### Input Preprocessing

```python
from PIL import Image

def process_image(img):
    """Resize the image to fit the 1680x1008 canvas while preserving aspect ratio,
    pad the remainder with white, and return the padded image together with the
    resize ratio needed to map model coordinates back to the original image."""
    # Phi-Ground-Anything uses a larger 5x3-tile canvas (1680 x 1008).
    target_width, target_height = 336 * 5, 336 * 3

    img_ratio = img.width / img.height
    target_ratio = target_width / target_height

    if img_ratio > target_ratio:
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        new_height = target_height
        new_width = int(new_height * img_ratio)
    reshape_ratio = new_width / img.width

    img = img.resize((new_width, new_height), Image.LANCZOS)
    # Pad to the full 1680x1008 canvas with white; the resized image sits at the
    # top-left, so padding only appears on the right and/or bottom edge.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    paste_position = (0, 0)
    new_img.paste(img, paste_position)
    return new_img, reshape_ratio


# Phi-Ground-Anything takes the user instruction directly and is trained to emit the click point as
#   <x>VALUE</x><y>VALUE</y>
# where VALUE is a relative coordinate in [0, 10000] over the padded canvas.
instruction = "<your instruction>"
prompt = """<|user|> 
{instruction}<|image_1|> 
<|end|> 
<|assistant|>""".format(instruction=instruction)

image_path = "<your image path>"
original_image = Image.open(image_path).convert("RGB")
image, reshape_ratio = process_image(original_image)
```
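
With the padded image and the instruction-first prompt prepared above, generation follows the usual Phi-3.5-vision flow. A sketch assuming the `model` and `processor` loaded in the Usage section:

```python
# Tokenize the prompt together with the padded 1680x1008 image.
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

generate_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,  # grounding is deterministic, so greedy decoding is enough
    eos_token_id=processor.tokenizer.eos_token_id,
)

# Drop the prompt tokens and decode only the newly generated answer.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
model_output = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(model_output)  # e.g. "<x>4823</x><y>3120</y>"
```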

### Output Parsing and Coordinate Recovery

```python
import re

target_width, target_height = 336 * 5, 336 * 3
SCALE = 10000.0

x_pattern = re.compile(r"<x>\s*(-?\d+(?:\.\d+)?)\s*</x>")
y_pattern = re.compile(r"<y>\s*(-?\d+(?:\.\d+)?)\s*</y>")

def parse_xy(model_output: str):
    xs = [float(v) for v in x_pattern.findall(model_output)]
    ys = [float(v) for v in y_pattern.findall(model_output)]
    return list(zip(xs, ys))

def to_original_pixel(rel_xy, reshape_ratio: float):
    x_rel, y_rel = rel_xy
    px = (x_rel / SCALE) * target_width / reshape_ratio
    py = (y_rel / SCALE) * target_height / reshape_ratio
    return px, py

# Example:
# model_output = "<x>4823</x><y>3120</y>"
# point_orig = to_original_pixel(parse_xy(model_output)[0], reshape_ratio)
```
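
As a quick sanity check, the recovered point can be drawn onto the original screenshot with Pillow alone. This sketch assumes `model_output`, `original_image`, and `reshape_ratio` from the snippets above:

```python
from PIL import ImageDraw

points = parse_xy(model_output)
if points:
    px, py = to_original_pixel(points[0], reshape_ratio)
    vis = original_image.copy()
    draw = ImageDraw.Draw(vis)
    r = 8  # marker radius in original-image pixels
    draw.ellipse((px - r, py - r, px + r, py + r), outline="red", width=3)
    vis.save("grounding_check.png")
```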

## Citation

```bibtex
@article{zhang2025phi,
  title={Covering Human Action Space for Computer Use: Data Synthesis and Benchmark},
  author={Zhang, Miaosen and Zhao, Xiaohan and Tan, Zhihong and Huoshen, Zhou and Fan, Yijia and Yang, Yifan and Qiu, Kai and Liu, Bei and Wagle, Justin and Yin, Chenzhong and others},
  journal={arXiv preprint arXiv:2605.12501},
  year={2025}
}
```