nielsr HF Staff

Improve model card metadata and add paper/code links

bcd18f1 verified about 13 hours ago

4.17 kB

	---
	base_model:
	- microsoft/Phi-3.5-vision-instruct
	license: mit
	pipeline_tag: image-text-to-text
	library_name: transformers
	tags:
	- GUI
	- Agent
	- Grounding
	- CUA
	---

	# Microsoft Phi-Ground-Any-4B

	<p align="center">
	<a href="https://microsoft.github.io/Phi-Ground/" target="_blank">🤖 HomePage</a> \| <a href="https://arxiv.org/abs/2605.12501" target="_blank">📄 Paper </a> \| <a href="https://github.com/microsoft/Phi-Ground" target="_blank"> 💻 Code </a> \| <a href="https://huggingface.co/microsoft/Phi-Ground-Any" target="_blank"> 😊 Model </a> \| <a href="https://github.com/microsoft/Phi-Ground/tree/main/benchmark/" target="_blank"> 😊 Eval data </a>
	</p>

	Phi-Ground-Any-4B is a foundational grounding model for Computer Use Agents (CUAs), introduced in the paper "[Covering Human Action Space for Computer Use: Data Synthesis and Benchmark](https://arxiv.org/abs/2605.12501)". It is fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1680x1008.

	The model excels at complex interactions across five modalities: GUI, text, table, canvas, and natural image, supporting a variety of actions including clicking, dragging, and drawing.

	### Main results

	![overview](docs/images/r1.png)

	### Usage

	The current `transformers` version can be verified with: `pip list \| grep transformers`.

	Examples of required packages:
	```bash
	flash_attn==2.5.8
	numpy==1.24.4
	Pillow==10.3.0
	Requests==2.31.0
	torch==2.3.0
	torchvision==0.18.0
	transformers==4.43.0
	accelerate==0.30.0
	```

	### Input Formats

	The model requires a strict input format including fixed image resolution, instruction-first order, and a specific system prompt.

	#### Input Preprocessing

	```python
	from PIL import Image

	def process_image(img):
	# Phi-Ground-Anything uses a larger 5x3-tile canvas (1680 x 1008).
	target_width, target_height = 336 * 5, 336 * 3

	img_ratio = img.width / img.height
	target_ratio = target_width / target_height

	if img_ratio > target_ratio:
	new_width = target_width
	new_height = int(new_width / img_ratio)
	else:
	new_height = target_height
	new_width = int(new_height * img_ratio)
	reshape_ratio = new_width / img.width

	img = img.resize((new_width, new_height), Image.LANCZOS)
	new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
	paste_position = (0, 0)
	new_img.paste(img, paste_position)
	return new_img, reshape_ratio


	# Phi-Ground-Anything takes the user instruction directly and is trained to emit the click point as
	# <x>VALUE</x><y>VALUE</y>
	# where VALUE is a relative coordinate in [0, 10000] over the padded canvas.
	instruction = "<your instruction>"
	prompt = """<\|user\|>
	{instruction}<\|image_1\|>
	<\|end\|>
	<\|assistant\|>""".format(instruction=instruction)

	image_path = "<your image path>"
	original_image = Image.open(image_path).convert("RGB")
	image, reshape_ratio = process_image(original_image)
	```

	#### Output Parsing and Coordinate Recovery

	```python
	import re

	target_width, target_height = 336 * 5, 336 * 3
	SCALE = 10000.0

	x_pattern = re.compile(r"<x>\s(-?\d+(?:\.\d+)?)\s</x>")
	y_pattern = re.compile(r"<y>\s(-?\d+(?:\.\d+)?)\s</y>")

	def parse_xy(model_output: str):
	xs = [float(v) for v in x_pattern.findall(model_output)]
	ys = [float(v) for v in y_pattern.findall(model_output)]
	return list(zip(xs, ys))

	def to_original_pixel(rel_xy, reshape_ratio: float):
	x_rel, y_rel = rel_xy
	px = (x_rel / SCALE) * target_width / reshape_ratio
	py = (y_rel / SCALE) * target_height / reshape_ratio
	return px, py

	# Example:
	# model_output = "<x>4823</x><y>3120</y>"
	# point_orig = to_original_pixel(parse_xy(model_output)[0], reshape_ratio)
	```

	## Citation

	```bibtex
	@article{zhang2025phi,
	title={Covering Human Action Space for Computer Use: Data Synthesis and Benchmark},
	author={Zhang, Miaosen and Zhao, Xiaohan and Tan, Zhihong and Huoshen, Zhou and Fan, Yijia and Yang, Yifan and Qiu, Kai and Liu, Bei and Wagle, Justin and Yin, Chenzhong and others},
	journal={arXiv preprint arXiv:2605.12501},
	year={2025}
	}
	```