---
base_model:
- microsoft/Phi-3.5-vision-instruct
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- GUI
- Agent
- Grounding
- CUA
---

# Microsoft Phi-Ground-Any-4B
🤖 HomePage | 📄 Paper | 💻 Code | 😊 Model | 😊 Eval data
**Phi-Ground-Any-4B** is a foundational grounding model for Computer Use Agents (CUAs), introduced in the paper "[Covering Human Action Space for Computer Use: Data Synthesis and Benchmark](https://arxiv.org/abs/2605.12501)". It is fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1680x1008. The model excels at complex interactions across five modalities: GUI, text, table, canvas, and natural image, and supports a variety of actions including clicking, dragging, and drawing.

### Main results



### Usage

You can check your installed `transformers` version with `pip list | grep transformers`. The required packages include:

```bash
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
```

### Input Formats

The model requires a strict input format, including a fixed image resolution, instruction-first ordering, and a specific system prompt.

#### Input Preprocessing

```python
from PIL import Image

def process_image(img):
    # Phi-Ground-Anything uses a larger 5x3-tile canvas (1680 x 1008).
    target_width, target_height = 336 * 5, 336 * 3
    img_ratio = img.width / img.height
    target_ratio = target_width / target_height
    # Fit the image inside the canvas while preserving its aspect ratio.
    if img_ratio > target_ratio:
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        new_height = target_height
        new_width = int(new_height * img_ratio)
    reshape_ratio = new_width / img.width
    img = img.resize((new_width, new_height), Image.LANCZOS)
    # Pad the remainder of the canvas with white, anchored at the top-left.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    paste_position = (0, 0)
    new_img.paste(img, paste_position)
    return new_img, reshape_ratio

# Phi-Ground-Anything takes the user instruction directly and is trained to emit the click point as #
```
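Because the resized screenshot is pasted at the canvas's top-left corner, a point the model predicts on the 1680x1008 canvas can be mapped back to original screenshot coordinates by dividing by the resize ratio. A minimal sketch of this inverse mapping, where the helper name `canvas_to_original`, the 1920x1080 screenshot size, and the predicted point are illustrative assumptions, not part of the model card:

```python
# Hypothetical helper: invert the letterboxing done by process_image.
# Points on the padded 1680x1008 canvas are divided by the resize ratio.
def canvas_to_original(x, y, orig_w, orig_h):
    target_w, target_h = 336 * 5, 336 * 3
    img_ratio = orig_w / orig_h
    if img_ratio > target_w / target_h:
        reshape_ratio = target_w / orig_w   # width-limited fit
    else:
        reshape_ratio = target_h / orig_h   # height-limited fit
    return x / reshape_ratio, y / reshape_ratio

# Example (assumed values): a 1920x1080 screenshot is resized by 0.875
# to 1680x945 and padded to 1680x1008, so a predicted canvas point
# (840, 472) maps back into the original screenshot as shown below.
print(canvas_to_original(840, 472, 1920, 1080))
```

This keeps the agent-side action coordinates in the screen's native pixel space while the model only ever sees the fixed-resolution canvas.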