---
license: mit
base_model:
- microsoft/Phi-3.5-vision-instruct
tags:
- GUI
- Agent
- Grounding
- CUA
---

# Microsoft Phi-Ground-Any-4B

🤖 HomePage | 📄 Paper | 📄 Arxiv | 😊 Model | 😊 Eval data

![overview](docs/images/intro.png)

**Phi-Ground-Any-4B** is a member of the Phi-Ground model family, finetuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1680x1008.

### Main results

![overview](docs/images/r1.png)

### Usage

The current `transformers` version can be verified with `pip list | grep transformers`. Examples of required packages:

```
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
```

### Input Formats

The model requires a strict input format, including a fixed image resolution, instruction-first ordering, and the system prompt.

Input preprocessing:

```python
from PIL import Image


def process_image(img):
    # Phi-Ground-Anything uses a larger 5x3-tile canvas (1680 x 1008).
    target_width, target_height = 336 * 5, 336 * 3
    img_ratio = img.width / img.height
    target_ratio = target_width / target_height
    # Resize to fit inside the canvas while preserving the aspect ratio.
    if img_ratio > target_ratio:
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        new_height = target_height
        new_width = int(new_height * img_ratio)
    reshape_ratio = new_width / img.width
    img = img.resize((new_width, new_height), Image.LANCZOS)
    # Pad with white to the fixed 1680x1008 canvas, anchored at the top-left.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    paste_position = (0, 0)
    new_img.paste(img, paste_position)
    return new_img, reshape_ratio


# Phi-Ground-Anything takes the user instruction directly (no "describe the
# element" wrapper) and is trained to emit the click point as
# <x>VALUE</x><y>VALUE</y>
# where VALUE is a relative coordinate in [0, 10000] over the padded canvas
# (i.e., divide by 10000 and multiply by target_width / target_height to get
# pixel coords in the padded image, then divide by reshape_ratio to recover
# coords in the ORIGINAL image).
instruction = ""
prompt = """<|user|>
{instruction}<|image_1|>
<|end|>
<|assistant|>""".format(instruction=instruction)

image_path = ""
original_image = Image.open(image_path).convert("RGB")
image, reshape_ratio = process_image(original_image)

# ---------------------------------------------------------------------------
# Example: parse the model output and recover original-image coordinates.
# ---------------------------------------------------------------------------
import re

target_width, target_height = 336 * 5, 336 * 3
SCALE = 10000.0

x_pattern = re.compile(r"<x>\s*(-?\d+(?:\.\d+)?)\s*</x>")
y_pattern = re.compile(r"<y>\s*(-?\d+(?:\.\d+)?)\s*</y>")


def parse_xy(model_output: str):
    # Extract every <x>...</x> / <y>...</y> pair emitted by the model.
    xs = [float(v) for v in x_pattern.findall(model_output)]
    ys = [float(v) for v in y_pattern.findall(model_output)]
    return list(zip(xs, ys))


def to_original_pixel(rel_xy, reshape_ratio: float):
    # Map a [0, 10000] relative coordinate on the padded canvas back to a
    # pixel coordinate in the original screenshot.
    x_rel, y_rel = rel_xy
    px = (x_rel / SCALE) * target_width / reshape_ratio
    py = (y_rel / SCALE) * target_height / reshape_ratio
    return px, py


# model_output = "<x>4823</x><y>3120</y>"
# point_orig = to_original_pixel(parse_xy(model_output)[0], reshape_ratio)
```

You can then run inference with the Hugging Face model (via `transformers`) or with [vllm](https://github.com/vllm-project/vllm); hedged sketches of both follow.
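Below is a minimal `transformers` sketch, assuming this checkpoint loads the same way as its base model `microsoft/Phi-3.5-vision-instruct` (remote code, `AutoProcessor`). The repository id, dtype, attention backend, and generation settings are illustrative placeholders, not an official recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Hypothetical repository id for this checkpoint; substitute the actual one.
model_id = "microsoft/Phi-Ground-Any-4B"

# Assumes the checkpoint ships the same remote code and processor as
# microsoft/Phi-3.5-vision-instruct.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",  # use "eager" if flash_attn is unavailable
).cuda()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# `prompt` and `image` come from the preprocessing snippet above.
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")
generate_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    eos_token_id=processor.tokenizer.eos_token_id,
)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
model_output = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]

# Recover a pixel coordinate in the original screenshot.
point_orig = to_original_pixel(parse_xy(model_output)[0], reshape_ratio)
print(model_output, point_orig)
```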
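For vLLM, a similarly hedged sketch using the offline multimodal API (again with a placeholder model id; verify against vLLM's Phi-3.5-vision support whether its image preprocessing matches the fixed 1680x1008 canvas, or pass the pre-padded image as done here):

```python
from vllm import LLM, SamplingParams

# Placeholder model id; pass the actual checkpoint path or hub id.
llm = LLM(model="microsoft/Phi-Ground-Any-4B", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# `prompt` and `image` come from the preprocessing snippet above.
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=sampling_params,
)
model_output = outputs[0].outputs[0].text
point_orig = to_original_pixel(parse_xy(model_output)[0], reshape_ratio)
```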