---
tags:
- multimodal
---

<p align="center">
  <img src="images/logo.png"/>
</p>

<p align="center">
  <a href="https://huggingface.co/tencent/POINTS-GUI-G">
    <img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace">
  </a>
  <a href="https://github.com/Tencent/POINTS-GUI">
    <img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="GitHub Code">
  </a>
  <a href="https://huggingface.co/papers/2602.06391">
    <img src="https://img.shields.io/badge/Paper-POINTS--GUI--G-d4333f?logo=arxiv&logoColor=white&colorA=cccccc&colorB=d4333f&style=flat" alt="Paper">
  </a>
  <a href="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views">
    <img src="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views" alt="view">
  </a>
</p>

## News

- 🔜 <b>Upcoming:</b> The <b>End-to-End GUI Agent Model</b> is under active development and will be released in a subsequent update. Stay tuned!
- 🚀 2026.02.06: We are happy to present <b>POINTS-GUI-G</b>, our specialized GUI grounding model. To facilitate reproducible evaluation, we provide comprehensive scripts and guidelines in our <a href="https://github.com/Tencent/POINTS-GUI/tree/main/evaluation">GitHub Repository</a>.

## Introduction

POINTS-GUI-G-8B is a specialized GUI grounding model introduced in the paper [POINTS-GUI-G: GUI-Grounding Journey](https://huggingface.co/papers/2602.06391).

1. **State-of-the-Art Performance**: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks, scoring 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.

2. **Full-Stack Mastery**: Unlike many current GUI agents that build upon models already possessing strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up on POINTS-1.5. We have mastered the complete technical pipeline, proving that a strong GUI specialist can be built from a general-purpose base model through targeted optimization.

3. **Refined Data Engineering**: We build a unified data pipeline that (1) standardizes all coordinates to a [0, 1] range and reformats heterogeneous tasks into a single "locate UI element" formulation, (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases. A minimal sketch of the unified format is shown after this list.

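To make the unified format concrete, here is an illustrative sketch of the conversion step described in point (3). The helper `to_unified_sample` and the sample values are hypothetical, not the authors' released pipeline; only the coordinate convention (normalized [0, 1], three decimal places) comes from the description above.

```python
# Hypothetical helper, for illustration only (not the released pipeline):
# convert a raw annotation with a pixel-space bounding box into the unified
# "locate UI element" format with coordinates normalized to [0, 1] and
# rounded to three decimal places.
def to_unified_sample(instruction: str,
                      bbox_px: tuple[int, int, int, int],
                      width: int, height: int) -> dict:
    x0, y0, x1, y1 = bbox_px
    # Center of the box, normalized by the screenshot size.
    cx = round((x0 + x1) / 2 / width, 3)
    cy = round((y0 + y1) / 2 / height, 3)
    return {'instruction': instruction, 'target': f'({cx}, {cy})'}

print(to_unified_sample("Click the 'Login' button", (880, 500, 1040, 580), 1920, 1080))
# {'instruction': "Click the 'Login' button", 'target': '(0.5, 0.5)'}
```
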
## Results

We evaluate POINTS-GUI-G-8B on four widely used GUI grounding benchmarks: ScreenSpot-v2, ScreenSpot-Pro, OSWorld-G, and UI-Vision. The figure below summarizes our results compared with existing open-source and proprietary baselines.

![POINTS-GUI-G benchmark results](images/performance.png)

## Getting Started

### Run with Transformers

Please first install [WePOINTS](https://github.com/WePOINTS/WePOINTS) using the following commands:

```sh
git clone https://github.com/WePOINTS/WePOINTS.git
cd ./WePOINTS
pip install -e .
```

Then run inference with the script below:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor
import torch

# System prompt for point grounding: the model returns the center (x, y) of
# the target element.
system_prompt_point = (
    'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the point (x, y) representing the center of the target element\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x, y) without any additional text\n'
)
# System prompt for box grounding: the model returns the bounding box
# (x0, y0, x1, y1) of the target element.
system_prompt_bbox = (
    'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
)

system_prompt = system_prompt_point  # or system_prompt_bbox for box output
user_prompt = "Click the 'Login' button"  # replace with your instruction
image_path = '/path/to/your/local/image'
model_path = 'tencent/POINTS-GUI-G'

# Load the model, tokenizer, and image processor.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             dtype=torch.bfloat16,
                                             device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
image_processor = Qwen2VLImageProcessor.from_pretrained(model_path)

# Build the chat messages: a system turn carrying the grounding prompt and a
# user turn carrying the screenshot plus the instruction.
content = [
    dict(type='image', image=image_path),
    dict(type='text', text=user_prompt)
]
messages = [
    {
        'role': 'system',
        'content': [dict(type='text', text=system_prompt)]
    },
    {
        'role': 'user',
        'content': content
    }
]

# Greedy decoding keeps the coordinate output deterministic.
generation_config = {
    'max_new_tokens': 2048,
    'do_sample': False
}
response = model.chat(
    messages,
    tokenizer,
    image_processor,
    generation_config
)
print(response)
```
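
With the point prompt, the model replies with a plain string such as `(0.512, 0.734)`, following the output format the prompt requests. The snippet below is a minimal sketch for converting that string back to pixel coordinates on the original screenshot; `parse_point` is a hypothetical helper, not part of the official repository:

```python
import re

# Hypothetical helper, illustration only: parse the "(x, y)" string requested
# by the point prompt and map it back to pixel coordinates for clicking.
def parse_point(response: str, width: int, height: int) -> tuple[int, int]:
    match = re.match(r'\(([\d.]+),\s*([\d.]+)\)', response.strip())
    if match is None:
        raise ValueError(f'Unexpected model output: {response!r}')
    x, y = float(match.group(1)), float(match.group(2))
    return int(x * width), int(y * height)

print(parse_point('(0.512, 0.734)', 1920, 1080))  # (983, 792)
```
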
## Citation

If you use this model in your work, please cite the following papers:

```bibtex
@article{zhao2026pointsguigguigroundingjourney,
  title   = {POINTS-GUI-G: GUI-Grounding Journey},
  author  = {Zhao, Zhongyin and Liu, Yuan and Liu, Yikun and Wang, Haicheng and Tian, Le and Zhou, Xiao and You, Yangxiu and Yu, Zilin and Yu, Yang and Zhou, Jie},
  journal = {arXiv preprint arXiv:2602.06391},
  year    = {2026}
}

@inproceedings{liu2025points,
  title     = {POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion},
  author    = {Liu, Yuan and Zhao, Zhongyin and Tian, Le and Wang, Haicheng and Ye, Xubing and You, Yangxiu and Yu, Zilin and Wu, Chuhan and Xiao, Zhou and Yu, Yang and others},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages     = {1576--1601},
  year      = {2025}
}
```