KV-Ground-4B-BaseQw3vl
A small GUI grounding model optimized for high-resolution images.
- Type: Vision-Language Model (VLM) for GUI grounding
- Size: 4B parameters
- Input: Image + natural language instruction
- Output: Text
- Fine-tuned from: Qwen3-VL-4B-Instruct
- License: CC BY-NC-SA 4.0
- Developed by: Kingsware & Vocaela AI
Model Description
This model is developed to optimize small VLM models for high-resolution GUI grounding. We synthesize high-quality high-resolution GUI grounding data, and continue post-training Qwen3-VL-4B-Instruct with SFT followed by RFT (GRPO). Without reasoning CoT, it achieves 63.2 on ScreenSpot-Pro, becoming one of the best model at 4B range. Meanwhile, it maintains excellent performance on regular-resolution tasks with 94.6 on ScreenSpot-V2.
Key receipe:
- Data clean by MLLM as judge: various public GUI grounding datasets are of ~30% errors which highly degenerate model performance on high-resolution images. And hence we carefuly do multiple-rounds of data cleaning by MLLM-as-judge.
- Synthesize high-resolution GUI grounding data
- Continue post-training model through SFT and GRPO
Benchmark Results
Impact of continue post-training on base models
For the purpose of controlled comparison, all these numbers are re-/produced by us, using the same evaluation code in the repo kv-ground. The baseline numbers may be different from the sources. Please see below
notesection for the controlled setup.Models ScreenSpot-Pro ScreenSpot-V2 OSWorld-G OSWorld-G-refined UI-Vision Base: Qwen3-VL-4B-Instruct* 59.5 93.1 63.3 71.1 30.4 KV-Ground-4B-BaseQw3vl* 63.2 (+3.7) 94.6 (+1.5) 64.0 (0.7) 71.2 (+0.1) 32.6 (+2.2) The results tell us:
- Our continuous post-training method brings in consistent improvements
- The high-resolution optimized training doesn't harm regular resolution tasks
Comparision with top models under 8B (ranked by ScreenSpot-Pro)
We consider the top models under 8B from ScreenSpot-Pro leaderboard and related most recent technical reports.
Models ScreenSpot-Pro ScreenSpot-V2 OSWorld-G OSWorld-G-refined UI-Vision Specialized GUI Models UI-Venus-1.5-8B 68.4 93.2 69.4 - - KV-Ground-4B-BaseGuiOwl1.5* 66.5 94.3 62.8 69.1 32.2 MAI-UI-8B 65.8 95.2 60.1 68.6 40.7 GUI-Owl-1.5-4B-Instruct* 65.3 92.8 61.7 66.8 30.4 KV-Ground-4B-BaseQw3vl* 63.2 94.6 64.0 71.2 32.6 Step-GUI-8B 62.6 95.1 70.0 - - Step-GUI-4B 60.0 93.6 66.9 - - Holo2-8B 58.9 93.2 70.1 - - Holo2-4B 57.2 93.2 69.4 - - GUI-Owl-7B 54.9 92.8 55.9 - - OpenCUA-7B 50.0 92.3 55.3 - 29.7 UI-Venus-1.0-7B 50.8 94.1 54.6 61.7 36.8 GTA1-7B 50.1 92.4 60.1 67.7 - UI-TARS-1.5-7B 35.7 91.6 52.8 64.2 - General VLMs Qwen3-VL-4B* 59.5 93.1 63.3 71.1 30.4 Qwen3-VL-8B 54.6 - 58.2 - -
Notes:
- By default numbers are copied from each source
*indicates the results produced by us- For all the runs produced by us, for fair comparison, the same prompt structure of
system -> user-image -> user-instructis used. Similarly, the same system message is used, which is the default Qwen3-VL computer-use format prompt and also adopted by the ScreenSpot-Pro leaderboard. For OSWorld-G and OSWorld-G-refined, minor modification is made to instruct the refusal setting.
Quickstart
The model keeps the exact same architecture and configs of Qwen3-VL-4B-Instruct and hence the usage is the same. For detail examples and the grounding prompt, please checkout the benchmark evaluation code in the kv-ground repo.
- Downloads last month
- 98