KV-Ground-4B-BaseQw3vl

A small GUI grounding model optimized for high-resolution images.

  • Type: Vision-Language Model (VLM) for GUI grounding
  • Size: 4B parameters
  • Input: Image + natural language instruction
  • Output: Text
  • Fine-tuned from: Qwen3-VL-4B-Instruct
  • License: CC BY-NC-SA 4.0
  • Developed by: Kingsware & Vocaela AI


Model Description

This model is developed to optimize small VLMs for high-resolution GUI grounding. We synthesize high-quality, high-resolution GUI grounding data and continue post-training Qwen3-VL-4B-Instruct with SFT followed by RFT (GRPO). Without a reasoning CoT, it achieves 63.2 on ScreenSpot-Pro, making it one of the best models in the 4B range, while maintaining excellent performance on regular-resolution tasks with 94.6 on ScreenSpot-V2.

Key recipe:

  • Data cleaning with MLLM-as-judge: many public GUI grounding datasets contain roughly 30% label errors, which severely degrade model performance on high-resolution images. We therefore carefully perform multiple rounds of data cleaning using an MLLM as judge.
  • Synthesize high-resolution GUI grounding data
  • Continue post-training the model through SFT and GRPO
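The multi-round MLLM-as-judge filtering above can be sketched as follows. The judge interface and the sample format (`instruction`, `crop`) are assumptions for illustration only, not the authors' actual pipeline:

```python
# Illustrative sketch of multi-round MLLM-as-judge data cleaning.
# `judge` stands in for a call to a strong multimodal model; the sample
# format is an assumption made for this sketch.

def judge_sample(judge, instruction, crop_image):
    """Ask the judge whether the cropped UI region matches the instruction."""
    prompt = (
        "Does the UI element in the image match this instruction?\n"
        f"Instruction: {instruction}\nAnswer yes or no."
    )
    return judge(prompt, crop_image).strip().lower().startswith("yes")

def clean_dataset(judge, samples, rounds=3):
    """Keep only samples the judge accepts in every filtering round."""
    kept = list(samples)
    for _ in range(rounds):
        kept = [s for s in kept
                if judge_sample(judge, s["instruction"], s["crop"])]
    return kept
```

Running several rounds lets borderline samples be re-checked, which is useful when the judge itself is noisy.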

Benchmark Results

  • Impact of continue post-training on base models

    For a controlled comparison, all numbers here were reproduced by us using the same evaluation code in the kv-ground repo, so the baseline numbers may differ from the original sources. Please see the Notes section below for the controlled setup.

    | Models | ScreenSpot-Pro | ScreenSpot-V2 | OSWorld-G | OSWorld-G-refined | UI-Vision |
    |---|---|---|---|---|---|
    | Base: Qwen3-VL-4B-Instruct* | 59.5 | 93.1 | 63.3 | 71.1 | 30.4 |
    | KV-Ground-4B-BaseQw3vl* | 63.2 (+3.7) | 94.6 (+1.5) | 64.0 (+0.7) | 71.2 (+0.1) | 32.6 (+2.2) |

    The results show:

    • Our continued post-training method brings consistent improvements
    • The high-resolution-optimized training does not harm regular-resolution tasks
  • Comparison with top models under 8B (ranked by ScreenSpot-Pro)

    We consider the top models under 8B on the ScreenSpot-Pro leaderboard, plus the most recent related technical reports.

    | Models | ScreenSpot-Pro | ScreenSpot-V2 | OSWorld-G | OSWorld-G-refined | UI-Vision |
    |---|---|---|---|---|---|
    | **Specialized GUI Models** | | | | | |
    | UI-Venus-1.5-8B | 68.4 | 93.2 | 69.4 | - | - |
    | KV-Ground-4B-BaseGuiOwl1.5* | 66.5 | 94.3 | 62.8 | 69.1 | 32.2 |
    | MAI-UI-8B | 65.8 | 95.2 | 60.1 | 68.6 | 40.7 |
    | GUI-Owl-1.5-4B-Instruct* | 65.3 | 92.8 | 61.7 | 66.8 | 30.4 |
    | KV-Ground-4B-BaseQw3vl* | 63.2 | 94.6 | 64.0 | 71.2 | 32.6 |
    | Step-GUI-8B | 62.6 | 95.1 | 70.0 | - | - |
    | Step-GUI-4B | 60.0 | 93.6 | 66.9 | - | - |
    | Holo2-8B | 58.9 | 93.2 | 70.1 | - | - |
    | Holo2-4B | 57.2 | 93.2 | 69.4 | - | - |
    | GUI-Owl-7B | 54.9 | 92.8 | 55.9 | - | - |
    | UI-Venus-1.0-7B | 50.8 | 94.1 | 54.6 | 61.7 | 36.8 |
    | GTA1-7B | 50.1 | 92.4 | 60.1 | 67.7 | - |
    | OpenCUA-7B | 50.0 | 92.3 | 55.3 | - | 29.7 |
    | UI-TARS-1.5-7B | 35.7 | 91.6 | 52.8 | 64.2 | - |
    | **General VLMs** | | | | | |
    | Qwen3-VL-4B* | 59.5 | 93.1 | 63.3 | 71.1 | 30.4 |
    | Qwen3-VL-8B | 54.6 | - | 58.2 | - | - |

Notes:

  • By default, numbers are copied from each source.
  • * indicates results produced by us.
  • For all runs produced by us, the same prompt structure of system -> user-image -> user-instruction is used for fair comparison. The same system message is also used: the default Qwen3-VL computer-use prompt, which is also adopted by the ScreenSpot-Pro leaderboard. For OSWorld-G and OSWorld-G-refined, a minor modification is made to instruct the model about the refusal setting.
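As a concrete illustration, the three-turn layout above can be assembled as a Qwen-style chat message list. The system text below is a placeholder, not the exact computer-use prompt used in the evaluation runs:

```python
# Build the system -> user-image -> user-instruction message layout.
# SYSTEM_TEXT is a placeholder; the real runs use the default Qwen3-VL
# computer-use prompt.

SYSTEM_TEXT = "You are a GUI grounding assistant."  # placeholder

def build_messages(image_path, instruction, system_text=SYSTEM_TEXT):
    return [
        {"role": "system",
         "content": [{"type": "text", "text": system_text}]},
        {"role": "user",
         "content": [{"type": "image", "image": image_path}]},
        {"role": "user",
         "content": [{"type": "text", "text": instruction}]},
    ]
```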

Quickstart

The model keeps exactly the same architecture and configs as Qwen3-VL-4B-Instruct, so the usage is identical. For detailed examples and the grounding prompt, please check out the benchmark evaluation code in the kv-ground repo.
