KV-Ground-8B-BaseGuiOwl1.5-0315

A small GUI grounding model optimized for high-resolution images.

  • Type: Vision-Language Model (VLM) for GUI grounding
  • Size: 8B parameters
  • Input: Image + natural language instruction
  • Output: Text
  • Fine-tuned from: GUI-Owl-1.5-8B-Instruct
  • License: CC BY-NC-SA 4.0
  • Developed by: Kingsware & Vocaela AI


Model Description

This model optimizes a small VLM for high-resolution GUI grounding. We synthesize high-quality, high-resolution GUI grounding data and continue post-training GUI-Owl-1.5-8B-Instruct with SFT followed by RFT (GRPO). Without reasoning CoT, it achieves 73.2 on ScreenSpot-Pro, the best model across the board. When combined with a zoom-in strategy, it achieves 80.5, the best system across the board. Meanwhile, it maintains excellent performance on regular-resolution tasks, with 94.6 on ScreenSpot-V2.

Key recipe:

  • Data cleaning with MLLM-as-judge: public GUI grounding datasets contain roughly 30% erroneous labels, which strongly degrades model performance on high-resolution images. We therefore perform multiple rounds of data cleaning using an MLLM as judge.
  • Synthesizing high-resolution GUI grounding data
  • Continuing post-training of the model through SFT and GRPO
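The multi-round MLLM-as-judge cleaning step can be sketched as a simple repeated filter. This is a hypothetical interface, not the actual pipeline: `judge` stands in for an MLLM call that checks a sample's point/box label against its screenshot, and the acceptance criteria are assumptions.

```python
from typing import Callable, Iterable


def clean_dataset(
    samples: Iterable[dict],
    judge: Callable[[dict], bool],
    rounds: int = 3,
) -> list[dict]:
    """Multi-round filtering: a sample survives only if the judge
    accepts it in every round (hypothetical interface; in practice
    `judge` wraps an MLLM-as-judge call on image + label)."""
    kept = list(samples)
    for _ in range(rounds):
        kept = [s for s in kept if judge(s)]
    return kept
```

With a noisy judge, repeating the filter over several rounds lowers the chance that a mislabeled sample slips through every pass.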

Benchmark Results

  • Impact of continual post-training on base models

For a controlled comparison, all numbers in this table are produced or reproduced by us, using the same evaluation code in the kv-ground repo. The baseline numbers may therefore differ from the original sources. See the Notes section below for the controlled setup.

    | Models | ScreenSpot-Pro | ScreenSpot-V2 | OSWorld-G | OSWorld-G-refined | UI-Vision |
    |---|---|---|---|---|---|
    | Base: GUI-Owl-1.5-8B-Instruct* | 70.5 | 93.5 | 64.7 | 67.9 | 38.3 |
    | KV-Ground-8B-BaseGuiOwl1.5-0315* | 73.2 (+2.7) | 94.6 (+1.1) | 68.1 (+3.4) | 70.9 (+3.0) | 39.6 (+1.3) |

    The results show:

    • Our continual post-training brings consistent improvements. This exceeded our expectations, since the GUI-Owl-1.5-8B-Instruct technical report discloses that extensive data synthesis and augmentation targeted at high-resolution images had already been applied.
    • The high-resolution-oriented training does not harm regular-resolution tasks. On the contrary, it also brings notable gains on OSWorld-G / OSWorld-G-refined.
  • Comparison with top models (ranked by ScreenSpot-Pro)

    We consider all top models from the ScreenSpot-Pro leaderboard and the most recent related technical reports. We compare pure model capability only, and hence exclude multi-step methods such as zoom-in, MVP, or agentic frameworks.

    On ScreenSpot-Pro, KV-Ground-8B-BaseGuiOwl1.5-0315 ranks best among all models.

    | Models | ScreenSpot-Pro | ScreenSpot-V2 | OSWorld-G | OSWorld-G-refined | UI-Vision |
    |---|---|---|---|---|---|
    | **Specialized GUI Models** | | | | | |
    | KV-Ground-8B-BaseGuiOwl1.5-0315* | 73.2 | 94.6 | 68.1 | 70.9 | 39.6 |
    | GUI-Owl-1.5-32B-Instruct* | 71.3 | 95.1 | 67.0 | 71.8 | 37.5 |
    | Holo2-235B-A22B | 70.6 | 95.9 | 79.0 | - | - |
    | GUI-Owl-1.5-8B-Instruct* | 70.5 | 93.5 | 64.7 | 67.9 | 38.3 |
    | UI-Venus-1.5-30B-A3B | 69.6 | 96.2 | 70.6 | 76.4 | 54.7 |
    | UI-Venus-1.5-8B | 68.4 | 95.9 | 69.7 | 74.1 | 46.5 |
    | MAI-UI-32B | 67.9 | 96.5 | 67.6 | 73.9 | 47.1 |
    | KV-Ground-4B-BaseGuiOwl1.5-0228* | 67.0 | 94.1 | 64.2 | 69.5 | 33.3 |
    | Holo2-30B-A3B | 66.1 | 94.9 | 76.1 | - | - |
    | MAI-UI-8B | 65.8 | 95.2 | 60.1 | 68.6 | 40.7 |
    | GUI-Owl-1.5-4B-Instruct* | 65.3 | 92.8 | 61.7 | 66.8 | 30.4 |
    | KV-Ground-4B-BaseQw3vl* | 63.2 | 94.6 | 64.0 | 71.2 | 32.6 |
    | Step-GUI-8B | 62.6 | 95.1 | 70.0 | - | - |
    | Step-GUI-4B | 60.0 | 93.6 | 66.9 | - | - |
    | Holo2-8B | 58.9 | 93.2 | 70.1 | - | - |
    | Holo2-4B | 57.2 | 93.2 | 69.4 | - | - |
    | GUI-Owl-7B | 54.9 | 92.8 | 55.9 | - | - |
    | OpenCUA-7B | 50.0 | 92.3 | 55.3 | - | 29.7 |
    | UI-Venus-1.0-7B | 50.8 | 94.1 | 54.6 | 61.7 | 36.8 |
    | GTA1-7B | 50.1 | 92.4 | 60.1 | 67.7 | - |
    | UI-TARS-1.5-7B | 35.7 | 91.6 | 52.8 | 64.2 | - |
    | **General VLMs** | | | | | |
    | Qwen3-VL-4B* | 59.5 | 93.1 | 63.3 | 71.1 | 30.4 |
    | Qwen3-VL-8B | 54.6 | - | 58.2 | - | - |
  • Comparison with agentic approaches on ScreenSpot-Pro

    We list the top 10 systems reported on the ScreenSpot-Pro leaderboard and related technical reports. KV-Ground-8B-BaseGuiOwl1.5-0315 + Zoom-In ranks best among all systems.

    | Model / Agentic | ScreenSpot-Pro |
    |---|---|
    | KV-Ground-8B-BaseGuiOwl1.5-0315 + Zoom-In* | 80.5 |
    | GUI-Owl-1.5-32B-Instruct + Zoom-In | 80.3 |
    | Holo2-235B-A22B (Agentic) | 78.5 |
    | GUI-Owl-1.5-8B-Instruct + Zoom-In* | 78.0 |
    | MAI-UI-32B (MVP) | 77.5 |
    | KV-Ground-4B-BaseGuiOwl1.5-0228 + Zoom-In* | 76.4 |
    | GUI-Owl-1.5-4B-Instruct + Zoom-In* | 76.1 |
    | Holo2-30B-A3B (Agentic) | 75.2 |
    | MVP_Qwen3VL-32B | 74.1 |
    | MAI-UI-32B (Zoom In) | 73.5 |
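    The zoom-in strategy is, in essence, a two-pass procedure: ground on the full screenshot, crop a window around the first prediction, re-ground on the crop, and map the refined point back to full-image coordinates. A minimal sketch of the coordinate bookkeeping (the crop size and the single-window choice are assumptions, not the exact pipeline):

    ```python
    def zoom_in_window(image_size, first_pass, crop_size=1000):
        """Given a first-pass click point, compute a crop window around it
        (clamped to the image bounds) and a function mapping crop-local
        coordinates from the second pass back to the full image."""
        W, H = image_size
        x, y = first_pass
        left = min(max(0, x - crop_size // 2), max(0, W - crop_size))
        top = min(max(0, y - crop_size // 2), max(0, H - crop_size))
        box = (left, top, min(W, left + crop_size), min(H, top + crop_size))

        def to_full(cx, cy):
            # Translate a point predicted on the crop back to full-image pixels.
            return (left + cx, top + cy)

        return box, to_full
    ```

    The second model call sees only the crop at a much higher effective resolution, which is why zoom-in helps most on ScreenSpot-Pro's very large screenshots.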

Notes:

  • By default, numbers are taken from each source. Numbers produced by us are marked with *, and may differ from the numbers reported by the sources.
  • Models are usually sensitive to the exact prompt design. Since all runs produced by us use the Qwen3-VL backbone, for a fair and simple comparison we use the same prompt structure of system -> user-image -> user-instruct throughout, with the same system message: the default Qwen3-VL computer-use prompt adopted by the ScreenSpot-Pro leaderboard. For OSWorld-G and OSWorld-G-refined, a minor modification is made to instruct the refusal setting.
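  The system -> user-image -> user-instruct structure above can be written in the standard chat-messages schema; a minimal sketch (the system text is the Qwen3-VL computer-use prompt, shown here only as a placeholder):

  ```python
  def build_messages(system_prompt: str, image_path: str, instruction: str) -> list[dict]:
      """Assemble messages in the order system -> user-image -> user-instruct,
      using the common chat-template content schema (an assumption; the exact
      field names should be taken from the evaluation code)."""
      return [
          {"role": "system", "content": system_prompt},
          {
              "role": "user",
              "content": [
                  {"type": "image", "image": image_path},
                  {"type": "text", "text": instruction},
              ],
          },
      ]
  ```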

Quickstart

The model keeps exactly the same architecture and configs as GUI-Owl-1.5-8B-Instruct (inherited from Qwen3-VL-8B-Instruct), so usage is identical. For detailed examples and the grounding prompt, please check out the benchmark evaluation code in this repo.
