KV-Ground-8B-BaseGuiOwl1.5-0315
A small GUI grounding model optimized for high-resolution images.
- Type: Vision-Language Model (VLM) for GUI grounding
- Size: 8B parameters
- Input: Image + natural language instruction
- Output: Text
- Fine-tuned from: GUI-Owl-1.5-8B-Instruct
- License: CC BY-NC-SA 4.0
- Developed by: Kingsware & Vocaela AI
Model Description
This model optimizes a small VLM for high-resolution GUI grounding. We synthesize high-quality, high-resolution GUI grounding data and continue post-training GUI-Owl-1.5-8B-Instruct with SFT followed by RFT (GRPO). Without reasoning CoT, it achieves 73.2 on ScreenSpot-Pro, the best among all models across the board. Combined with a zoom-in strategy, it reaches 80.5, the best among all systems. Meanwhile, it maintains excellent performance on regular-resolution tasks, scoring 94.6 on ScreenSpot-V2.
Key recipe:
- Data cleaning with MLLM-as-judge: various public GUI grounding datasets contain roughly 30% label errors, which severely degrade model performance on high-resolution images. We therefore carefully run multiple rounds of data cleaning with an MLLM-as-judge.
- Synthesize high-resolution GUI grounding data.
- Continue post-training the model through SFT and GRPO.
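The cleaning step above can be sketched as a simple filtering loop. Note that `judge` is a hypothetical stand-in for a call to a judge VLM (the actual prompts and pipeline live in the kv-ground repo); multiple rounds matter because a real MLLM judge is stochastic, so a sample must be accepted in every round to survive.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class GroundingSample:
    image_path: str
    instruction: str
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

def clean_dataset(samples: List[GroundingSample],
                  judge: Callable[[GroundingSample], bool],
                  rounds: int = 3) -> List[GroundingSample]:
    """Keep only samples the (possibly stochastic) judge accepts in every round."""
    kept = list(samples)
    for _ in range(rounds):
        kept = [s for s in kept if judge(s)]
    return kept
```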
Benchmark Results
Impact of continual post-training on base models
For a controlled comparison, all numbers in this table are produced or reproduced by us using the same evaluation code in the kv-ground repo; the baseline numbers may therefore differ from those reported in the sources. Please see the Notes section below for the controlled setup.

| Models | ScreenSpot-Pro | ScreenSpot-V2 | OSWorld-G | OSWorld-G-refined | UI-Vision |
|---|---|---|---|---|---|
| Base: GUI-Owl-1.5-8B-Instruct* | 70.5 | 93.5 | 64.7 | 67.9 | 38.3 |
| KV-Ground-8B-BaseGuiOwl1.5-0315* | 73.2 (+2.7) | 94.6 (+1.1) | 68.1 (+3.4) | 70.9 (+3.0) | 39.6 (+1.3) |

These results show:
- Our continual post-training method brings consistent improvements. This exceeded our expectations, since the technical report of GUI-Owl-1.5-8B-Instruct discloses that extensive data synthesis / augmentation targeting high-resolution images had already been applied.
- The high-resolution-oriented training does not harm regular-resolution tasks; instead, it also brings notable gains on OSWorld-G / OSWorld-G-refined.
Comparison with top models (ranked by ScreenSpot-Pro)
We consider all top models from the ScreenSpot-Pro leaderboard and the most recent related technical reports. We compare pure model capability only, and hence exclude multi-step methods such as zoom-in, MVP, or agentic frameworks.
On ScreenSpot-Pro, KV-Ground-8B-BaseGuiOwl1.5-0315 ranks best among all models.
| Models | ScreenSpot-Pro | ScreenSpot-V2 | OSWorld-G | OSWorld-G-refined | UI-Vision |
|---|---|---|---|---|---|
| **Specialized GUI Models** | | | | | |
| KV-Ground-8B-BaseGuiOwl1.5-0315* | 73.2 | 94.6 | 68.1 | 70.9 | 39.6 |
| GUI-Owl-1.5-32B-Instruct* | 71.3 | 95.1 | 67.0 | 71.8 | 37.5 |
| Holo2-235B-A22B | 70.6 | 95.9 | 79.0 | - | - |
| GUI-Owl-1.5-8B-Instruct* | 70.5 | 93.5 | 64.7 | 67.9 | 38.3 |
| UI-Venus-1.5-30B-A3B | 69.6 | 96.2 | 70.6 | 76.4 | 54.7 |
| UI-Venus-1.5-8B | 68.4 | 95.9 | 69.7 | 74.1 | 46.5 |
| MAI-UI-32B | 67.9 | 96.5 | 67.6 | 73.9 | 47.1 |
| KV-Ground-4B-BaseGuiOwl1.5-0228* | 67.0 | 94.1 | 64.2 | 69.5 | 33.3 |
| Holo2-30B-A3B | 66.1 | 94.9 | 76.1 | - | - |
| MAI-UI-8B | 65.8 | 95.2 | 60.1 | 68.6 | 40.7 |
| GUI-Owl-1.5-4B-Instruct* | 65.3 | 92.8 | 61.7 | 66.8 | 30.4 |
| KV-Ground-4B-BaseQw3vl* | 63.2 | 94.6 | 64.0 | 71.2 | 32.6 |
| Step-GUI-8B | 62.6 | 95.1 | 70.0 | - | - |
| Step-GUI-4B | 60.0 | 93.6 | 66.9 | - | - |
| Holo2-8B | 58.9 | 93.2 | 70.1 | - | - |
| Holo2-4B | 57.2 | 93.2 | 69.4 | - | - |
| GUI-Owl-7B | 54.9 | 92.8 | 55.9 | - | - |
| UI-Venus-1.0-7B | 50.8 | 94.1 | 54.6 | 61.7 | 36.8 |
| GTA1-7B | 50.1 | 92.4 | 60.1 | 67.7 | - |
| OpenCUA-7B | 50.0 | 92.3 | 55.3 | - | 29.7 |
| UI-TARS-1.5-7B | 35.7 | 91.6 | 52.8 | 64.2 | - |
| **General VLMs** | | | | | |
| Qwen3-VL-4B* | 59.5 | 93.1 | 63.3 | 71.1 | 30.4 |
| Qwen3-VL-8B | 54.6 | - | 58.2 | - | - |

Comparison with agentic approaches on ScreenSpot-Pro
We list the top 10 entries from the ScreenSpot-Pro leaderboard and related technical reports. KV-Ground-8B-BaseGuiOwl1.5-0315 + Zoom-In ranks best among all systems.

| Model / Agentic | ScreenSpot-Pro |
|---|---|
| KV-Ground-8B-BaseGuiOwl1.5-0315 + Zoom-In* | 80.5 |
| GUI-Owl-1.5-32B-Instruct + Zoom-In | 80.3 |
| Holo2-235B-A22B (Agentic) | 78.5 |
| GUI-Owl-1.5-8B-Instruct + Zoom-In* | 78.0 |
| MAI-UI-32B (MVP) | 77.5 |
| KV-Ground-4B-BaseGuiOwl1.5-0228 + Zoom-In* | 76.4 |
| GUI-Owl-1.5-4B-Instruct + Zoom-In* | 76.1 |
| Holo2-30B-A3B (Agentic) | 75.2 |
| MVP_Qwen3VL-32B | 74.1 |
| MAI-UI-32B (Zoom In) | 73.5 |
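A common zoom-in scheme for grounding is two-pass: predict a coarse point on the full image, crop a region around it, predict again on the crop, and map the result back to full-image coordinates. The sketch below shows only that coordinate bookkeeping; `predict` is a hypothetical call into the grounding model, and the crop factor is an illustrative choice, not necessarily the one used in our runs.

```python
from typing import Callable, Tuple

Region = Tuple[int, int, int, int]  # (x0, y0, x1, y1)

def zoom_in_ground(predict: Callable[[Region], Tuple[int, int]],
                   image_size: Tuple[int, int],
                   crop_factor: float = 0.25) -> Tuple[int, int]:
    """Two-pass grounding: coarse prediction on the full image, then a
    refined prediction on a crop centred at the coarse point, mapped back
    to full-image coordinates."""
    W, H = image_size
    cx, cy = predict((0, 0, W, H))            # pass 1: full image
    cw, ch = int(W * crop_factor), int(H * crop_factor)
    x0 = min(max(cx - cw // 2, 0), W - cw)    # clamp crop inside the image
    y0 = min(max(cy - ch // 2, 0), H - ch)
    rx, ry = predict((x0, y0, x0 + cw, y0 + ch))  # pass 2: zoomed crop
    return x0 + rx, y0 + ry                   # map back to full image
```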
Notes:
- By default, numbers are taken from each source. Numbers produced by us are marked with *, and may differ from those reported in the sources.
- Models are known to be sensitive to the exact prompt design. Since all runs produced by us use the Qwen3-VL backbone, for a fair and simple comparison the same prompt structure `system -> user-image -> user-instruction` is used throughout, with the same system message: the default Qwen3-VL computer-use prompt adopted by the ScreenSpot-Pro leaderboard. For OSWorld-G and OSWorld-G-refined, a minor modification is made to support the refusal setting.
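The `system -> user-image -> user-instruction` layout above can be sketched in standard chat-message form. This is an illustrative helper, not code from our evaluation harness; the image entry shape follows the usual Qwen multimodal message convention.

```python
from typing import List, Dict, Any

def build_messages(system_prompt: str,
                   image: str,
                   instruction: str) -> List[Dict[str, Any]]:
    """Assemble the fixed prompt layout: system message, then the
    screenshot as a user turn, then the instruction as a user turn."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [{"type": "image", "image": image}]},
        {"role": "user", "content": [{"type": "text", "text": instruction}]},
    ]
```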
Quickstart
The model keeps exactly the same architecture and configuration as GUI-Owl-1.5-8B-Instruct (which inherits from Qwen3-VL-8B-Instruct), so usage is identical. For detailed examples and the grounding prompt, please check out the benchmark evaluation code in this repo.
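After generation, the predicted click point must be parsed out of the model's text response. The exact answer format is defined by the grounding prompt in this repo; the helper below is a hypothetical post-processing sketch that assumes a plain `(x, y)` style coordinate appears somewhere in the output.

```python
import re
from typing import Optional, Tuple

def parse_click_point(text: str) -> Optional[Tuple[int, int]]:
    """Extract the first "(x, y)" integer pair from the model output,
    or None if no such pattern is present."""
    m = re.search(r"\((\d+)\s*,\s*(\d+)\)", text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))
```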