KV-Ground-8B-BaseGuiOwl1.5-0315

A small GUI grounding model optimized for high-resolution images.

  • Type: Vision-Language Model (VLM) for GUI grounding
  • Size: 8B parameters
  • Input: Image + natural language instruction
  • Output: Text
  • Fine-tuned from: GUI-Owl-1.5-8B-Instruct
  • License: CC BY-NC-SA 4.0
  • Developed by: Kingsware & Vocaela AI


Model Description

This model optimizes a small VLM for high-resolution GUI grounding. We synthesize high-quality, high-resolution GUI grounding data and continue post-training GUI-Owl-1.5-8B-Instruct with SFT followed by RFT (GRPO). Without reasoning CoT, it achieves 73.2 on ScreenSpot-Pro, the best model across the board. When combined with a zoom-in strategy, it achieves 80.5, the best system across the board. Meanwhile, it maintains excellent performance on regular-resolution tasks, with 94.6 on ScreenSpot-V2.

Key recipe:

  • Data cleaning with MLLM-as-judge: public GUI grounding datasets contain roughly 30% erroneous labels, which strongly degrades model performance on high-resolution images. We therefore perform multiple rounds of data cleaning using an MLLM as judge.
  • Synthesizing high-resolution GUI grounding data
  • Continuing post-training of the model through SFT and GRPO
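The multi-round MLLM-as-judge cleaning step can be sketched as a simple repeated filter. This is a hypothetical interface, not the actual pipeline: `judge` stands in for an MLLM call that checks a sample's point/box label against its screenshot, and the acceptance criteria are assumptions.

```python
from typing import Callable, Iterable


def clean_dataset(
    samples: Iterable[dict],
    judge: Callable[[dict], bool],
    rounds: int = 3,
) -> list[dict]:
    """Multi-round filtering: a sample survives only if the judge
    accepts it in every round (hypothetical interface; in practice
    `judge` wraps an MLLM-as-judge call on image + label)."""
    kept = list(samples)
    for _ in range(rounds):
        kept = [s for s in kept if judge(s)]
    return kept
```

With a noisy judge, repeating the filter over several rounds lowers the chance that a mislabeled sample slips through every pass.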

Benchmark Results

  • Impact of continual post-training on base models

For a controlled comparison, all numbers in this table are produced or reproduced by us, using the same evaluation code in the kv-ground repo. The baseline numbers may therefore differ from the original sources. See the Notes section below for the controlled setup.

    | Models | ScreenSpot-Pro | ScreenSpot-V2 | OSWorld-G | OSWorld-G-refined | UI-Vision |
    |---|---|---|---|---|---|
    | Base: GUI-Owl-1.5-8B-Instruct* | 70.5 | 93.5 | 64.7 | 67.9 | 38.3 |
    | KV-Ground-8B-BaseGuiOwl1.5-0315* | 73.2 (+2.7) | 94.6 (+1.1) | 68.1 (+3.4) | 70.9 (+3.0) | 39.6 (+1.3) |

    The results show:

    • Our continual post-training brings consistent improvements. This exceeded our expectations, since the GUI-Owl-1.5-8B-Instruct technical report discloses that extensive data synthesis and augmentation targeted at high-resolution images had already been applied.
    • The high-resolution-oriented training does not harm regular-resolution tasks. On the contrary, it also brings notable gains on OSWorld-G / OSWorld-G-refined.
  • Comparison with top models (ranked by ScreenSpot-Pro)

    We consider all top models from the ScreenSpot-Pro leaderboard and the most recent related technical reports. We compare pure model capability only, and hence exclude multi-step methods such as zoom-in, MVP, or agentic frameworks.

    On ScreenSpot-Pro, KV-Ground-8B-BaseGuiOwl1.5-0315 ranks best among all models.

    | Models | ScreenSpot-Pro | ScreenSpot-V2 | OSWorld-G | OSWorld-G-refined | UI-Vision |
    |---|---|---|---|---|---|
    | **Specialized GUI Models** | | | | | |
    | KV-Ground-8B-BaseGuiOwl1.5-0315* | 73.2 | 94.6 | 68.1 | 70.9 | 39.6 |
    | GUI-Owl-1.5-32B-Instruct* | 71.3 | 95.1 | 67.0 | 71.8 | 37.5 |
    | Holo2-235B-A22B | 70.6 | 95.9 | 79.0 | - | - |
    | GUI-Owl-1.5-8B-Instruct* | 70.5 | 93.5 | 64.7 | 67.9 | 38.3 |
    | UI-Venus-1.5-30B-A3B | 69.6 | 96.2 | 70.6 | 76.4 | 54.7 |
    | UI-Venus-1.5-8B | 68.4 | 95.9 | 69.7 | 74.1 | 46.5 |
    | MAI-UI-32B | 67.9 | 96.5 | 67.6 | 73.9 | 47.1 |
    | KV-Ground-4B-BaseGuiOwl1.5-0228* | 67.0 | 94.1 | 64.2 | 69.5 | 33.3 |
    | Holo2-30B-A3B | 66.1 | 94.9 | 76.1 | - | - |
    | MAI-UI-8B | 65.8 | 95.2 | 60.1 | 68.6 | 40.7 |
    | GUI-Owl-1.5-4B-Instruct* | 65.3 | 92.8 | 61.7 | 66.8 | 30.4 |
    | KV-Ground-4B-BaseQw3vl* | 63.2 | 94.6 | 64.0 | 71.2 | 32.6 |
    | Step-GUI-8B | 62.6 | 95.1 | 70.0 | - | - |
    | Step-GUI-4B | 60.0 | 93.6 | 66.9 | - | - |
    | Holo2-8B | 58.9 | 93.2 | 70.1 | - | - |
    | Holo2-4B | 57.2 | 93.2 | 69.4 | - | - |
    | GUI-Owl-7B | 54.9 | 92.8 | 55.9 | - | - |
    | OpenCUA-7B | 50.0 | 92.3 | 55.3 | - | 29.7 |
    | UI-Venus-1.0-7B | 50.8 | 94.1 | 54.6 | 61.7 | 36.8 |
    | GTA1-7B | 50.1 | 92.4 | 60.1 | 67.7 | - |
    | UI-TARS-1.5-7B | 35.7 | 91.6 | 52.8 | 64.2 | - |
    | **General VLMs** | | | | | |
    | Qwen3-VL-4B* | 59.5 | 93.1 | 63.3 | 71.1 | 30.4 |
    | Qwen3-VL-8B | 54.6 | - | 58.2 | - | - |
  • Comparison with agentic approaches on ScreenSpot-Pro

    We list the top 10 systems reported on the ScreenSpot-Pro leaderboard and related technical reports. KV-Ground-8B-BaseGuiOwl1.5-0315 + Zoom-In ranks best among all systems.

    | Model / Agentic | ScreenSpot-Pro |
    |---|---|
    | KV-Ground-8B-BaseGuiOwl1.5-0315 + Zoom-In* | 80.5 |
    | GUI-Owl-1.5-32B-Instruct + Zoom-In | 80.3 |
    | Holo2-235B-A22B (Agentic) | 78.5 |
    | GUI-Owl-1.5-8B-Instruct + Zoom-In* | 78.0 |
    | MAI-UI-32B (MVP) | 77.5 |
    | KV-Ground-4B-BaseGuiOwl1.5-0228 + Zoom-In* | 76.4 |
    | GUI-Owl-1.5-4B-Instruct + Zoom-In* | 76.1 |
    | Holo2-30B-A3B (Agentic) | 75.2 |
    | MVP_Qwen3VL-32B | 74.1 |
    | MAI-UI-32B (Zoom In) | 73.5 |
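    The zoom-in strategy is, in essence, a two-pass procedure: ground on the full screenshot, crop a window around the first prediction, re-ground on the crop, and map the refined point back to full-image coordinates. A minimal sketch of the coordinate bookkeeping (the crop size and the single-window choice are assumptions, not the exact pipeline):

    ```python
    def zoom_in_window(image_size, first_pass, crop_size=1000):
        """Given a first-pass click point, compute a crop window around it
        (clamped to the image bounds) and a function mapping crop-local
        coordinates from the second pass back to the full image."""
        W, H = image_size
        x, y = first_pass
        left = min(max(0, x - crop_size // 2), max(0, W - crop_size))
        top = min(max(0, y - crop_size // 2), max(0, H - crop_size))
        box = (left, top, min(W, left + crop_size), min(H, top + crop_size))

        def to_full(cx, cy):
            # Translate a point predicted on the crop back to full-image pixels.
            return (left + cx, top + cy)

        return box, to_full
    ```

    The second model call sees only the crop at a much higher effective resolution, which is why zoom-in helps most on ScreenSpot-Pro's very large screenshots.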

Notes:

  • By default, numbers are taken from each source. Numbers produced by us are marked with *, and may differ from the numbers reported by the sources.
  • Models are usually sensitive to the exact prompt design. Since all runs produced by us use the Qwen3-VL backbone, for a fair and simple comparison we use the same prompt structure of system -> user-image -> user-instruct throughout, with the same system message: the default Qwen3-VL computer-use prompt adopted by the ScreenSpot-Pro leaderboard. For OSWorld-G and OSWorld-G-refined, a minor modification is made to instruct the refusal setting.
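  The system -> user-image -> user-instruct structure above can be written in the standard chat-messages schema; a minimal sketch (the system text is the Qwen3-VL computer-use prompt, shown here only as a placeholder):

  ```python
  def build_messages(system_prompt: str, image_path: str, instruction: str) -> list[dict]:
      """Assemble messages in the order system -> user-image -> user-instruct,
      using the common chat-template content schema (an assumption; the exact
      field names should be taken from the evaluation code)."""
      return [
          {"role": "system", "content": system_prompt},
          {
              "role": "user",
              "content": [
                  {"type": "image", "image": image_path},
                  {"type": "text", "text": instruction},
              ],
          },
      ]
  ```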

Quickstart

The model keeps exactly the same architecture and configs as GUI-Owl-1.5-8B-Instruct (inherited from Qwen3-VL-8B-Instruct), so usage is identical. For detailed examples and the grounding prompt, please check out the benchmark evaluation code in this repo.
