KV-Ground-4B-BaseQw3vl

A small GUI grounding model optimized for high-resolution images.

  • Type: Vision-Language Model (VLM) for GUI grounding
  • Size: 4B parameters
  • Input: Image + natural language instruction
  • Output: Text
  • Fine-tuned from: Qwen3-VL-4B-Instruct
  • License: CC BY-NC-SA 4.0
  • Developed by: Kingsware & Vocaela AI


Model Description

This model is developed to optimize small VLMs for high-resolution GUI grounding. We synthesize high-quality, high-resolution GUI grounding data and continue post-training Qwen3-VL-4B-Instruct with SFT followed by RFT (GRPO). Without a reasoning CoT, it achieves 63.2 on ScreenSpot-Pro, making it one of the best models in the 4B range, while maintaining excellent performance on regular-resolution tasks with 94.6 on ScreenSpot-V2.

Key recipe:

  • Data cleaning with MLLM-as-judge: many public GUI grounding datasets contain roughly 30% label errors, which severely degrade model performance on high-resolution images. We therefore carefully perform multiple rounds of data cleaning using an MLLM as judge.
  • Synthesize high-resolution GUI grounding data
  • Continue post-training the model through SFT and GRPO
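The multi-round MLLM-as-judge filtering above can be sketched as follows. The judge interface and the sample format (`instruction`, `crop`) are assumptions for illustration only, not the authors' actual pipeline:

```python
# Illustrative sketch of multi-round MLLM-as-judge data cleaning.
# `judge` stands in for a call to a strong multimodal model; the sample
# format is an assumption made for this sketch.

def judge_sample(judge, instruction, crop_image):
    """Ask the judge whether the cropped UI region matches the instruction."""
    prompt = (
        "Does the UI element in the image match this instruction?\n"
        f"Instruction: {instruction}\nAnswer yes or no."
    )
    return judge(prompt, crop_image).strip().lower().startswith("yes")

def clean_dataset(judge, samples, rounds=3):
    """Keep only samples the judge accepts in every filtering round."""
    kept = list(samples)
    for _ in range(rounds):
        kept = [s for s in kept
                if judge_sample(judge, s["instruction"], s["crop"])]
    return kept
```

Running several rounds lets borderline samples be re-checked, which is useful when the judge itself is noisy.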

Benchmark Results

  • Impact of continue post-training on base models

    For a controlled comparison, all numbers here were reproduced by us using the same evaluation code in the kv-ground repo, so the baseline numbers may differ from the original sources. Please see the Notes section below for the controlled setup.

    | Models | ScreenSpot-Pro | ScreenSpot-V2 | OSWorld-G | OSWorld-G-refined | UI-Vision |
    |---|---|---|---|---|---|
    | Base: Qwen3-VL-4B-Instruct* | 59.5 | 93.1 | 63.3 | 71.1 | 30.4 |
    | KV-Ground-4B-BaseQw3vl* | 63.2 (+3.7) | 94.6 (+1.5) | 64.0 (+0.7) | 71.2 (+0.1) | 32.6 (+2.2) |

    The results show:

    • Our continued post-training method brings consistent improvements
    • The high-resolution-optimized training does not harm regular-resolution tasks
  • Comparison with top models under 8B (ranked by ScreenSpot-Pro)

    We consider the top models under 8B on the ScreenSpot-Pro leaderboard, plus the most recent related technical reports.

    | Models | ScreenSpot-Pro | ScreenSpot-V2 | OSWorld-G | OSWorld-G-refined | UI-Vision |
    |---|---|---|---|---|---|
    | **Specialized GUI Models** | | | | | |
    | UI-Venus-1.5-8B | 68.4 | 93.2 | 69.4 | - | - |
    | KV-Ground-4B-BaseGuiOwl1.5* | 66.5 | 94.3 | 62.8 | 69.1 | 32.2 |
    | MAI-UI-8B | 65.8 | 95.2 | 60.1 | 68.6 | 40.7 |
    | GUI-Owl-1.5-4B-Instruct* | 65.3 | 92.8 | 61.7 | 66.8 | 30.4 |
    | KV-Ground-4B-BaseQw3vl* | 63.2 | 94.6 | 64.0 | 71.2 | 32.6 |
    | Step-GUI-8B | 62.6 | 95.1 | 70.0 | - | - |
    | Step-GUI-4B | 60.0 | 93.6 | 66.9 | - | - |
    | Holo2-8B | 58.9 | 93.2 | 70.1 | - | - |
    | Holo2-4B | 57.2 | 93.2 | 69.4 | - | - |
    | GUI-Owl-7B | 54.9 | 92.8 | 55.9 | - | - |
    | UI-Venus-1.0-7B | 50.8 | 94.1 | 54.6 | 61.7 | 36.8 |
    | GTA1-7B | 50.1 | 92.4 | 60.1 | 67.7 | - |
    | OpenCUA-7B | 50.0 | 92.3 | 55.3 | - | 29.7 |
    | UI-TARS-1.5-7B | 35.7 | 91.6 | 52.8 | 64.2 | - |
    | **General VLMs** | | | | | |
    | Qwen3-VL-4B* | 59.5 | 93.1 | 63.3 | 71.1 | 30.4 |
    | Qwen3-VL-8B | 54.6 | - | 58.2 | - | - |

Notes:

  • By default, numbers are copied from each source.
  • * indicates results produced by us.
  • For all runs produced by us, the same prompt structure of system -> user-image -> user-instruction is used for fair comparison. The same system message is also used: the default Qwen3-VL computer-use prompt, which is also adopted by the ScreenSpot-Pro leaderboard. For OSWorld-G and OSWorld-G-refined, a minor modification is made to instruct the model about the refusal setting.
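As a concrete illustration, the three-turn layout above can be assembled as a Qwen-style chat message list. The system text below is a placeholder, not the exact computer-use prompt used in the evaluation runs:

```python
# Build the system -> user-image -> user-instruction message layout.
# SYSTEM_TEXT is a placeholder; the real runs use the default Qwen3-VL
# computer-use prompt.

SYSTEM_TEXT = "You are a GUI grounding assistant."  # placeholder

def build_messages(image_path, instruction, system_text=SYSTEM_TEXT):
    return [
        {"role": "system",
         "content": [{"type": "text", "text": system_text}]},
        {"role": "user",
         "content": [{"type": "image", "image": image_path}]},
        {"role": "user",
         "content": [{"type": "text", "text": instruction}]},
    ]
```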

Quickstart

The model keeps exactly the same architecture and configs as Qwen3-VL-4B-Instruct, so the usage is identical. For detailed examples and the grounding prompt, please check out the benchmark evaluation code in the kv-ground repo.
