smolvlm-500m-ccmcp-v1

A GUI grounding model fine-tuned on the ScreenSpot dataset. Given a screenshot and an instruction, it predicts click coordinates in a Claude-compatible computer-use format.

Files

File                                     Description              Size
mmproj-smolvlm-500m-ccmcp-v1-f16.gguf    Vision projector (F16)   190.2 MB
smolvlm-500m-ccmcp-v1-Q4_K_M.gguf        Main model (Q4_K_M)      289.2 MB
smolvlm-500m-ccmcp-v1-f16.gguf           Main model (F16)         782.4 MB

Training

  • Base Model: HuggingFaceTB/SmolVLM-500M-Instruct
  • Dataset: ScreenSpot GUI grounding (1,017 examples)
  • Method: LoRA fine-tuning (r=16, alpha=32)
  • Task: Predict click coordinates in Claude format
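To give a feel for what r=16 and alpha=32 mean, here is a minimal sketch of the arithmetic in the standard LoRA formulation, where a frozen weight W receives the update (alpha / r) · B · A. The training script is not part of this card, so the layer width of 1024 below is a hypothetical value chosen only for illustration, and the actual target modules may differ.

```python
def lora_stats(d_out: int, d_in: int, r: int = 16, alpha: int = 32) -> dict:
    """Return the LoRA scaling factor and parameter counts for one weight matrix."""
    full = d_out * d_in            # parameters in the frozen weight W (d_out x d_in)
    adapter = r * (d_out + d_in)   # parameters in B (d_out x r) plus A (r x d_in)
    return {
        "scaling": alpha / r,      # multiplier applied to the low-rank update B @ A
        "full_params": full,
        "adapter_params": adapter,
        "fraction": adapter / full,
    }


# 1024 is a hypothetical layer width, not a value taken from the model card.
stats = lora_stats(1024, 1024)
print(stats)
```

With these settings the low-rank update is scaled by 2.0, and for the assumed 1024 x 1024 layer the adapter trains about 3% as many parameters as the full matrix.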

Output Format

{"action": "left_click", "coordinate": [847, 523]}

Usage with Ollama

# Modelfile
FROM ./smolvlm-500m-ccmcp-v1-Q4_K_M.gguf
FROM ./mmproj-smolvlm-500m-ccmcp-v1-f16.gguf
PARAMETER num_ctx 4096
PARAMETER temperature 0.1
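SYSTEM "You are a GUI grounding assistant. Given a screenshot and instruction, output click coordinates as JSON."

# Shell: build the model, then pass the image path inside the prompt
# (ollama run takes image paths as part of the prompt text, not via an --image flag)
ollama create smolvlm_500m_ccmcp_v1 -f Modelfile
ollama run smolvlm_500m_ccmcp_v1 "Click the Submit button ./screenshot.png"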
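For scripted use, the same model can be queried through Ollama's local REST API instead of the CLI. The following is a minimal sketch using only the Python standard library. It assumes an Ollama server on the default port 11434 and the model name created above; `build_payload` and `ground` are hypothetical helper names introduced here for illustration.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes Ollama's default local port


def build_payload(image_bytes: bytes, instruction: str,
                  model: str = "smolvlm_500m_ccmcp_v1") -> dict:
    """Build a non-streaming /api/generate request body with one attached image."""
    return {
        "model": model,
        "prompt": instruction,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return a single JSON object instead of a stream
    }


def ground(image_path: str, instruction: str) -> str:
    """Send a screenshot and instruction to the local server; return the raw reply text."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), instruction)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, `ground("screenshot.png", "Click the Submit button")` should return the model's reply as a string, which can then be parsed as JSON.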

License

Apache 2.0 (inherited from the base model)

Model tree for pierretokns/smolvlm-500m-ccmcp-v1