smolvlm-500m-ccmcp-v1

GUI grounding model fine-tuned on the ScreenSpot dataset for Claude-compatible computer use.

Files

File	Description	Size
`mmproj-smolvlm-500m-ccmcp-v1-f16.gguf`	Vision projector (F16)	190.2 MB
`smolvlm-500m-ccmcp-v1-Q4_K_M.gguf`	Main model (Q4_K_M)	289.2 MB
`smolvlm-500m-ccmcp-v1-f16.gguf`	Main model (F16)	782.4 MB

Training

Base Model: HuggingFaceTB/SmolVLM-500M-Instruct
Dataset: ScreenSpot GUI grounding (1,017 examples)
Method: LoRA fine-tuning (r=16, alpha=32)
Task: Predict click coordinates in Claude format
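
As a rough illustration of what the LoRA settings above imply, the sketch below computes the standard LoRA scaling factor (alpha / r) and the number of trainable parameters a rank-r adapter adds to a single weight matrix. The matrix size of 960 x 960 is a hypothetical value chosen for illustration, not a confirmed dimension of this model, and the card does not say which modules were adapted.

```python
def lora_scaling(r: int, alpha: int) -> float:
    """Standard LoRA scaling applied to the low-rank update: alpha / r."""
    return alpha / r


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


if __name__ == "__main__":
    # r and alpha come from the Training section; 960 is an assumed size.
    print(lora_scaling(r=16, alpha=32))           # 2.0
    print(lora_trainable_params(960, 960, r=16))  # 30720
```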

Output Format

{"action": "left_click", "coordinate": [847, 523]}

Usage with Ollama

# Modelfile
FROM ./smolvlm-500m-ccmcp-v1-Q4_K_M.gguf
FROM ./mmproj-smolvlm-500m-ccmcp-v1-f16.gguf
PARAMETER num_ctx 4096
PARAMETER temperature 0.1
SYSTEM "You are a GUI grounding assistant. Given a screenshot and instruction, output click coordinates as JSON."

ollama create smolvlm_500m_ccmcp_v1 -f Modelfile
ollama run smolvlm_500m_ccmcp_v1 "Click the Submit button ./screenshot.png"
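
For programmatic use, the model created above can also be queried through Ollama's local HTTP API. The sketch below assumes an Ollama server on its default port (11434) and the model name used in the `ollama create` command. `build_payload` and `ground` are hypothetical helper names. The `/api/generate` endpoint takes images as base64-encoded strings in the `images` field.

```python
import base64
import json
import urllib.request

# Default local Ollama endpoint; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes: bytes, instruction: str,
                  model: str = "smolvlm_500m_ccmcp_v1") -> dict:
    """Build a non-streaming /api/generate request body with one image."""
    return {
        "model": model,
        "prompt": instruction,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ground(image_path: str, instruction: str) -> str:
    """Send a screenshot and an instruction; return the raw model text."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), instruction)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


if __name__ == "__main__":
    # Requires a running Ollama server and a local screenshot.png.
    print(ground("screenshot.png", "Click the Submit button"))
```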

License

Apache 2.0 (inherits from base model)

Model details

Format: GGUF
Model size: 0.4B params
Architecture: llama
Hardware compatibility: 4-bit, 16-bit
Model tree for pierretokns/smolvlm-500m-ccmcp-v1: HuggingFaceTB/SmolLM2-360M > HuggingFaceTB/SmolLM2-360M-Instruct > HuggingFaceTB/SmolVLM-500M-Instruct > this model (quantized)