SmolVLM2-500M GGUF

This is a GGUF conversion of HuggingFaceTB/SmolVLM2-500M-Video-Instruct, a compact vision-language model optimized for on-device inference with llama.cpp.

Model Details

  • Original Model: SmolVLM2-500M-Video-Instruct
  • Parameters: 500 million
  • Quantization: Q8_0
  • Model Size: ~437 MB
  • Vision Encoder Size: ~199 MB (F16)
  • Context Window: 8,192 tokens
  • Architecture: SmolVLM2 with SigLIP vision encoder

Files

  • SmolVLM2-500M-Video-Instruct-Q8_0.gguf - Main language model
  • mmproj-SmolVLM2-500M-Video-Instruct-f16.gguf - Vision encoder (mmproj)
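Both files are needed at inference time: the language model and the mmproj vision encoder. If you fetch them programmatically, a minimal sketch using the real `hf_hub_download` API from `huggingface_hub` (the repo id `jc-builds/smolvlm2-500m-gguf` is an assumption about where this card lives):

```python
# Sketch: download both GGUF files with huggingface_hub.
# The default repo_id is an assumption; replace it with the actual repository.
def download_model_files(repo_id: str = "jc-builds/smolvlm2-500m-gguf"):
    # Deferred import so the sketch is readable without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(repo_id, "SmolVLM2-500M-Video-Instruct-Q8_0.gguf")
    mmproj_path = hf_hub_download(repo_id, "mmproj-SmolVLM2-500M-Video-Instruct-f16.gguf")
    return model_path, mmproj_path
```

Both paths returned point into the local Hugging Face cache, so repeated calls do not re-download.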

Intended Use

This model is optimized for:

  • Mobile/Edge Deployment: Small enough to run on phones and other resource-constrained hardware
  • llama.cpp Integration: Compatible with llama.cpp vision features
  • On-Device AI: Private, offline image understanding with minimal resources

Capabilities

  • Image Captioning: Describe images accurately
  • Visual Q&A: Answer questions about images
  • Document Extraction: Extract text from photos
  • Scene Understanding: Analyze visual content
  • Fast Inference: 15-20 tokens/sec on iPhone 15 Pro

Usage with llama.cpp

./llama-llava-cli -m SmolVLM2-500M-Video-Instruct-Q8_0.gguf \
  --mmproj mmproj-SmolVLM2-500M-Video-Instruct-f16.gguf \
  --image your_image.jpg \
  -p "Describe this image"

Prompt Format

<image>
User: {prompt}
Assistant:
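If you assemble this prompt yourself rather than relying on a chat template, the template above reduces to simple string formatting. The `build_prompt` helper below is hypothetical, not part of llama.cpp or the model release:

```python
# Hypothetical helper that fills in the prompt template shown above.
def build_prompt(user_prompt: str) -> str:
    # <image> marks where the vision encoder's tokens are injected.
    return f"<image>\nUser: {user_prompt}\nAssistant:"

prompt = build_prompt("Describe this image")
# prompt == "<image>\nUser: Describe this image\nAssistant:"
```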

License

This model inherits the Apache 2.0 license from the original SmolVLM2 model.

Attribution

Original model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct by the Hugging Face team. This repository (jc-builds/smolvlm2-500m-gguf) contains only the GGUF conversion.