SmolVLM2-500M GGUF

This is a GGUF conversion of HuggingFaceTB/SmolVLM2-500M-Video-Instruct, a compact vision-language model optimized for on-device inference with llama.cpp.

Model Details

  • Original Model: SmolVLM2-500M-Video-Instruct
  • Parameters: 500 million
  • Quantization: Q8_0
  • Model Size: ~437 MB
  • Vision Encoder Size: ~199 MB (F16)
  • Context Window: 8,192 tokens
  • Architecture: SmolVLM2 with SigLIP vision encoder

Files

  • SmolVLM2-500M-Video-Instruct-Q8_0.gguf - Main language model
  • mmproj-SmolVLM2-500M-Video-Instruct-f16.gguf - Vision encoder (mmproj)
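Both files are needed at inference time: the language model and the mmproj vision encoder. If you fetch them programmatically, a minimal sketch using the real `hf_hub_download` API from `huggingface_hub` (the repo id `jc-builds/smolvlm2-500m-gguf` is an assumption about where this card lives):

```python
# Sketch: download both GGUF files with huggingface_hub.
# The default repo_id is an assumption; replace it with the actual repository.
def download_model_files(repo_id: str = "jc-builds/smolvlm2-500m-gguf"):
    # Deferred import so the sketch is readable without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(repo_id, "SmolVLM2-500M-Video-Instruct-Q8_0.gguf")
    mmproj_path = hf_hub_download(repo_id, "mmproj-SmolVLM2-500M-Video-Instruct-f16.gguf")
    return model_path, mmproj_path
```

Both paths returned point into the local Hugging Face cache, so repeated calls do not re-download.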

Intended Use

This model is optimized for:

  • Mobile/Edge Deployment: Small enough to run on phones and other resource-constrained hardware
  • llama.cpp Integration: Compatible with llama.cpp vision features
  • On-Device AI: Private, offline image understanding with minimal resources

Capabilities

  • Image Captioning: Describe images accurately
  • Visual Q&A: Answer questions about images
  • Document Extraction: Extract text from photos
  • Scene Understanding: Analyze visual content
  • Fast Inference: 15-20 tokens/sec on iPhone 15 Pro

Usage with llama.cpp

./llama-llava-cli -m SmolVLM2-500M-Video-Instruct-Q8_0.gguf \
  --mmproj mmproj-SmolVLM2-500M-Video-Instruct-f16.gguf \
  --image your_image.jpg \
  -p "Describe this image"

Prompt Format

<image>
User: {prompt}
Assistant:
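If you assemble this prompt yourself rather than relying on a chat template, the template above reduces to simple string formatting. The `build_prompt` helper below is hypothetical, not part of llama.cpp or the model release:

```python
# Hypothetical helper that fills in the prompt template shown above.
def build_prompt(user_prompt: str) -> str:
    # <image> marks where the vision encoder's tokens are injected.
    return f"<image>\nUser: {user_prompt}\nAssistant:"

prompt = build_prompt("Describe this image")
# prompt == "<image>\nUser: Describe this image\nAssistant:"
```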

License

This model inherits the Apache 2.0 license from the original SmolVLM2 model.

Attribution

Original model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct by the Hugging Face team. This repository (jc-builds/smolvlm2-500m-gguf) contains only the GGUF conversion.