# SmolVLM2-500M GGUF
This is a GGUF conversion of HuggingFaceTB/SmolVLM2-500M-Video-Instruct, a compact vision-language model optimized for on-device inference with llama.cpp.
## Model Details
| Property | Value |
|---|---|
| Original Model | SmolVLM2-500M-Video-Instruct |
| Parameters | 500 million |
| Quantization | Q8_0 |
| Model Size | ~437 MB |
| Vision Encoder Size | ~199 MB (F16) |
| Context Window | 8,192 tokens |
| Architecture | SmolVLM2 with SigLIP vision encoder |
## Files

- `SmolVLM2-500M-Video-Instruct-Q8_0.gguf` - Main language model
- `mmproj-SmolVLM2-500M-Video-Instruct-f16.gguf` - Vision encoder (mmproj)
## Intended Use
This model is optimized for:
- Mobile/Edge Deployment: Small enough to run efficiently on iOS and other edge devices
- llama.cpp Integration: Compatible with llama.cpp vision features
- On-Device AI: Private, offline image understanding with minimal resources
## Capabilities
- Image Captioning: Describe images accurately
- Visual Q&A: Answer questions about images
- Document Extraction: Extract text from photos
- Scene Understanding: Analyze visual content
- Fast Inference: 15-20 tokens/sec on iPhone 15 Pro
## Usage with llama.cpp

```shell
./llama-llava-cli -m SmolVLM2-500M-Video-Instruct-Q8_0.gguf \
    --mmproj mmproj-SmolVLM2-500M-Video-Instruct-f16.gguf \
    --image your_image.jpg \
    -p "Describe this image"
```
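The same model pair can also be served over HTTP. A hedged sketch, assuming a recent llama.cpp build whose `llama-server` accepts `--mmproj` and exposes the OpenAI-compatible `/v1/chat/completions` route; the placeholder image, filenames, and port 8080 are assumptions for illustration, not part of this repo:

```shell
# Write a placeholder image so the payload step below is self-contained;
# substitute your real photo here.
printf 'placeholder-image-bytes' > your_image.jpg

# Base64-encode the image for an OpenAI-style multimodal chat request.
IMG_B64=$(base64 < your_image.jpg | tr -d '\n')
cat > request.json <<EOF
{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Describe this image"},
      {"type": "image_url",
       "image_url": {"url": "data:image/jpeg;base64,${IMG_B64}"}}
    ]
  }]
}
EOF

# With the server running:
#   ./llama-server -m SmolVLM2-500M-Video-Instruct-Q8_0.gguf \
#     --mmproj mmproj-SmolVLM2-500M-Video-Instruct-f16.gguf --port 8080
# send the request:
#   curl -s http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" -d @request.json
```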
## Prompt Format

```
<image>
User: {prompt}
Assistant:
```
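As a concrete illustration, the template above expands like this for a single-image, single-turn request (a minimal sketch; when you pass `--image` and `-p`, the CLI fills in the template and injects the encoded image at the `<image>` marker itself):

```shell
# Expand the prompt template for one user turn.
# <image> marks where llama.cpp injects the encoded image tokens.
PROMPT="Describe this image"
FULL_PROMPT="<image>
User: ${PROMPT}
Assistant:"
printf '%s\n' "$FULL_PROMPT"
```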
## License
This model inherits the Apache 2.0 license from the original SmolVLM2 model.
## Attribution
- Original Model: SmolVLM2-500M-Video-Instruct by Hugging Face
- GGUF Conversion: ggml-org, hosted by jc-builds