# Model Card for SmolLM3-3B-GGUF
This repository contains multiple quantized versions of the SmolLM3-3B model in GGUF format.
It is intended for efficient inference on consumer hardware, making large model deployment more accessible.
## Model Details

### Model Description

- Developed by: leeminwaan
- Funded by: Independent project
- Shared by: leeminwaan
- Model type: Decoder-only transformer language model
- Language(s) (NLP): English (primary); multilingual capabilities not benchmarked
- License: Apache-2.0
### Model Sources

- Repository: https://huggingface.co/leeminwaan/SmolLM3-3B-GGUF
- Paper: Not available
- Demo: To be released
## How to Get Started with the Model

Download a quantized checkpoint with `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Download the Q4_K_M variant (a good default; see the table below)
model_path = hf_hub_download("leeminwaan/SmolLM3-3B-GGUF", "SmolLM3-3B-q4_k_m.gguf")
print("Downloaded:", model_path)
```
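To run the downloaded file locally, one option is the `llama-cpp-python` bindings. The sketch below assumes `pip install llama-cpp-python` and reuses `model_path` from the snippet above; the context size and sampling settings are illustrative values, not tuned recommendations:

```python
from llama_cpp import Llama

# Load the GGUF file; n_ctx sets the context window (illustrative value)
llm = Llama(model_path=model_path, n_ctx=4096)

# Run a short completion; sampling parameters are illustrative
output = llm("Explain quantization in one sentence.", max_tokens=64, temperature=0.7)
print(output["choices"][0]["text"])
```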
Quantized versions available:
- Q2_K, Q3_K_S, Q3_K_M, Q3_K_L
- Q4_0, Q4_1, Q4_K_S, Q4_K_M
- Q5_0, Q5_1, Q5_K_S, Q5_K_M
- Q6_K, Q8_0
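To check the exact filename published for each of these quantizations, the repository contents can be listed programmatically; a minimal sketch using the `huggingface_hub` API:

```python
from huggingface_hub import list_repo_files

# List every GGUF file in the repository
files = [f for f in list_repo_files("leeminwaan/SmolLM3-3B-GGUF") if f.endswith(".gguf")]
print("\n".join(sorted(files)))
```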
## Training Details

### Training Data

- Based on the SmolLM3-3B pretraining corpus (public large-scale web text and open datasets).
- No additional fine-tuning was performed for this release.

### Training Procedure

- The original SmolLM3-3B weights were converted and quantized to the GGUF formats listed above using llama.cpp.
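A representative conversion workflow, sketched in Python with `subprocess`. The llama.cpp script and binary names below (`convert_hf_to_gguf.py`, `llama-quantize`) match recent llama.cpp releases but vary across versions, and the local checkout path is a placeholder, so treat all of these as assumptions:

```python
import subprocess

# Step 1: convert a local checkout of the base model to an FP16 GGUF file
# ("./SmolLM3-3B" is a placeholder path to the Hugging Face checkpoint)
subprocess.run(
    ["python", "convert_hf_to_gguf.py", "./SmolLM3-3B",
     "--outfile", "SmolLM3-3B-f16.gguf", "--outtype", "f16"],
    check=True,
)

# Step 2: quantize the FP16 GGUF to a target type, e.g. Q4_K_M
subprocess.run(
    ["./llama-quantize", "SmolLM3-3B-f16.gguf", "SmolLM3-3B-q4_k_m.gguf", "Q4_K_M"],
    check=True,
)
```

Producing the full set of quantizations above presumably amounts to repeating step 2 with each target type; the FP16 conversion only needs to run once.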
## Quantization Results
| Quantization | Size (vs. FP16) | Speed | Quality | Recommended For |
|---|---|---|---|---|
| Q2_K | Smallest | Fastest | Low | Prototyping, minimal RAM/CPU |
| Q3_K_S | Very Small | Very Fast | Low-Med | Lightweight devices, testing |
| Q3_K_M | Small | Fast | Med | Lightweight, slightly better quality |
| Q3_K_L | Small-Med | Fast | Med | Faster inference, fair quality |
| Q4_0 | Medium | Fast | Good | General use, chats, low RAM |
| Q4_1 | Medium | Fast | Good+ | Recommended, slightly better quality |
| Q4_K_S | Medium | Fast | Good+ | Recommended, balanced |
| Q4_K_M | Medium | Fast | Good++ | Recommended, best Q4 option |
| Q5_0 | Larger | Moderate | Very Good | Chatbots, longer responses |
| Q5_1 | Larger | Moderate | Very Good+ | More demanding tasks |
| Q5_K_S | Larger | Moderate | Very Good+ | Advanced users, better accuracy |
| Q5_K_M | Larger | Moderate | Excellent | Demanding tasks, high quality |
| Q6_K | Large | Slower | Near FP16 | Power users, best quantized quality |
| Q8_0 | Largest | Slowest | FP16-like | Maximum quality, high RAM/CPU |
Note:

- Lower-bit quantization yields a smaller model and faster inference, but lower output quality.
- Q4_K_M is ideal for most users; Q6_K and Q8_0 offer the highest quality and are best for advanced use (a download helper is sketched below).
- All quantizations are suitable for consumer hardware; select based on your quality/speed needs.
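The table's recommendations can be wrapped in a small download helper. The mapping below is a hypothetical illustration: only the Q4_K_M filename is confirmed in this card, and the others are inferred from the same naming pattern, so verify them against the file listing before relying on this:

```python
from huggingface_hub import hf_hub_download

# Hypothetical preference -> filename map following the table's recommendations;
# filenames other than the Q4_K_M one are assumed from the naming pattern.
RECOMMENDED = {
    "smallest": "SmolLM3-3B-q2_k.gguf",
    "balanced": "SmolLM3-3B-q4_k_m.gguf",
    "best": "SmolLM3-3B-q8_0.gguf",
}

def fetch(preference: str = "balanced") -> str:
    """Download the quantization matching a quality/size preference."""
    return hf_hub_download("leeminwaan/SmolLM3-3B-GGUF", RECOMMENDED[preference])
```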
## Technical Specifications

### Software

- llama.cpp (conversion and quantization)
- Python 3.10, huggingface_hub
## Citation

BibTeX:

```bibtex
@misc{SmolLM3-3B-GGUF,
  title={SmolLM3-3B-GGUF Quantized Models},
  author={leeminwaan},
  year={2025},
  howpublished={\url{https://huggingface.co/leeminwaan/SmolLM3-3B-GGUF}}
}
```
APA:
leeminwaan. (2025). SmolLM3-3B-GGUF Quantized Models [Computer software]. Hugging Face. https://huggingface.co/leeminwaan/SmolLM3-3B-GGUF
## Glossary

- Quantization: Reducing the numerical precision of model weights to lower memory usage and speed up inference, at some cost in output quality.
- GGUF: A binary model file format designed for efficient llama.cpp inference.
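As a toy illustration of the quantization idea (not llama.cpp's actual K-quant scheme), a minimal symmetric 8-bit round trip:

```python
import numpy as np

# Toy symmetric int8 quantization: scale weights into [-127, 127] and back
w = np.array([0.12, -0.53, 0.91, -0.07], dtype=np.float32)
scale = np.abs(w).max() / 127.0          # one shared scale for the block
q = np.round(w / scale).astype(np.int8)  # stored low-precision weights
w_hat = q.astype(np.float32) * scale     # dequantized approximation
print(q)      # [ 17 -74 127 -10]
print(w_hat)  # close to w, with small rounding error
```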
## More Information
- This project is experimental.
- Expect further updates and quantization benchmarks.
## Model Card Authors

- leeminwaan

## Model Card Contact

- Hugging Face: leeminwaan