# Model Card for SmolLM3-3B-GGUF
This repository contains multiple quantized versions of the SmolLM3-3B model in GGUF format.
It is intended for efficient inference on consumer hardware, making large model deployment more accessible.
## Model Details

### Model Description

- Developed by: leeminwaan
- Funded by: Independent project
- Shared by: leeminwaan
- Model type: Decoder-only transformer language model
- Language(s) (NLP): English (primary); multilingual capabilities not benchmarked
- License: Apache-2.0
### Model Sources

- Repository: https://huggingface.co/leeminwaan/SmolLM3-3B-GGUF
- Paper: Not available
- Demo: To be released
## How to Get Started with the Model

Download a quantized checkpoint with `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Download the Q4_K_M variant (a good default; see the table below)
model_path = hf_hub_download("leeminwaan/SmolLM3-3B-GGUF", "SmolLM3-3B-q4_k_m.gguf")
print("Downloaded:", model_path)
```
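To run the downloaded file locally, one option is the `llama-cpp-python` bindings. The sketch below assumes `pip install llama-cpp-python` and reuses `model_path` from the snippet above; the context size and sampling settings are illustrative values, not tuned recommendations:

```python
from llama_cpp import Llama

# Load the GGUF file; n_ctx sets the context window (illustrative value)
llm = Llama(model_path=model_path, n_ctx=4096)

# Run a short completion; sampling parameters are illustrative
output = llm("Explain quantization in one sentence.", max_tokens=64, temperature=0.7)
print(output["choices"][0]["text"])
```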
Quantized versions available:
- Q2_K, Q3_K_S, Q3_K_M, Q3_K_L
- Q4_0, Q4_1, Q4_K_S, Q4_K_M
- Q5_0, Q5_1, Q5_K_S, Q5_K_M
- Q6_K, Q8_0
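To check the exact filename published for each of these quantizations, the repository contents can be listed programmatically; a minimal sketch using the `huggingface_hub` API:

```python
from huggingface_hub import list_repo_files

# List every GGUF file in the repository
files = [f for f in list_repo_files("leeminwaan/SmolLM3-3B-GGUF") if f.endswith(".gguf")]
print("\n".join(sorted(files)))
```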
## Training Details

### Training Data

- Based on the SmolLM3-3B pretraining corpus (public large-scale web text and open datasets).
- No additional fine-tuning was performed for this release.

### Training Procedure

- The original SmolLM3-3B weights were converted and quantized to the GGUF formats listed above using llama.cpp.
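A representative conversion workflow, sketched in Python with `subprocess`. The llama.cpp script and binary names below (`convert_hf_to_gguf.py`, `llama-quantize`) match recent llama.cpp releases but vary across versions, and the local checkout path is a placeholder, so treat all of these as assumptions:

```python
import subprocess

# Step 1: convert a local checkout of the base model to an FP16 GGUF file
# ("./SmolLM3-3B" is a placeholder path to the Hugging Face checkpoint)
subprocess.run(
    ["python", "convert_hf_to_gguf.py", "./SmolLM3-3B",
     "--outfile", "SmolLM3-3B-f16.gguf", "--outtype", "f16"],
    check=True,
)

# Step 2: quantize the FP16 GGUF to a target type, e.g. Q4_K_M
subprocess.run(
    ["./llama-quantize", "SmolLM3-3B-f16.gguf", "SmolLM3-3B-q4_k_m.gguf", "Q4_K_M"],
    check=True,
)
```

Producing the full set of quantizations above presumably amounts to repeating step 2 with each target type; the FP16 conversion only needs to run once.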
## Quantization Results
| Quantization | Size (vs. FP16) | Speed | Quality | Recommended For |
|---|---|---|---|---|
| Q2_K | Smallest | Fastest | Low | Prototyping, minimal RAM/CPU |
| Q3_K_S | Very Small | Very Fast | Low-Med | Lightweight devices, testing |
| Q3_K_M | Small | Fast | Med | Lightweight, slightly better quality |
| Q3_K_L | Small-Med | Fast | Med | Faster inference, fair quality |
| Q4_0 | Medium | Fast | Good | General use, chats, low RAM |
| Q4_1 | Medium | Fast | Good+ | Recommended, slightly better quality |
| Q4_K_S | Medium | Fast | Good+ | Recommended, balanced |
| Q4_K_M | Medium | Fast | Good++ | Recommended, best Q4 option |
| Q5_0 | Larger | Moderate | Very Good | Chatbots, longer responses |
| Q5_1 | Larger | Moderate | Very Good+ | More demanding tasks |
| Q5_K_S | Larger | Moderate | Very Good+ | Advanced users, better accuracy |
| Q5_K_M | Larger | Moderate | Excellent | Demanding tasks, high quality |
| Q6_K | Large | Slower | Near FP16 | Power users, best quantized quality |
| Q8_0 | Largest | Slowest | FP16-like | Maximum quality, high RAM/CPU |
Note:

- Lower-bit quantization yields a smaller model and faster inference, but lower output quality.
- Q4_K_M is ideal for most users; Q6_K and Q8_0 offer the highest quality and are best for advanced use (a download helper is sketched below).
- All quantizations are suitable for consumer hardware; select based on your quality/speed needs.
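The table's recommendations can be wrapped in a small download helper. The mapping below is a hypothetical illustration: only the Q4_K_M filename is confirmed in this card, and the others are inferred from the same naming pattern, so verify them against the file listing before relying on this:

```python
from huggingface_hub import hf_hub_download

# Hypothetical preference -> filename map following the table's recommendations;
# filenames other than the Q4_K_M one are assumed from the naming pattern.
RECOMMENDED = {
    "smallest": "SmolLM3-3B-q2_k.gguf",
    "balanced": "SmolLM3-3B-q4_k_m.gguf",
    "best": "SmolLM3-3B-q8_0.gguf",
}

def fetch(preference: str = "balanced") -> str:
    """Download the quantization matching a quality/size preference."""
    return hf_hub_download("leeminwaan/SmolLM3-3B-GGUF", RECOMMENDED[preference])
```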
## Technical Specifications

### Software

- llama.cpp (conversion and quantization)
- Python 3.10, huggingface_hub
## Citation

BibTeX:

```bibtex
@misc{SmolLM3-3B-GGUF,
  title={SmolLM3-3B-GGUF Quantized Models},
  author={leeminwaan},
  year={2025},
  howpublished={\url{https://huggingface.co/leeminwaan/SmolLM3-3B-GGUF}}
}
```
APA:
leeminwaan. (2025). SmolLM3-3B-GGUF Quantized Models [Computer software]. Hugging Face. https://huggingface.co/leeminwaan/SmolLM3-3B-GGUF
## Glossary

- Quantization: Reducing the numerical precision of model weights to lower memory usage and speed up inference, at some cost in output quality.
- GGUF: A binary model file format designed for efficient llama.cpp inference.
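As a toy illustration of the quantization idea (not llama.cpp's actual K-quant scheme), a minimal symmetric 8-bit round trip:

```python
import numpy as np

# Toy symmetric int8 quantization: scale weights into [-127, 127] and back
w = np.array([0.12, -0.53, 0.91, -0.07], dtype=np.float32)
scale = np.abs(w).max() / 127.0          # one shared scale for the block
q = np.round(w / scale).astype(np.int8)  # stored low-precision weights
w_hat = q.astype(np.float32) * scale     # dequantized approximation
print(q)      # [ 17 -74 127 -10]
print(w_hat)  # close to w, with small rounding error
```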
## More Information
- This project is experimental.
- Expect further updates and quantization benchmarks.
## Model Card Authors

- leeminwaan

## Model Card Contact

- Hugging Face: leeminwaan