Quantized GGUF versions of Fu01978/Llama-3.2-3B-MoE-4Expert for efficient local inference with llama.cpp and compatible tools.
This repository contains GGUF quantizations of the 4-expert Mixture-of-Experts (MoE) model.
For full model details, see the original model card.
| Quant Type | Size | Use Case |
|---|---|---|
| BF16 | 19.1 GB | Maximum quality, high VRAM |
| Q4_K_M | 5.86 GB | Best balance of quality/size |
```bash
# Download the model
huggingface-cli download Fu01978/Llama-3.2-3B-MoE-4Expert-Q4_K_M-GGUF Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf --local-dir .

# Run with llama.cpp
./llama-cli -m Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf -p "Write a Python function to reverse a string" -n 512
```
```python
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,
)

output = llm(
    "Explain quantum entanglement in simple terms",
    max_tokens=512,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```
The Q4_K_M quantization provides excellent quality with minimal degradation compared to the original model while using roughly 70% less disk space and memory (5.86 GB vs. 19.1 GB); it is recommended for most use cases. The BF16 version preserves full original quality and is recommended if you have sufficient VRAM/RAM.
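The headline disk-space saving follows directly from the file sizes in the table (a back-of-the-envelope check that ignores runtime buffers):

```python
# Back-of-the-envelope size comparison using the file sizes from the
# table above (GB on disk).
bf16_gb = 19.1
q4_k_m_gb = 5.86
reduction = 1 - q4_k_m_gb / bf16_gb
print(f"Q4_K_M is about {reduction:.0%} smaller on disk than BF16")
```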
Converted to GGUF with llama.cpp's convert_hf_to_gguf.py script.

Base model: Fu01978/Llama-3.2-3B-MoE-4Expert