Quantized GGUF versions of Fu01978/Llama-3.2-3B-MoE-4Expert for efficient local inference with llama.cpp and compatible tools.
This repository contains GGUF quantizations of the 4-expert Mixture-of-Experts (MoE) model.
For full model details, see the original model card.
| Quant Type | Size | Use Case |
|---|---|---|
| BF16 | 19.1 GB | Maximum quality, high VRAM |
| Q4_K_M | 5.86 GB | Best balance of quality/size |
```bash
# Download the model
huggingface-cli download Fu01978/Llama-3.2-3B-MoE-4Expert-Q4_K_M-GGUF Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf --local-dir .

# Run with llama.cpp
./llama-cli -m Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf -p "Write a Python function to reverse a string" -n 512
```
```python
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,
)

output = llm(
    "Explain quantum entanglement in simple terms",
    max_tokens=512,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```
The Q4_K_M quantization provides excellent quality with minimal degradation compared to the original model while using roughly 70% less disk space and memory (5.86 GB vs. 19.1 GB); it is recommended for most use cases. The BF16 version preserves full original quality and is recommended if you have sufficient VRAM/RAM.
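The headline disk-space saving follows directly from the file sizes in the table (a back-of-the-envelope check that ignores runtime buffers):

```python
# Back-of-the-envelope size comparison using the file sizes from the
# table above (GB on disk).
bf16_gb = 19.1
q4_k_m_gb = 5.86
reduction = 1 - q4_k_m_gb / bf16_gb
print(f"Q4_K_M is about {reduction:.0%} smaller on disk than BF16")
```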
Converted to GGUF with llama.cpp's convert_hf_to_gguf.py script.

Base model: Fu01978/Llama-3.2-3B-MoE-4Expert