Llama-3.2-3B-MoE-4Expert-Q4_K_M-GGUF

Quantized GGUF versions of Fu01978/Llama-3.2-3B-MoE-4Expert for efficient local inference with llama.cpp and compatible tools.

Model Description

This repository contains GGUF quantizations of a 4-expert Mixture-of-Experts (MoE) model whose experts specialize in:

  • General chat & explanations
  • Code & programming
  • Creative writing
  • Mathematics

For full model details, see the original model card.

Available Files

Quant Type | Size    | Use Case
-----------|---------|------------------------------
BF16       | 19.1 GB | Maximum quality, high VRAM
Q4_K_M     | 5.86 GB | Best balance of quality/size

Quantization Details

  • BF16: Full-precision GGUF format; quality identical to the original model
  • Q4_K_M: 4-bit quantization with medium quality; recommended for most users

Usage

llama.cpp

# Download the model
huggingface-cli download Fu01978/Llama-3.2-3B-MoE-4Expert-Q4_K_M-GGUF Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf --local-dir .

# Run with llama.cpp
./llama-cli -m Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf -p "Write a Python function to reverse a string" -n 512
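Beyond one-shot CLI runs, llama.cpp also ships an OpenAI-compatible HTTP server (llama-server). A minimal sketch, assuming the binary and the downloaded GGUF sit in the current directory; the port and context size are illustrative choices, not values from this repo:

```shell
# Serve the model over an OpenAI-compatible HTTP API
./llama-server -m Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf -c 4096 --port 8080

# In another terminal: query the chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a haiku about GGUF"}], "max_tokens": 128}'
```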

Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,
)

output = llm(
    "Explain quantum entanglement in simple terms",
    max_tokens=512,
    temperature=0.7,
)
print(output['choices'][0]['text'])
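llama-cpp-python also exposes a chat-style API (create_chat_completion) that applies the chat template embedded in the GGUF rather than treating the prompt as raw text. A sketch under the same assumption as above, that the Q4_K_M file is in the current directory; the prompt and max_tokens are illustrative:

```python
from llama_cpp import Llama

# Assumes the Q4_K_M GGUF from this repo is in the current directory
llm = Llama(model_path="Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf", n_ctx=2048)

# create_chat_completion applies the model's embedded chat template
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize mixture-of-experts in two sentences."}],
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
```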

Performance Notes

The Q4_K_M quantization provides strong quality with minimal degradation compared to the original model while using roughly 69% less disk space and memory than the BF16 version (5.86 GB vs. 19.1 GB), and is recommended for most use cases. The BF16 version maintains full original quality and is recommended if you have sufficient VRAM/RAM.
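The space saving follows directly from the file sizes in the table above; a quick sanity check of the arithmetic:

```python
bf16_gb = 19.1  # BF16 GGUF size from the Available Files table
q4_gb = 5.86    # Q4_K_M GGUF size

# Fractional reduction in disk footprint going from BF16 to Q4_K_M
reduction = 1 - q4_gb / bf16_gb
print(f"Q4_K_M is {reduction:.0%} smaller than BF16")  # prints: Q4_K_M is 69% smaller than BF16
```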

Conversion Details

  • Original Model: Fu01978/Llama-3.2-3B-MoE-4Expert
  • Conversion Tool: llama.cpp (BF16 GGUF produced with the convert_hf_to_gguf.py script)
  • Quantization Method: Q4_K_M, produced from the BF16 GGUF with llama.cpp quantization
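The pipeline above can be reproduced with stock llama.cpp tooling. A sketch, assuming a local checkout of llama.cpp and the original HF checkpoint downloaded to ./Llama-3.2-3B-MoE-4Expert; the output filenames are illustrative:

```shell
# Step 1: convert the HF checkpoint to a BF16 GGUF
python convert_hf_to_gguf.py ./Llama-3.2-3B-MoE-4Expert \
  --outtype bf16 --outfile Llama-3.2-3B-MoE-4Expert.BF16.gguf

# Step 2: quantize the BF16 GGUF down to Q4_K_M
./llama-quantize Llama-3.2-3B-MoE-4Expert.BF16.gguf \
  Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf Q4_K_M
```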

Acknowledgments

  • Original MoE model by Fu01978
  • GGUF conversion using llama.cpp
  • Base models: Meta AI (Llama 3.2), unsloth, prithivMLmods, DavidAU