# DeepSeek V4 Flash — Q8xQ5 GGUF
Mixed-precision quantization: Q8_0 (attention, shared expert) + Q5_K (routed experts). Quality equivalent to standard Q5_K_M.
| Parameter | Value |
|---|---|
| Model | DeepSeek V4 Flash |
| Architecture | 284B total, 13B active (MoE 256 experts, top-6) |
| Format | GGUF Q8xQ5 (11 parts) |
| Size | 184 GB |
| Context | up to 1M tokens |
| Hardware target | Apple M3 Ultra 256 GB |
## Features
- Mixed-precision: attention/shared expert/router in Q8_0, experts in Q5_K
- Per-layer MoE offload: `--moe-hot-count auto` keeps only the needed experts in RAM
- Typical: 9.7 GB for hot experts (vs 37 GB uniform)
- Automatically adapts to your workload
- Full 1M context with MLA compressed KV cache (~7 GB)
## Requirements
Stock llama.cpp cannot load this model. You need our custom build:

```bash
git clone https://github.com/setar/llama.cpp.git
cd llama.cpp
git checkout feat/moe-expert-persistence

cmake -S . -B build \
  -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON \
  -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF -DGGML_CCACHE=OFF
cmake --build build --config Release --clean-first -j"$(sysctl -n hw.ncpu)"

# For CUDA (NVIDIA) builds, swap the backend flags:
#   -DGGML_METAL=OFF -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
# and replace -j"$(sysctl -n hw.ncpu)" with -j"$(nproc)"
```
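A quick sanity check that the binary built against the right branch (the exact output format may vary between builds):

```bash
# Should print build info with the commit from feat/moe-expert-persistence
./build/bin/llama-server --version
```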
## How to run
```bash
./build/bin/llama-server \
  --model DeepSeek-V4-Flash-Instruct-Q8xQ5.gguf-00001-of-00011.gguf \
  --host 0.0.0.0 --port 8082 \
  --ctx-size 1048576 \
  --flash-attn on \
  --batch-size 2048 --ubatch-size 1024 \
  -t 20 --n-gpu-layers all \
  --mlock \
  --moe-hot-count auto \
  --jinja \
  --chat-template-file deepseek-ai-DeepSeek-V4.jinja
```
The model is split into 11 parts; llama.cpp auto-detects the rest, so just point `--model` at part 1.
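Once it's up, the server speaks the standard llama-server OpenAI-compatible API. A minimal smoke test (adjust host/port to your launch flags):

```bash
# The model field is optional: llama-server serves whatever it loaded
curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```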
## Per-layer MoE offload (`--moe-hot-count auto`)
Instead of keeping the same number of experts hot in every layer, auto mode tracks which experts are actually used and derives per-layer optimal counts from the accumulated statistics. The data persists across restarts in `~/.llama/expert_<model>.bin`.
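Since the counts are learned from your own traffic, a major workload change can leave them stale. Our assumption (check the branch for an official reset mechanism) is that the fork simply re-learns from scratch when the file is missing:

```bash
# Assumption: with no persisted statistics, the next run re-accumulates
# per-layer expert counts from scratch.
rm -f ~/.llama/expert_*.bin
```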
### vs uniform hot-count
| | Uniform hot=64 | Per-layer auto | Saving |
|---|---|---|---|
| RAM (experts) | 37 GB | 9.7 GB | 27 GB |
| Decode speed | ~22 t/s | ~22 t/s | 0% |
| Layers with hot=4 | — | 24 (deterministic) | |
| Layers with hot=128 | — | 3 (H2, S25, S37) | |
| Average hot count | 64 | 17 | |
### Per-layer distribution (from ~466k expert activations)

```text
H0: 4     H1: 4    H2: 128    (hash layers)
S25: 128  S29: 7   S32: 20    (mid scored)
S37: 128  S38: 83  S39: 82    (late scored)
Others: 4                     (deterministic)
```
## Performance (M3 Ultra)
| Metric | Value |
|---|---|
| Prefill | ~126 t/s |
| Decode | ~20 t/s |
| Hot experts RAM | 9.7 GB |
| Context | 1M tokens |
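To reproduce these numbers on your own hardware, `llama-bench` from the same build is the easiest route (a sketch; `llama-bench` may not expose the fork's `--moe-hot-count` flag, in which case it measures the default offload behavior):

```bash
# 512-token prefill and 128-token decode, flash attention enabled
./build/bin/llama-bench \
  -m DeepSeek-V4-Flash-Instruct-Q8xQ5.gguf-00001-of-00011.gguf \
  -p 512 -n 128 -fa 1 -t 20
```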
## Quantization details
Tensor types:

```text
f32:  535 tensors (norms, biases, embeddings)
q8_0: 661 tensors (attention, shared expert, router)
q5_K: 129 tensors (expert FFN: gate/up/down)
i32:    3 tensors (expert mapping)
```

BPW: ~5.2
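The headline figure is easy to sanity-check from file size and parameter count (assuming the 184 GB is decimal gigabytes):

```bash
# 184e9 bytes * 8 bits / 284e9 weights ≈ 5.18 bits per weight
python3 -c 'print(f"{184e9 * 8 / 284e9:.2f} bpw")'
```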
## Files
| File | Size | Description |
|---|---|---|
| `*.gguf-00001-of-00011` ~ `*-00011` | 184 GB total | Model weights (GGUF split) |
| `deepseek-ai-DeepSeek-V4.jinja` | 2.3 KB | Required chat template |
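Everything can be fetched in one go with `huggingface-cli` (a sketch using this model card's repo id):

```bash
# Downloads all 11 GGUF shards plus the chat template (~184 GB)
huggingface-cli download setar007/DeepSeek-V4-Flash-Q8xQ5-GGUF \
  --local-dir ./DeepSeek-V4-Flash-Q8xQ5
```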
## Links
- Original model: deepseek-ai/DeepSeek-V4-Flash
- Custom llama.cpp build: setar/llama.cpp, branch `feat/moe-expert-persistence`
- Per-layer MoE offload PR: #22694 (closed without review, AI co-author detected)
- llama.cpp upstream: ggml-org/llama.cpp
Shout-out to the llama.cpp maintainers — your automated PR triage is impressively fast, even if it didn't catch that the code was reviewed, tested, and benchmarked before submission. 🤖 → ❤️