
DeepSeek V4 Flash — Q8xQ5 GGUF

Mixed-precision quantization: Q8_0 (attention, shared expert) + Q5_K (routed experts). Quality equivalent to standard Q5_K_M.

Parameter        Value
Model            DeepSeek V4 Flash
Architecture     284B total, 13B active (MoE, 256 experts, top-6)
Format           GGUF Q8xQ5 (11 parts)
Size             184 GB
Context          up to 1M tokens
Hardware target  Apple M3 Ultra 256 GB

Features

  • Mixed-precision: attention, shared expert, and router in Q8_0; routed experts in Q5_K
  • Per-layer MoE offload: --moe-hot-count auto keeps only the experts each layer actually needs in RAM
    • Typical: 9.7 GB for hot experts (vs 37 GB with a uniform hot count)
    • Automatically adapts to your workload
  • Full 1M context with an MLA-compressed KV cache (~7 GB)

Requirements

Stock llama.cpp cannot load this model. You need our custom build:

git clone https://github.com/setar/llama.cpp.git
cd llama.cpp
git checkout feat/moe-expert-persistence
cmake -S . -B build \
  -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON \
  -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF -DGGML_CCACHE=OFF
cmake --build build --config Release --clean-first -j"$(sysctl -n hw.ncpu)"

# For CUDA (NVIDIA):
#   -DGGML_METAL=OFF -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
#   Replace -j"$(sysctl -n hw.ncpu)" with -j$(nproc)

How to run

./build/bin/llama-server \
  --model DeepSeek-V4-Flash-Instruct-Q8xQ5.gguf-00001-of-00011.gguf \
  --host 0.0.0.0 --port 8082 \
  --ctx-size 1048576 \
  --flash-attn on \
  --batch-size 2048 --ubatch-size 1024 \
  -t 20 --n-gpu-layers all \
  --mlock \
  --moe-hot-count auto \
  --jinja \
  --chat-template-file deepseek-ai-DeepSeek-V4.jinja

The model is split into 11 parts; llama.cpp detects the remaining parts automatically, so just point --model at part 1.
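
Once the server is up, it speaks llama.cpp's OpenAI-compatible HTTP API. A minimal client sketch, assuming the host and port from the command above and the Python requests package:

import requests

# llama-server exposes an OpenAI-compatible chat endpoint; port 8082
# matches the --port flag in the command above.
resp = requests.post(
    "http://localhost:8082/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Once upon a time,"}],
        "max_tokens": 256,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])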

Per-layer MoE offload (--moe-hot-count auto)

Instead of keeping the same number of experts hot in every layer, auto mode tracks which experts are actually used and derives an optimal hot count for each layer from the accumulated statistics. The statistics persist across restarts in ~/.llama/expert_<model>.bin.
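
The fork's exact heuristic is its own; purely as an illustration of the idea (every name below is hypothetical), a per-layer hot count can be chosen as the smallest number of experts that covers a target fraction of the activations recorded for that layer:

import numpy as np

def per_layer_hot_counts(activation_counts, target_coverage=0.99):
    # For each layer, pick the smallest number of experts whose recorded
    # activations cover target_coverage of that layer's total activations.
    # activation_counts: (n_layers, n_experts) array of accumulated counts.
    hot = []
    for layer in activation_counts:
        order = np.sort(layer)[::-1]                      # most-used experts first
        coverage = np.cumsum(order) / max(order.sum(), 1)
        hot.append(int(np.searchsorted(coverage, target_coverage) + 1))
    return hot

# Toy example with 8 experts: layer 0 routes almost everything to 2 experts,
# layer 1 spreads the load evenly and needs the full set hot.
stats = np.array([
    [950, 45, 1, 1, 1, 1, 1, 0],
    [125, 125, 125, 125, 125, 125, 125, 125],
])
print(per_layer_hot_counts(stats))  # [2, 8]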

vs uniform hot-count

                     Uniform hot=64   Per-layer auto       Saving
RAM (experts)        37 GB            9.7 GB               27 GB
Decode speed         ~22 t/s          ~22 t/s              0%
Layers with hot=4    0                24 (deterministic)
Layers with hot=128  0                3 (H2, S25, S37)
Average hot count    64               17
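
As a rough cross-check, scaling the uniform footprint by the ratio of average hot counts lands close to the measured value (this ignores per-layer differences in expert size):

uniform_ram_gb = 37.0            # hot=64 in every layer
print(uniform_ram_gb * 17 / 64)  # ~9.8 GB, close to the measured 9.7 GB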

Per-layer distribution (from ~466k expert activations)

H0:     4   H1:     4   H2:   128     (hash layers)
S25:  128   S29:     7   S32:    20   (mid scored)
S37:  128   S38:    83   S39:    82   (late scored)
Others:    4                           (deterministic)

Performance (M3 Ultra)

Metric           Value
Prefill          ~126 t/s
Decode           ~20 t/s
Hot experts RAM  9.7 GB
Context          1M tokens

Quantization details

tensor types:
  f32:   535 tensors  (norms, biases, embeddings)
  q8_0:  661 tensors  (attention, shared expert, router)
  q5_K:  129 tensors  (expert FFN: gate/up/down)
  i32:     3 tensors  (expert mapping)
BPW: ~5.2
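
The BPW figure is consistent with the sizes above; a back-of-the-envelope check using the 184 GB total size and 284B total parameters (ignoring GB/GiB rounding and file metadata):

size_bits = 184e9 * 8      # total GGUF size from the table above, in bits
params = 284e9             # total parameter count
print(size_bits / params)  # ~5.18 bits per weight, matching "~5.2"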

Files

File                             Size          Description
*.gguf-00001-of-00011 ~ *00011   184 GB total  Model weights (GGUF split)
deepseek-ai-DeepSeek-V4.jinja    2.3 KB        Required chat template
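
To fetch all 11 parts plus the chat template in one go, one option is huggingface_hub's snapshot_download (the destination directory below is just an example):

from huggingface_hub import snapshot_download

# Downloads every file in the repo: the 11 GGUF parts and the .jinja template.
snapshot_download(
    repo_id="setar007/DeepSeek-V4-Flash-Q8xQ5-GGUF",
    local_dir="./DeepSeek-V4-Flash-Q8xQ5",  # example destination
)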

Links

Shout-out to the llama.cpp maintainers — your automated PR triage is impressively fast, even if it didn't catch that the code was reviewed, tested, and benchmarked before submission. 🤖 → ❤️
