---
license: mit
base_model:
- deepseek-ai/DeepSeek-V4-Flash
library_name: llama.cpp
pipeline_tag: text-generation
tags:
- gguf
- deepseek-v4
- deepseek-v4-flash
- flash-moe
- slot-bank
- ssd
- fp8
- fp4
- mxfp4
- metal
---
# DeepSeek V4 Flash FP4/FP8 SSD Flash-MoE Package
This repository contains an SSD Flash-MoE package for DeepSeek V4 Flash. It is intended for runtimes that can load a dense GGUF plus a routed expert sidecar.
## Quantization

- Dense/shared tensors: native DeepSeek FP8, represented as `F8_E4M3_B128` in GGUF.
- Routed MoE expert tensors: native DeepSeek FP4, represented as `MXFP4` in the sidecar manifest.
- Embeddings, output, norms, routing metadata, and IDs may remain `BF16`, `F32`, or `I32` where appropriate.
The routed expert tensors are not stored in the dense GGUF. They are stored in the sidecar as layer-major binary banks.
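As a rough illustration of what `MXFP4` storage implies for a loader, the sketch below decodes one microscaling block following the OCP MX format (32 E2M1 elements sharing one E8M0 scale byte). This is a minimal sketch of the published element encoding only; the exact packing and layout of this package's `layer_*.bin` banks is an assumption, including the low-nibble-first element order.

```python
# Hedged sketch: decode one MXFP4 block per the OCP Microscaling (MX) format.
# Block = 1 shared E8M0 scale byte + 32 FP4 (E2M1) elements, two per byte.
# The bank layout in this package's sidecar files is NOT specified here;
# only the element encoding below follows the published MX definition.

# E2M1 magnitudes for the 3 low bits (sign is the top bit of the nibble).
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(nibble: int) -> float:
    """Decode a 4-bit E2M1 value (1 sign, 2 exponent, 1 mantissa bit)."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_LUT[nibble & 0x7]

def decode_mxfp4_block(scale_byte: int, packed: bytes) -> list:
    """Decode packed FP4 elements against one shared E8M0 scale."""
    # E8M0 scale: 2**(e - 127); the all-ones byte 0xFF encodes NaN.
    scale = float("nan") if scale_byte == 0xFF else 2.0 ** (scale_byte - 127)
    out = []
    for b in packed:  # assumed order: low nibble first, then high nibble
        out.append(decode_e2m1(b & 0xF) * scale)
        out.append(decode_e2m1(b >> 4) * scale)
    return out
```

For example, with a scale byte of 127 (scale = 1.0), the packed byte `0x21` decodes to the pair `[0.5, 1.0]`.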
## Files

```
dense/
  model-dense.gguf
  flashmoe-package.json
sidecar/
  manifest.json
  layer_000.bin
  ...
  layer_042.bin
```
## Model Details

- Architecture: `deepseek4`
- Blocks: 43
- Experts: 256
- Active experts per token: 6
- Context length: 1048576
- Dense GGUF tensors: 1199
- Routed expert sidecar entries: 129
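A quick sanity check on the figures above: 129 sidecar entries over 43 blocks works out to exactly three routed expert tensors per block, and 6 of 256 experts active per token means only a small fraction of the routed weights is touched per forward step. The three-tensors-per-block split is an inference from the arithmetic (e.g. gate/up/down projections), not something the manifest states.

```python
# Sanity arithmetic on this package's model details.
# NOTE: the "3 routed tensors per block" interpretation is an assumption
# inferred from 129 / 43; it is not stated by the sidecar manifest.
blocks = 43
experts = 256
active_per_token = 6
sidecar_entries = 129

assert sidecar_entries == blocks * 3      # 43 blocks x 3 routed tensors each
active_fraction = active_per_token / experts
print(active_fraction)                    # share of experts used per token
```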
## Example

```sh
./build/bin/llama-cli \
  -m dense/model-dense.gguf \
  --moe-mode slot-bank \
  --moe-sidecar sidecar \
  --moe-slot-bank 96 \
  --moe-topk 6 \
  -ngl 999 \
  --moe-cache-io-split 4 \
  -c 8192 \
  -b 128 \
  -ub 1 \
  --no-warmup \
  -p "What is Apple Neural Engine?" \
  -n 256
```
This package is not a standalone dense-only GGUF. Use a Flash-MoE-aware llama.cpp build that supports DeepSeek V4 Flash, slot-bank mode, and the native FP8/FP4 tensor types.