---
license: mit
base_model:
- deepseek-ai/DeepSeek-V4-Flash
library_name: llama.cpp
pipeline_tag: text-generation
tags:
- gguf
- deepseek-v4
- deepseek-v4-flash
- flash-moe
- slot-bank
- ssd
- fp8
- fp4
- mxfp4
- metal
---

# DeepSeek V4 Flash FP4/FP8 SSD Flash-MoE Package

This repository contains an SSD Flash-MoE package for DeepSeek V4 Flash. It is intended for runtimes that can load a dense GGUF plus a routed expert sidecar.

## Quantization

- Dense/shared tensors: native DeepSeek FP8, represented as `F8_E4M3_B128` in GGUF.
- Routed MoE expert tensors: native DeepSeek FP4, represented as `MXFP4` in the sidecar manifest.
- Embeddings, output, norms, routing metadata, and IDs may remain `BF16`, `F32`, or `I32` where appropriate.

The routed expert tensors are not stored in the dense GGUF. They are stored in the sidecar as layer-major binary banks.

## Files

```text
dense/
  model-dense.gguf
  flashmoe-package.json
sidecar/
  manifest.json
  layer_000.bin
  ...
  layer_042.bin
```

## Model Details

- Architecture: `deepseek4`
- Blocks: `43`
- Experts: `256`
- Active experts per token: `6`
- Context length: `1048576`
- Dense GGUF tensors: `1199`
- Routed expert sidecar entries: `129`

## Example

```bash
./build/bin/llama-cli \
  -m dense/model-dense.gguf \
  --moe-mode slot-bank \
  --moe-sidecar sidecar \
  --moe-slot-bank 96 \
  --moe-topk 6 \
  -ngl 999 \
  --moe-cache-io-split 4 \
  -c 8192 \
  -b 128 \
  -ub 1 \
  --no-warmup \
  -p "What is Apple Neural Engine?" \
  -n 256
```

This package is not a standalone dense-only GGUF. Use a Flash-MoE aware llama.cpp build that supports DeepSeek V4 Flash, slot-bank mode, and the native FP8/FP4 tensor types.
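
Before launching, it can help to confirm the package is complete. The sketch below is a minimal sanity check based only on the file layout listed in Files above: it assumes the working directory is the package root and that the sidecar banks run from `layer_000.bin` through `layer_042.bin` (one per block).

```bash
# Minimal package layout check (assumption: run from the package root shown in Files)
for f in dense/model-dense.gguf sidecar/manifest.json; do
  [ -f "$f" ] || echo "missing: $f"
done

# 43 blocks -> sidecar banks layer_000.bin .. layer_042.bin
for i in $(seq -f "%03g" 0 42); do
  [ -f "sidecar/layer_${i}.bin" ] || echo "missing: sidecar/layer_${i}.bin"
done
```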