---
license: mit
base_model:
- deepseek-ai/DeepSeek-V4-Flash
library_name: llama.cpp
pipeline_tag: text-generation
tags:
- gguf
- deepseek-v4
- deepseek-v4-flash
- flash-moe
- slot-bank
- ssd
- fp8
- fp4
- mxfp4
- metal
---
# DeepSeek V4 Flash FP4/FP8 SSD Flash-MoE Package
This repository contains an SSD Flash-MoE package for DeepSeek V4 Flash.
It is intended for runtimes that can load a dense GGUF plus a routed expert
sidecar.
## Quantization
- Dense/shared tensors: native DeepSeek FP8, represented as `F8_E4M3_B128` in GGUF.
- Routed MoE expert tensors: native DeepSeek FP4, represented as `MXFP4` in the sidecar manifest.
- Embeddings, output, norms, routing metadata, and IDs may remain `BF16`, `F32`, or `I32` where appropriate.
The routed expert tensors are not stored in the dense GGUF. They are stored in
the sidecar as layer-major binary banks.
## Files
```text
dense/
  model-dense.gguf
  flashmoe-package.json
sidecar/
  manifest.json
  layer_000.bin
  ...
  layer_042.bin
```
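Before loading the package, the layout can be sanity-checked with ordinary shell commands. The snippet below only assumes the files listed above exist (and that `jq` is available for the last line); it does not rely on any particular manifest schema.
```bash
# Dense GGUF and Flash-MoE package descriptor
ls -lh dense/

# Count the layer-major expert banks; with 43 blocks there should be
# 43 files, layer_000.bin through layer_042.bin.
ls sidecar/layer_*.bin | wc -l

# Peek at the top-level keys of the sidecar manifest (requires jq).
jq 'keys' sidecar/manifest.json
```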
## Model Details
- Architecture: `deepseek4`
- Blocks: `43`
- Experts: `256`
- Active experts per token: `6`
- Context length: `1048576`
- Dense GGUF tensors: `1199`
- Routed expert sidecar entries: `129`
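The architecture, block count, expert count, and context length above are recorded in the dense GGUF's metadata and can be cross-checked with the `gguf-dump` script from llama.cpp's gguf-py package (`pip install gguf`). Note that the stock package may not recognise the custom `F8_E4M3_B128` tensor type, so the copy of gguf-py shipped with the Flash-MoE-aware build is the safer choice; this is an optional inspection step, not part of the package.
```bash
# Print the GGUF header and key/value metadata (architecture, block count,
# expert count, context length) without dumping per-tensor details.
gguf-dump --no-tensors dense/model-dense.gguf
```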
## Example
```bash
./build/bin/llama-cli \
  -m dense/model-dense.gguf \
  --moe-mode slot-bank \
  --moe-sidecar sidecar \
  --moe-slot-bank 96 \
  --moe-topk 6 \
  -ngl 999 \
  --moe-cache-io-split 4 \
  -c 8192 \
  -b 128 \
  -ub 1 \
  --no-warmup \
  -p "What is Apple Neural Engine?" \
  -n 256
```
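The `--moe-*` options are specific to the Flash-MoE-aware build: `--moe-sidecar sidecar` points at the sidecar directory from the file listing above, `--moe-mode slot-bank` selects slot-bank expert loading, and `--moe-topk 6` matches the model's six active experts per token; consult that build's `--help` output for the exact semantics of `--moe-slot-bank` and `--moe-cache-io-split`. The remaining flags are standard llama.cpp options: `-ngl 999` offloads all layers to the GPU (Metal on Apple Silicon), `-c 8192` sets the context window for the run, `-b 128` and `-ub 1` set the logical and physical batch sizes, and `-n 256` caps generation at 256 tokens.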
This package is not a standalone dense-only GGUF. Use a Flash-MoE-aware
llama.cpp build that supports DeepSeek V4 Flash, slot-bank mode, and the
native FP8/FP4 tensor types.
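For reference, such a build typically follows the usual llama.cpp CMake flow. The repository URL below is a placeholder for whichever fork or branch carries the DeepSeek V4 Flash and slot-bank support; mainline llama.cpp does not provide the `--moe-*` options shown above.
```bash
# Placeholder URL: substitute the Flash-MoE-aware fork or branch you are using.
git clone https://github.com/your-fork/llama.cpp
cd llama.cpp

# Standard llama.cpp CMake build; GGML_METAL=ON enables the Metal backend
# (it is already the default on macOS).
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j

# The resulting binary matches the path used in the example above.
./build/bin/llama-cli --help
```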