---
license: mit
base_model:
- deepseek-ai/DeepSeek-V4-Flash
library_name: llama.cpp
pipeline_tag: text-generation
tags:
- gguf
- deepseek-v4
- deepseek-v4-flash
- flash-moe
- slot-bank
- ssd
- fp8
- fp4
- mxfp4
- metal
---

# DeepSeek V4 Flash FP4/FP8 SSD Flash-MoE Package

This repository contains an SSD Flash-MoE package for DeepSeek V4 Flash. It is intended for runtimes that can load a dense GGUF plus a routed expert sidecar.

## Quantization

- Dense/shared tensors: native DeepSeek FP8, represented as `F8_E4M3_B128` in GGUF.
- Routed MoE expert tensors: native DeepSeek FP4, represented as `MXFP4` in the sidecar manifest.
- Embeddings, output, norms, routing metadata, and IDs may remain `BF16`, `F32`, or `I32` where appropriate.

The routed expert tensors are not stored in the dense GGUF. They are stored in the sidecar as layer-major binary banks.

## Files

```text
dense/
  model-dense.gguf
  flashmoe-package.json
sidecar/
  manifest.json
  layer_000.bin
  ...
  layer_042.bin
```

## Model Details

- Architecture: `deepseek4`
- Blocks: `43`
- Experts: `256`
- Active experts per token: `6`
- Context length: `1048576`
- Dense GGUF tensors: `1199`
- Routed expert sidecar entries: `129`

## Example

```bash
./build/bin/llama-cli \
  -m dense/model-dense.gguf \
  --moe-mode slot-bank \
  --moe-sidecar sidecar \
  --moe-slot-bank 96 \
  --moe-topk 6 \
  -ngl 999 \
  --moe-cache-io-split 4 \
  -c 8192 \
  -b 128 \
  -ub 1 \
  --no-warmup \
  -p "What is Apple Neural Engine?" \
  -n 256
```

This package is not a standalone dense-only GGUF. Use a Flash-MoE aware llama.cpp build that supports DeepSeek V4 Flash, slot-bank mode, and the native FP8/FP4 tensor types.
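
Before launching, it can help to confirm the package is complete. The sketch below is a minimal sanity check based only on the file layout listed in Files above: it assumes the working directory is the package root and that the sidecar banks run from `layer_000.bin` through `layer_042.bin` (one per block).

```bash
# Minimal package layout check (assumption: run from the package root shown in Files)
for f in dense/model-dense.gguf sidecar/manifest.json; do
  [ -f "$f" ] || echo "missing: $f"
done

# 43 blocks -> sidecar banks layer_000.bin .. layer_042.bin
for i in $(seq -f "%03g" 0 42); do
  [ -f "sidecar/layer_${i}.bin" ] || echo "missing: sidecar/layer_${i}.bin"
done
```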