---
license: mit
base_model:
  - deepseek-ai/DeepSeek-V4-Flash
library_name: llama.cpp
pipeline_tag: text-generation
tags:
  - gguf
  - deepseek-v4
  - deepseek-v4-flash
  - flash-moe
  - slot-bank
  - ssd
  - fp8
  - fp4
  - mxfp4
  - metal
---

DeepSeek V4 Flash FP4/FP8 SSD Flash-MoE Package

This repository contains an SSD Flash-MoE package for DeepSeek V4 Flash. It is intended for runtimes that can load a dense GGUF plus a routed expert sidecar.

Quantization

  • Dense/shared tensors: native DeepSeek FP8, represented as F8_E4M3_B128 in GGUF.
  • Routed MoE expert tensors: native DeepSeek FP4, represented as MXFP4 in the sidecar manifest.
  • Embeddings, the output head, norms, routing metadata, and ID tensors remain BF16, F32, or I32 as appropriate.
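For orientation, the two element formats above can be decoded as follows. This is a generic sketch of OCP FP8 E4M3 and FP4 E2M1 element decoding, not code from this package; the block scales these formats are paired with (128-element blocks for F8_E4M3_B128, 32-element MX blocks with a shared E8M0 scale for MXFP4) are applied separately and omitted here.

```python
# Sketch: per-element decode of FP8 E4M3 and FP4 E2M1 values.
# Illustrative only -- real dequantization also multiplies in the
# per-block scales, which this sketch omits.

def decode_e4m3(byte: int) -> float:
    """Decode one OCP FP8 E4M3 value (1 sign, 4 exponent, 3 mantissa, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:
        return float("nan")                      # E4M3 has NaN but no infinities
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6    # subnormal range
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def decode_e2m1(nibble: int) -> float:
    """Decode one FP4 E2M1 value (1 sign, 2 exponent, 1 mantissa, bias 1)."""
    sign = -1.0 if nibble & 0x8 else 1.0
    exp = (nibble >> 1) & 0x3
    man = nibble & 0x1
    if exp == 0:
        return sign * (man / 2.0)                # subnormal: 0 or 0.5
    return sign * (1.0 + man / 2.0) * 2.0 ** (exp - 1)

print(decode_e4m3(0x38))  # 1.0
print(decode_e2m1(0x7))   # 6.0 (largest E2M1 magnitude)
```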

The routed expert tensors are not stored in the dense GGUF; they are stored in the sidecar as layer-major binary banks, one layer_NNN.bin bank per block.
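The manifest schema is not documented here. As a sketch, assuming manifest.json maps each layer bank to per-expert byte offsets and lengths, a loader could slice one expert out of a bank as below. All field names (layers, experts, offset, nbytes) are hypothetical, not the package's actual schema.

```python
import os

# Hypothetical manifest layout -- field names are illustrative only.
# Each layer bank concatenates that block's routed-expert tensors; the
# manifest records where each expert's bytes begin and how long they are.
manifest = {
    "layers": [
        {"file": "layer_000.bin",
         "experts": [{"id": 0, "offset": 0, "nbytes": 64},
                     {"id": 1, "offset": 64, "nbytes": 64}]},
    ]
}

def load_expert_bytes(sidecar_dir, layer_idx, expert_id, manifest):
    """Read one routed expert's raw bytes out of its layer-major bank."""
    layer = manifest["layers"][layer_idx]
    entry = next(e for e in layer["experts"] if e["id"] == expert_id)
    with open(os.path.join(sidecar_dir, layer["file"]), "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["nbytes"])
```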

Files

dense/
  model-dense.gguf
  flashmoe-package.json

sidecar/
  manifest.json
  layer_000.bin
  ...
  layer_042.bin

Model Details

  • Architecture: deepseek4
  • Blocks: 43
  • Experts: 256
  • Active experts per token: 6
  • Context length: 1048576
  • Dense GGUF tensors: 1199
  • Routed expert sidecar entries: 129
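The counts above are self-consistent under one plausible reading, which is an assumption rather than a documented layout: three routed-expert weight banks per block (gate/up/down projections) across 43 blocks gives the 129 sidecar entries, and top-6 routing over 256 experts means only about 2.3% of routed experts are active per token.

```python
# Sanity arithmetic over the model-card numbers. The "3 banks per block"
# reading (gate/up/down projections) is an assumption, not documented.
blocks = 43
experts = 256
active = 6
sidecar_entries = 129

assert blocks * 3 == sidecar_entries   # 43 blocks x 3 projection banks
print(f"active expert fraction: {active / experts:.2%}")  # 2.34%
```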

Example

./build/bin/llama-cli \
  -m dense/model-dense.gguf \
  --moe-mode slot-bank \
  --moe-sidecar sidecar \
  --moe-slot-bank 96 \
  --moe-topk 6 \
  -ngl 999 \
  --moe-cache-io-split 4 \
  -c 8192 \
  -b 128 \
  -ub 1 \
  --no-warmup \
  -p "What is Apple Neural Engine?" \
  -n 256
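The --moe-slot-bank 96 flag caps how many routed experts stay resident at once; the rest are paged in from the SSD sidecar on demand. A minimal sketch of that idea follows; it is a generic LRU slot cache illustrating the concept, not this runtime's actual implementation.

```python
from collections import OrderedDict

class SlotBank:
    """LRU cache of resident expert weights, sized like --moe-slot-bank.

    Generic illustration of slot-bank paging, not llama.cpp's code: on a
    miss the least-recently-used slot is evicted and the expert is
    (notionally) read from the SSD sidecar via the fetch callback.
    """
    def __init__(self, slots, fetch):
        self.slots = slots          # max resident experts (e.g. 96)
        self.fetch = fetch          # callback: (layer, expert) -> weights
        self.bank = OrderedDict()   # (layer, expert) -> weights, LRU order
        self.hits = self.misses = 0

    def get(self, layer, expert):
        key = (layer, expert)
        if key in self.bank:
            self.bank.move_to_end(key)            # mark most-recently-used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.bank) >= self.slots:
                self.bank.popitem(last=False)     # evict LRU slot
            self.bank[key] = self.fetch(layer, expert)
        return self.bank[key]

# Toy usage: 4 slots, a dummy fetch standing in for an SSD read.
bank = SlotBank(slots=4, fetch=lambda l, e: f"weights[{l}:{e}]")
for expert in [0, 1, 2, 3, 0, 1, 4, 0]:
    bank.get(0, expert)
print(bank.hits, bank.misses)  # 3 5
```

Because routing reuses hot experts across consecutive tokens, a slot bank much smaller than the full 256-expert set can keep the miss (SSD read) rate low.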

This package is not a standalone dense-only GGUF. Use a Flash-MoE-aware llama.cpp build that supports DeepSeek V4 Flash, slot-bank mode, and the native FP8/FP4 tensor types.