Tags: Text Generation, GGUF, llama.cpp, deepseek-v4, deepseek-v4-flash, flash-moe, slot-bank, ssd, fp8, fp4, mxfp4, metal, conversational
How to use with llama.cpp

Install from brew:

```sh
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf anemll/DeepSeek-V4-Flash-FP4-FP8-SSD

# Run inference directly in the terminal:
llama-cli -hf anemll/DeepSeek-V4-Flash-FP4-FP8-SSD
```

Install from WinGet (Windows):

```sh
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf anemll/DeepSeek-V4-Flash-FP4-FP8-SSD

# Run inference directly in the terminal:
llama-cli -hf anemll/DeepSeek-V4-Flash-FP4-FP8-SSD
```

Use pre-built binary:

```sh
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf anemll/DeepSeek-V4-Flash-FP4-FP8-SSD

# Run inference directly in the terminal:
./llama-cli -hf anemll/DeepSeek-V4-Flash-FP4-FP8-SSD
```

Build from source code:

```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf anemll/DeepSeek-V4-Flash-FP4-FP8-SSD

# Run inference directly in the terminal:
./build/bin/llama-cli -hf anemll/DeepSeek-V4-Flash-FP4-FP8-SSD
```

Use Docker:

```sh
docker model run hf.co/anemll/DeepSeek-V4-Flash-FP4-FP8-SSD
```
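Once a local server is running via any of the `llama-server` commands above, it exposes an OpenAI-compatible HTTP API. A minimal sketch of a chat request with `curl`, assuming the server's default host and port (127.0.0.1:8080); the prompt and `max_tokens` value are just placeholders:

```sh
# Query the local llama-server (assumes the default host/port, 127.0.0.1:8080).
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is Apple Neural Engine?"}
    ],
    "max_tokens": 256
  }'
```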
DeepSeek V4 Flash FP4/FP8 SSD Flash-MoE Package
This repository contains an SSD Flash-MoE package for DeepSeek V4 Flash. It is intended for runtimes that can load a dense GGUF plus a routed expert sidecar.
Quantization
- Dense/shared tensors: native DeepSeek FP8, represented as `F8_E4M3_B128` in GGUF.
- Routed MoE expert tensors: native DeepSeek FP4, represented as `MXFP4` in the sidecar manifest.
- Embeddings, output, norms, routing metadata, and IDs may remain `BF16`, `F32`, or `I32` where appropriate.
The routed expert tensors are not stored in the dense GGUF. They are stored in the sidecar as layer-major binary banks.
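To check which tensor types are actually present in the dense GGUF, the `gguf-dump` utility from llama.cpp's `gguf-py` Python package can list every tensor with its type. Treat this as a sketch rather than a guaranteed workflow: `F8_E4M3_B128` is a non-standard block type, so a stock `gguf` package may report it as unknown or need an updated build to parse it.

```sh
# Sketch: list tensor names and types in the dense GGUF.
# The FP8 block type used here is non-standard and may need updated tooling.
pip install gguf
gguf-dump dense/model-dense.gguf | less
```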
Files
```
dense/
  model-dense.gguf
  flashmoe-package.json
sidecar/
  manifest.json
  layer_000.bin
  ...
  layer_042.bin
```
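To reproduce the `dense/` and `sidecar/` layout above on disk (the example below assumes these relative paths), the whole repository can be downloaded with the Hugging Face CLI; the local directory name here is only an illustration:

```sh
# Fetch the dense GGUF and the expert sidecar into one local directory.
pip install -U "huggingface_hub[cli]"
huggingface-cli download anemll/DeepSeek-V4-Flash-FP4-FP8-SSD \
  --local-dir DeepSeek-V4-Flash-FP4-FP8-SSD
cd DeepSeek-V4-Flash-FP4-FP8-SSD   # dense/ and sidecar/ now resolve as relative paths
```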
Model Details
- Architecture: `deepseek4`
- Blocks: 43
- Experts: 256
- Active experts per token: 6
- Context length: 1048576
- Dense GGUF tensors: 1199
- Routed expert sidecar entries: 129
Example
```sh
# Generate with the dense GGUF plus the routed expert sidecar in slot-bank mode:
./build/bin/llama-cli \
  -m dense/model-dense.gguf \
  --moe-mode slot-bank \
  --moe-sidecar sidecar \
  --moe-slot-bank 96 \
  --moe-topk 6 \
  -ngl 999 \
  --moe-cache-io-split 4 \
  -c 8192 \
  -b 128 \
  -ub 1 \
  --no-warmup \
  -p "What is Apple Neural Engine?" \
  -n 256
```
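For serving instead of one-shot generation, the same package can presumably be loaded by `llama-server` from the same Flash-MoE aware build. Whether the server accepts the identical `--moe-*` flags is an assumption here, so check your build's `--help`:

```sh
# Sketch: serve the package over the OpenAI-compatible API.
# Assumes the Flash-MoE aware build exposes the same --moe-* flags for llama-server.
./build/bin/llama-server \
  -m dense/model-dense.gguf \
  --moe-mode slot-bank \
  --moe-sidecar sidecar \
  --moe-slot-bank 96 \
  --moe-topk 6 \
  -ngl 999 \
  -c 8192
```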
This package is not a standalone dense-only GGUF. Use a Flash-MoE aware llama.cpp build that supports DeepSeek V4 Flash, slot-bank mode, and the native FP8/FP4 tensor types.
Base model: deepseek-ai/DeepSeek-V4-Flash