---
license: mit
library_name: gguf
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-V4-Flash
base_model_relation: quantized
quantized_by: antirez
language:
- en
tags:
- gguf
- quantized
- deepseek
- deepseek-v4
- deepseek-v4-flash
- moe
- mixture-of-experts
- 2-bit
- 4-bit
- iq2_xxs
- q2_k
- q4_k
- ds4
- apple-silicon
- metal
---

# DeepSeek V4 Flash — GGUF for ds4
These quants are specific to the DS4 inference engine. They may or may not work with other inference engines (they should, with the exception of the MTP model, which requires a dedicated loader).
https://github.com/antirez/ds4
## Files
| File | Size | Routed experts (`ffn_{gate,up,down}_exps`) | Everything else |
|---|---|---|---|
| `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf` | 80.8 GiB | IQ2_XXS (gate, up) + Q2_K (down) | Q8_0 attn proj / shared experts / output, F16 router + embed + indexer + compressor + HC, F32 norms / sinks / bias |
| `DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf` | 153.3 GiB | Q4_K (all three) | same as above |
| `DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf` | 3.6 GiB | MTP / speculative-decoding support (optional, not standalone) | |
Use q2 on 128 GB Macs, q4 on machines with ≥ 256 GB of RAM, and pair either with the MTP file for optional speculative decoding.
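As a rough illustration of these cutoffs, a fit check in Python; the weight sizes come from the table above, while the headroom figure is an assumption standing in for OS, KV cache, and activation memory, which vary with context length:

```python
# Rough fit check for the sizing advice above (an illustration, not a
# guarantee). Weight sizes are from the Files table; headroom is assumed.
GIB = 2**30
weights = {"q2": 80.8 * GIB, "q4": 153.3 * GIB}
mtp = 3.6 * GIB
headroom = 16 * GIB  # assumed: OS + KV cache + activations, varies with --ctx

for ram_gb in (128, 256):
    for variant, w in weights.items():
        fits = w + mtp + headroom <= ram_gb * GIB
        print(f"{variant} + mtp on {ram_gb} GB: {'fits' if fits else 'does not fit'}")
# q2 + mtp fits on 128 GB with room to spare; q4 needs the 256 GB tier.
```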
## Quantization recipe
The filename is the spec. In detail, for the q2 file:
| Tensor class | Quant | Notes |
|---|---|---|
| `blk.*.ffn_gate_exps`, `blk.*.ffn_up_exps` | IQ2_XXS | routed-expert up/gate |
| `blk.*.ffn_down_exps` | Q2_K | routed-expert down (K-quant for quality) |
| `blk.*.ffn_{gate,up,down}_shexp` | Q8_0 | shared experts |
| `blk.*.attn_q_a`, `attn_q_b`, `attn_kv`, `attn_output_a`, `attn_output_b` | Q8_0 | all attention projections (MLA + low-rank output) |
| `output.weight` | Q8_0 | output head |
| `token_embd.weight` | F16 | input embedding |
| `blk.*.ffn_gate_inp` (router) | F16 | learned router |
| `blk.*.exp_probs_b` (router bias), `blk.*.attn_sinks`, all `*_norm.weight` | F32 | |
| `blk.*.ffn_gate_tid2eid` | I32 | hash-routing tables (first 3 layers only) |
| `blk.*.attn_compressor_*`, `blk.*.indexer_*`, `blk.*.hc_*`, `blk.*.output_hc_*` | F16 / F32 | DSv4-specific auxiliary blocks |
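If you want to sanity-check a downloaded file against this table, here is a minimal sketch using the `gguf` Python package that ships with llama.cpp (`pip install gguf`). The pattern list is a partial transcription of the table above (attention and auxiliary classes omitted for brevity), illustrative rather than authoritative:

```python
# Check a GGUF's per-tensor quant types against the q2 recipe table.
# Requires the `gguf` package from the llama.cpp repo (pip install gguf).
from fnmatch import fnmatch
from gguf import GGUFReader

# First matching pattern wins; a trailing "*" absorbs suffixes like ".weight".
Q2_RULES = [
    ("blk.*.ffn_gate_exps*",   "IQ2_XXS"),
    ("blk.*.ffn_up_exps*",     "IQ2_XXS"),
    ("blk.*.ffn_down_exps*",   "Q2_K"),
    ("blk.*.ffn_*_shexp*",     "Q8_0"),
    ("output.weight",          "Q8_0"),
    ("token_embd.weight",      "F16"),
    ("blk.*.ffn_gate_inp*",    "F16"),
    ("blk.*.ffn_gate_tid2eid", "I32"),
    ("*_norm.weight",          "F32"),
]

reader = GGUFReader("DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf")
for t in reader.tensors:
    expected = next((q for pat, q in Q2_RULES if fnmatch(t.name, pat)), None)
    if expected is not None and t.tensor_type.name != expected:
        print(f"MISMATCH {t.name}: {t.tensor_type.name} (expected {expected})")
```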
For the q4 file, only the three routed-expert classes change to Q4_K. Everything else is byte-for-byte identical to the q2 recipe.
The motivation behind the asymmetry: the routed experts are the majority of the parameter count, but each individual expert handles only a fraction of tokens, so aggressive quantization there costs less in average quality than the same treatment of the router, projections, or shared experts. Keeping the decision-making components at Q8_0 preserves model behavior; crushing the experts buys the size.
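A back-of-envelope check of that claim, using the two file sizes above and the usual llama.cpp bits-per-weight figures; the assumption that gate/up/down expert matrices hold equal parameter counts is mine:

```python
# Back out the routed-expert parameter count from the q2/q4 size gap.
# bpw figures are standard llama.cpp values: IQ2_XXS ~2.0625, Q2_K ~2.5625,
# Q4_K ~4.5. Assumes gate/up/down expert matrices are equally sized.
GIB_BITS = 8 * 2**30
delta_bits = (153.3 - 80.8) * GIB_BITS   # only the routed experts differ

bpw_q2 = (2 * 2.0625 + 2.5625) / 3       # gate+up at IQ2_XXS, down at Q2_K
bpw_q4 = 4.5                             # all three at Q4_K
routed_params = delta_bits / (bpw_q4 - bpw_q2)

print(f"~{routed_params / 1e9:.0f}B routed-expert parameters")            # ~274B
print(f"q2 spends ~{routed_params * bpw_q2 / GIB_BITS:.0f} GiB on them")  # ~71 GiB
```

So, under these assumptions, roughly 71 of the q2 file's 80.8 GiB goes to routed experts, which illustrates why the asymmetric recipe is where the savings are.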
## Usage
```sh
git clone https://github.com/antirez/ds4
cd ds4
./download_model.sh q2   # 128 GB RAM machines
./download_model.sh q4   # >= 256 GB RAM machines
./download_model.sh mtp  # optional MTP / speculative decoding
make
./ds4 -p "Explain Redis streams in one paragraph."
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
```
The `download_model.sh` script fetches from this repository, resumes partial downloads, and points `./ds4flash.gguf` at the selected variant.
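For reference, a minimal Python sketch of the same fetch-and-link flow; the repo id is a placeholder assumption, and `huggingface_hub` handles caching and resumption on its own:

```python
# Hedged sketch of what download_model.sh does: fetch one variant and point
# ./ds4flash.gguf at it. REPO_ID is a placeholder -- use this card's repo id.
import os
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

REPO_ID = "<this-repo-id>"  # placeholder assumption
VARIANTS = {
    "q2":  "DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf",
    "q4":  "DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf",
    "mtp": "DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf",
}

def fetch(variant: str) -> None:
    # hf_hub_download caches downloads and resumes partial ones by itself.
    path = hf_hub_download(repo_id=REPO_ID, filename=VARIANTS[variant])
    if variant != "mtp":  # how the script links the MTP file is not shown here
        if os.path.lexists("ds4flash.gguf"):
            os.remove("ds4flash.gguf")
        os.symlink(path, "ds4flash.gguf")

fetch("q2")
```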
## License
MIT. The base model copyright is held by DeepSeek; the GGUFs are redistributed under the base model's release terms.