DeepSeek V4 Flash — GGUF for ds4

These quants are specific to the DS4 inference engine. They may or may not work with other inference engines (they should, except for the MTP model, which requires a dedicated loader).

https://github.com/antirez/ds4

Files

| File | Size | Routed experts (ffn_{gate,up,down}_exps) | Everything else |
|---|---|---|---|
| DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf | 80.8 GiB | IQ2_XXS (gate, up) + Q2_K (down) | Q8_0 attn proj / shared experts / output; F16 router + embed + indexer + compressor + HC; F32 norms / sinks / bias |
| DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf | 153.3 GiB | Q4_K (all three) | Same as above |
| DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf | 3.6 GiB | MTP / speculative-decoding support (optional, not standalone) | |

Use q2 on 128 GB machines (e.g. a 128 GB Mac), q4 on machines with ≥ 256 GB of RAM; pair either with the MTP file for optional speculative decoding.
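
If you prefer to grab a single variant directly from the Hub instead of using the download_model.sh helper described below, something along these lines should work (a sketch using the standard huggingface-cli tool, not part of ds4; the q2 filename is taken from the table above):

# Optional: fetch one variant straight from the Hub (sketch)
pip install -U "huggingface_hub[cli]"
huggingface-cli download antirez/deepseek-v4-gguf \
    DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf \
    --local-dir .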

Quantization recipe

The filename is the spec. In detail, for the q2 file:

| Tensor class | Quant | Notes |
|---|---|---|
| blk.*.ffn_gate_exps, blk.*.ffn_up_exps | IQ2_XXS | routed-expert up/gate |
| blk.*.ffn_down_exps | Q2_K | routed-expert down (K-quant for quality) |
| blk.*.ffn_{gate,up,down}_shexp | Q8_0 | shared experts |
| blk.*.attn_q_a, attn_q_b, attn_kv, attn_output_a, attn_output_b | Q8_0 | all attention projections (MLA + low-rank output) |
| output.weight | Q8_0 | output head |
| token_embd.weight | F16 | input embedding |
| blk.*.ffn_gate_inp (router) | F16 | learned router |
| blk.*.exp_probs_b (router bias), blk.*.attn_sinks, all *_norm.weight | F32 | |
| blk.*.ffn_gate_tid2eid | I32 | hash-routing tables (first 3 layers only) |
| blk.*.attn_compressor_*, blk.*.indexer_*, blk.*.hc_*, blk.*.output_hc_* | F16 / F32 | DSv4-specific auxiliary blocks |

For the q4 file, only the three routed-expert classes change to Q4_K. Everything else is byte-for-byte identical to the q2 recipe.
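
If you want to verify which quant a given tensor class ended up with (for example, to confirm that only the routed experts differ between q2 and q4), the gguf-dump tool from the Python gguf package can list every tensor with its type. This is an independent inspection sketch, not part of the ds4 tooling:

# Inspect per-tensor quant types (sketch; assumes gguf-dump's default output
# includes the tensor list, as in recent versions of the gguf package)
pip install gguf
gguf-dump DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf | grep ffn_down_exps
# Expected for the q2 file: the down experts report Q2_K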

The motivation behind the asymmetry: the routed experts account for most of the parameter count, but each individual expert handles only a fraction of the tokens, so aggressive quantization there costs less in average quality than the same treatment of the router, projections, or shared experts. Keeping the decision-making components at Q8_0 preserves model behavior; crushing the experts is what buys the size reduction.

Usage

git clone https://github.com/antirez/ds4
cd ds4
./download_model.sh q2     # 128 GB RAM machines
./download_model.sh q4     # >= 256 GB RAM machines
./download_model.sh mtp    # optional MTP / speculative decoding
make

./ds4 -p "Explain Redis streams in one paragraph."
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

The download_model.sh script fetches from this repository, resumes partial downloads, and points ./ds4flash.gguf at the selected variant.
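
For reference, a minimal sketch of roughly what the helper does for the q2 variant (the script itself is the source of truth; the resolve URL below just follows the standard Hugging Face download pattern):

# Rough equivalent of ./download_model.sh q2 (sketch, not the actual script)
F=DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf
curl -L -C - -o "$F" "https://huggingface.co/antirez/deepseek-v4-gguf/resolve/main/$F"   # -C - resumes partial downloads
ln -sf "$F" ./ds4flash.gguf   # point ds4flash.gguf at the selected variant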

License

MIT. The base model copyright is held by DeepSeek; the GGUFs are redistributed under the base model's release terms.
