DeepSeek V4 Pro · GGUF
GGUF quantizations of deepseek-ai/DeepSeek-V4-Pro for use with the V4-aware llama.cpp fork at cchuter/llama.cpp @ feat/v4-port-cuda.
⚠️ These quants will not load on upstream ggml-org/llama.cpp. The V4 architecture, the V4-specific Metal and CUDA kernels, and the f16 KV pin are only present in the fork above. Upstreaming is gated on the V3.2/DSA PR (#21149) landing first.
🖥️ Supported backends: Apple Silicon (Metal), NVIDIA CUDA (Ada/Blackwell), and CPU. All 5 V4 custom ops (`ggml_dsv4_rope_tail`, `ggml_dsv4_hc_split_sinkhorn`, `ggml_dsv4_hc_weighted_sum`, `ggml_dsv4_hc_expand`, `ggml_dsv4_fp8_kv_quantize`) have both Metal and CUDA kernels in this fork (validated 19/19 on RTX 5090, CUDA 12.8, SM_120 native). The CUDA FP8 path is gated behind `__CUDA_ARCH__ >= 890`; older NVIDIA hardware (Volta/Turing/Ampere) uses a software-emulated FP8 path that builds cleanly under `-DCMAKE_CUDA_ARCHITECTURES=70` but hasn't been runtime-validated yet. CUDA testers wanted — file issues at the fork if you hit problems. V4 Pro's size also means most quants need multi-GPU or CPU+GPU partial offload; see the size note below. ROCm / Vulkan / Metal-on-AMD have no V4 kernels and will fail at the first dsv4 op.
📐 V4 Pro is much larger than V4 Flash (61 layers × 384 routed experts; ~1.5 TiB BF16-experts-Q8 staging GGUF vs ~282 GiB Q8 for Flash). Even Q2_K-XL of Pro (~498 GiB, roughly 535 GB) exceeds the 512 GB unified RAM of a single Mac Studio — inference works via CPU mmap but pages heavily. Practical fit on a single Studio is limited to the smaller K-quants.
Available quants
| Quant | Size | BPW | Shards | Decode (M3 Ultra) | gate-tools | Notes |
|---|---|---|---|---|---|---|
| Q8_0 | ~1.46 TiB | 8.50 | 30 | build-validated only | not run | Reference. Exceeds 512 GiB unified RAM by ~3× — needs a host with 1.5 TiB RAM or heavy swap. |
| Q4_K_M-XL | ~828 GiB | 4.85 | 21 | build-validated only | not run | K-quant body, V4-specific tensors pinned at Q8_0. Recommended if you have ~1 TiB RAM; otherwise pages from disk. |
| Q2_K-XL | ~498 GiB | 2.90 | 13 | ~0.27 t/s prompt eval, ~0.18 t/s generation (CPU mmap, -ngl 0) | ✓ pass | XL-pinned K-quant. Tested: loads, runs, returns valid tool_calls for the V4 fork's tests/v4-port/tool-call-fixture.json ("What is the weather in Paris?" → get_weather({"city":"Paris"})). Fits CPU mmap path on 512 GiB Studio without OOM; recommended single-Studio variant. |
| imatrix/dsml.jinja | ~5 KiB | — | — | — | — | DSML chat template — pass via `--chat-template-file` for any quant whose shard 1 lacks the baked template. (All three quants here have it injected.) |
The `-XL` suffix means non-expert tensors (output_tensor, token_embd, attention projections, attention compressors, hyper-connection mixers, lightning indexer, NextN heads) are pinned at Q8_0; only the routed and shared experts use the named quant body. Same recipe as the V4 Flash fork's `-XL` variants.
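For orientation, here is a hedged sketch of how the -XL pin could map onto a llama-quantize invocation, assuming the fork keeps upstream llama-quantize's `--tensor-type PATTERN=TYPE` overrides; the tensor-name patterns and filenames are illustrative, and the authoritative recipe is the one listed under Provenance.

```bash
# Illustrative only — not the exact command used for this release.
# Assumes the fork's llama-quantize accepts upstream-style --tensor-type
# overrides and that these patterns match the V4 tensor names.
# Input/output filenames are placeholders (the staging GGUF is not published).
./build/bin/llama-quantize \
  --tensor-type output_hc=q8_0 \
  --tensor-type attn_compressor=q8_0 \
  --tensor-type attn_q_a=q8_0 --tensor-type attn_q_b=q8_0 \
  --tensor-type attn_kv=q8_0 \
  --tensor-type attn_output_a=q8_0 --tensor-type attn_output_b=q8_0 \
  --tensor-type hc_attn=q8_0 --tensor-type hc_ffn=q8_0 \
  --tensor-type indexer=q8_0 --tensor-type nextn=q8_0 \
  DeepSeek-V4-Pro-BF16-expertsQ8.gguf \
  DeepSeek-V4-Pro-Q2_K-XL.gguf \
  Q2_K
```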
Why no IQ-class quants in this release. V4 Pro's compressed-attention decode path generates a graph too large to fit Metal's `recommendedMaxWorkingSetSize` on M3 Ultra (487 GiB) when the model is also on Metal — both `-ngl 999` and `-ngl 25` partial offload OOM during the first command buffer. CPU-only `llama-imatrix` runs at ~0.79 t/s prompt eval, so a single 4096-token chunk takes ~85 minutes and 1000 chunks would take ~60 days. `--cpu-moe` (experts on CPU, rest on Metal) hangs at the load-tensors stage. Without a working imatrix, `IQ1_*`/`IQ2_*` quants cannot be built (the converter requires it for `output_hc_fn.weight`). On a host with ≥1.5 TiB unified RAM (or split-machine inference), the IQ-class ladder should be reachable; this release is the K-quant slice that builds end-to-end on a single 512 GiB Studio.
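If you do have the RAM to attempt the imatrix, a minimal sketch of the CPU-only run described above — the staging-GGUF filename and calibration file are placeholders, and the chunk count just matches the estimate in the note:

```bash
# CPU-only imatrix pass — ~0.79 t/s prompt eval on M3 Ultra makes this
# impractical there, but it should be viable on a >=1.5 TiB host.
# calibration.txt and the model filename are placeholders.
./build/bin/llama-imatrix \
  -m DeepSeek-V4-Pro-BF16-expertsQ8.gguf \
  -f calibration.txt \
  -o imatrix-v4-pro.dat \
  -ngl 0 --chunks 1000
```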
Loading
```bash
# Clone the V4-aware fork
git clone -b feat/v4-port-cuda https://github.com/cchuter/llama.cpp
cd llama.cpp

# Build for Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON && cmake --build build -j

# OR build for NVIDIA CUDA (Ada/Blackwell; CUDA toolkit >= 12.8 for native SM_120)
# cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="89;120" && cmake --build build -j
#
# V4 Pro almost always needs multi-GPU (sizes start at 498 GiB for Q2_K-XL).
# For multi-GPU CUDA, add `-DCMAKE_CXX_FLAGS=-DGGML_SCHED_MAX_SPLIT_INPUTS=128`
# to the cmake line — V4's dense per-layer inputs exceed the upstream
# scheduler default of 30 at multi-device split boundaries. Cost: ~200 MB
# extra scheduler memory.

# Download a quant that fits your RAM/disk budget
hf download teamblobfish/DeepSeek-V4-Pro-GGUF \
  --include "Q2_K-XL/*" \
  --local-dir ~/models/DeepSeek-V4-Pro-GGUF

# Run server (point at first shard; auto-loads the rest)
./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/Q2_K-XL/DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.gguf \
  --jinja \
  --reasoning off \
  --ctx-size 65536 \
  --n-gpu-layers 0 \
  --no-repack \
  --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0
```
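Once the server is up (default port 8080), a quick smoke test against the OpenAI-compatible endpoint. The `get_weather` schema below is an illustrative stand-in for the fork's `tests/v4-port/tool-call-fixture.json`, not a copy of it:

```bash
# Minimal tool-call smoke test; a healthy setup returns a tool_calls entry
# invoking get_weather with {"city":"Paris"}.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "temperature": 1.0
  }' | jq '.choices[0].message'
```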
⚙️ `-ngl` choice on M3 Ultra (512 GiB): `-ngl 0` (CPU mmap) is the only configuration that loads V4 Pro Q2_K-XL / Q4_K_M-XL / Q8_0 cleanly without Metal OOM. Partial Metal offload (`-ngl 1..N`) hits `kIOGPUCommandBufferCallbackErrorOutOfMemory` during graph compute — V4's compressor decode path allocates intermediate buffers Metal can't satisfy when most weights are also Metal-resident. Full Metal (`-ngl 999`) only fits if the quant is below ~480 GiB total. Hosts with multiple GPUs or split-tensor offload across machines should work as expected.
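For hosts that do have enough aggregate VRAM, a hedged sketch of a multi-GPU CUDA launch, assuming the fork keeps upstream's `-ngl` / `--tensor-split` semantics (not validated on Pro in this release):

```bash
# Hypothetical 4-GPU launch — only meaningful if combined VRAM covers the
# quant plus KV cache. Split ratios follow upstream --tensor-split semantics;
# remember the GGML_SCHED_MAX_SPLIT_INPUTS build flag from the build section.
./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/Q2_K-XL/DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.gguf \
  --jinja --reasoning off --ctx-size 65536 \
  -ngl 999 \
  --tensor-split 1,1,1,1
```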
⚙️ `-cmoe` (CPU MoE) on CUDA hosts — stick with `-ub 128`. `-cmoe` overrides MoE weights to CPU but doesn't directly control where the op runs. CUDA's `op_offload` defaults to `true`, and the CUDA backend offloads host-weight ops to GPU when `batch_size ≥ 32` (see `ggml/src/ggml-cuda/ggml-cuda.cu`). Compute buffers are sized for the peak liveness of `n_ubatch`-token graphs, so doubling `-ub` roughly doubles the GPU compute buffer. Reported by @fairydreaming running V4 Pro Q4_K_M on an RTX PRO 6000 Max-Q (96 GB, ~35 GB post-load headroom): `-ub 128` fits; `-ub 512` OOMs at load. Two options (example invocations after this list):

- Recommended: keep `-ub 128` for `-cmoe` runs on V4 Pro — best perf in this configuration.
- Or pass `--op-offload false` to keep MoE compute truly on CPU regardless of ubatch — smaller GPU compute buffer, but slower if your CPU memory bandwidth is the bottleneck.
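The two options as concrete invocations — the Q4_K_M-XL shard name is assumed to follow the Q2_K-XL naming pattern shown earlier, and the flags are taken from the note above rather than re-validated here:

```bash
# Option 1 (recommended): MoE weights on CPU, small ubatch so the CUDA
# compute buffer stays inside the post-load VRAM headroom.
./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/Q4_K_M-XL/DeepSeek-V4-Pro-Q4_K_M-XL-00001-of-00021.gguf \
  --jinja --reasoning off --ctx-size 65536 \
  -ngl 999 -cmoe -ub 128

# Option 2: keep MoE compute on the CPU regardless of ubatch size
# (smaller GPU compute buffer; slower if CPU memory bandwidth is the limit).
./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/Q4_K_M-XL/DeepSeek-V4-Pro-Q4_K_M-XL-00001-of-00021.gguf \
  --jinja --reasoning off --ctx-size 65536 \
  -ngl 999 -cmoe --op-offload false
```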
Sampling values match the model card recommendation (temperature=1.0, top_p=1.0); `--reasoning off` is the cleanest baseline for agent workloads.
Quirks worth knowing
- `--cache-type-k|v q8_0` is silently overridden to f16 on V4. Inherited V4 Flash quirk — V4's K is FP8-quantized at write time, breaking q8_0's per-block stationarity assumption.
- `--no-repack` is required for V4 quants in CPU mode on hosts smaller than ~600 GiB RAM. Inherited V4 Flash quirk.
- `graph_max_nodes` was bumped in this fork from 524288 → 2097152 to fit V4 Pro's wider compressor decode path. Older V4 builds will GGML_ASSERT on `dsv4_build_compressor_decode_projected → ggml_set_rows` when loading any Pro quant.
- `convert_hf_to_gguf.py --use-temp-file` is required for V4 Pro. Without it, the in-memory tensor buffer exceeds 512 GiB RAM and the converter is killed by Jetsam on macOS.
- Validation gates: `tests/v4-port/run-all-gates.sh` in the fork. Per-quant gate-tools runs were skipped on this release because every load takes ~10 min on Pro at 512 GiB RAM; users with more RAM should re-run the gates locally.
Provenance
- Source: `deepseek-ai/DeepSeek-V4-Pro` HF safetensors (FP8 e4m3 weights, FP4 routed experts).
- bf16-experts-Q8 staging GGUF (not published): built via `convert_hf_to_gguf.py --outtype bf16 --deepseek4-expert-outtypes "w1=q8_0,w2=q8_0,w3=q8_0" --use-temp-file --deepseek4-expert-workers 16`. Used as the source for Q2_K-XL and Q4_K_M-XL.
- Q8_0: built via `llama-quantize` from the bf16-experts-Q8 staging GGUF (the runbook's safetensors → Q8_0 path was avoided for disk reasons; Q8 from BF16 has the same quant-hop count as Q8 from safetensors). No imatrix used (Q8 doesn't benefit).
- Q4_K_M-XL / Q2_K-XL: produced via `llama-quantize` with the V4 fork's V4-tensor pin recipe (`output_hc=q8_0, attn_compressor_*=q8_0, attn_q_a/b, attn_kv, attn_output_a/b, hc_attn=q8_0, hc_ffn=q8_0, indexer=q8_0, nextn=q8_0`). No imatrix — all three K-quants here build cleanly without it (only IQ-class quants strictly require it).
- Chat template: baked into shard 1 of every quant via `gguf-py/gguf/scripts/gguf_new_metadata.py --chat-template "$(cat dsml.jinja)"` after split (sketch below).
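A sketch of that chat-template bake, assuming `gguf_new_metadata.py`'s usual positional input/output arguments; the shard name is the Q2_K-XL one from this repo and the output filename is a placeholder:

```bash
# Writes a copy of shard 1 with the DSML template injected into
# tokenizer.chat_template; the remaining shards are left untouched.
python gguf-py/gguf/scripts/gguf_new_metadata.py \
  DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.gguf \
  DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.template.gguf \
  --chat-template "$(cat imatrix/dsml.jinja)"
```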
License
MIT, matching the upstream DeepSeek V4 Pro license.