DeepSeek V4 Pro · GGUF

GGUF quantizations of deepseek-ai/DeepSeek-V4-Pro for use with the V4-aware llama.cpp fork at cchuter/llama.cpp @ feat/v4-port-cuda.

⚠️ These quants will not load on upstream ggml-org/llama.cpp. V4 architecture, V4-specific Metal and CUDA kernels, and the f16 KV pin are only present in the fork above. Upstreaming is gated on the V3.2/DSA PR (#21149) landing first.

🖥️ Supported backends: Apple Silicon (Metal), NVIDIA CUDA (Ada/Blackwell), and CPU. All 5 V4 custom ops (ggml_dsv4_rope_tail, ggml_dsv4_hc_split_sinkhorn, ggml_dsv4_hc_weighted_sum, ggml_dsv4_hc_expand, ggml_dsv4_fp8_kv_quantize) have Metal kernels AND CUDA kernels in this fork (validated 19/19 on RTX 5090, CUDA 12.8, SM_120 native). The CUDA FP8 path is gated behind __CUDA_ARCH__ >= 890; older NVIDIA hardware (Volta/Turing/Ampere) uses a software-emulated FP8 path that builds cleanly under -DCMAKE_CUDA_ARCHITECTURES=70 but hasn't been runtime-validated yet. CUDA testers wanted — file issues at the fork if you hit problems. V4 Pro's size also means most quants need multi-GPU or CPU+GPU partial offload; see size note below. ROCm / Vulkan / Metal-on-AMD have no V4 kernels and will fail at the first dsv4 op.
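
For Volta/Turing/Ampere hosts, a minimal configure sketch for the software-emulated FP8 path mentioned above (untested at runtime per the note; run inside a checkout of the fork from the Loading section below, and adjust the architecture value for your GPU):

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=70
cmake --build build -j --target llama-server llama-cli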

📐 V4 Pro is much larger than V4 Flash (61 layers × 384 routed experts; ~1.5 TiB BF16-experts-Q8 staging GGUF vs ~282 GiB Q8 for Flash). Even Q2_K-XL of Pro at 535 GiB exceeds 512 GiB unified RAM on a single Mac Studio — inference works but pages heavily. Practical fit on a single Studio is the smaller K-quants.

Available quants

| Quant | Size | BPW | Shards | Decode (M3 Ultra) | gate-tools | Notes |
|---|---|---|---|---|---|---|
| Q8_0 | ~1.46 TiB | 8.50 | 30 | build-validated only | not run | Reference. Roughly 3× the 512 GiB of unified RAM; needs a host with 1.5 TiB RAM or heavy swap. |
| Q4_K_M-XL | ~828 GiB | 4.85 | 21 | build-validated only | not run | K-quant body with the V4-specific tensors pinned at Q8_0. Recommended if you have ~1 TiB RAM; otherwise it pages from disk. |
| Q2_K-XL | ~498 GiB | 2.90 | 13 | ~0.27 t/s prompt eval, ~0.18 t/s generation (CPU mmap, -ngl 0) | ✓ pass | XL-pinned K-quant. Tested: loads, runs, and returns valid tool_calls for the V4 fork's tests/v4-port/tool-call-fixture.json ("What is the weather in Paris?" → get_weather({"city":"Paris"})). Fits the CPU mmap path on a 512 GiB Studio without OOM; recommended single-Studio variant. |
| imatrix/dsml.jinja | ~5 KiB | | | | | DSML chat template; pass it via --chat-template-file for any quant whose shard 1 lacks the baked template. (All three quants here have it injected.) |
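
If a shard set lacks the baked template, pass the repo's dsml.jinja to the server explicitly; a sketch, assuming the --local-dir layout used in Loading below and that imatrix/dsml.jinja was downloaded alongside the shards:

./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/Q2_K-XL/DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.gguf \
  --jinja \
  --chat-template-file ~/models/DeepSeek-V4-Pro-GGUF/imatrix/dsml.jinja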

The -XL suffix means the non-expert tensors (output_tensor, token_embd, attention projections, attention compressors, hyper-connection mixers, lightning indexer, NextN heads) are pinned at Q8_0; only the routed and shared experts use the named quant body. This is the same recipe as the V4 Flash fork's -XL variants.

Why no IQ-class quants in this release. V4 Pro's compressed-attention decode path generates a graph too large to fit Metal's recommendedMaxWorkingSetSize on M3 Ultra (487 GiB) when the model is also on Metal — both -ngl 999 and -ngl 25 partial-offload OOM during the first command buffer. CPU-only llama-imatrix runs at ~0.79 t/s prompt eval, and a single 4096-token chunk would take ~85 minutes; 1000 chunks is ~25 days. --cpu-moe (experts on CPU, rest on Metal) hangs at the load-tensors stage. Without a working imatrix, IQ1_*/IQ2_* quants cannot be built (the converter requires it for output_hc_fn.weight). On a host with ≥1.5 TiB unified RAM (or split-machine inference), the IQ-class ladder should be reachable; this release is the K-quant slice that builds end-to-end on a single 512 GiB Studio.

Loading

# Clone the V4-aware fork
git clone -b feat/v4-port-cuda https://github.com/cchuter/llama.cpp
cd llama.cpp

# Build for Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON && cmake --build build -j

# OR build for NVIDIA CUDA (Ada/Blackwell; CUDA toolkit >= 12.8 for native SM_120)
# cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="89;120" && cmake --build build -j
#
# V4 Pro almost always needs multi-GPU (sizes start at 498 GiB for Q2_K-XL).
# For multi-GPU CUDA, add `-DCMAKE_CXX_FLAGS=-DGGML_SCHED_MAX_SPLIT_INPUTS=128`
# to the cmake line — V4's dense per-layer inputs exceed the upstream
# scheduler default of 30 at multi-device split boundaries. Cost: ~200 MB
# extra scheduler memory.
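#
# Example combined configure for a multi-GPU CUDA host (a sketch that merges
# the CUDA options above with the split-inputs bump):
# cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="89;120" \
#   -DCMAKE_CXX_FLAGS=-DGGML_SCHED_MAX_SPLIT_INPUTS=128
# cmake --build build -j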

# Download a quant that fits your RAM/disk budget
hf download teamblobfish/DeepSeek-V4-Pro-GGUF \
  --include "Q2_K-XL/*" \
  --local-dir ~/models/DeepSeek-V4-Pro-GGUF

# Run server (point at first shard; auto-loads the rest)
./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/Q2_K-XL/DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.gguf \
  --jinja \
  --reasoning off \
  --ctx-size 65536 \
  --n-gpu-layers 0 \
  --no-repack \
  --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0
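
# Smoke-test the OpenAI-compatible endpoint once the server is up (sketch;
# default port 8080, tool schema inferred from the fork's tool-call fixture
# described in the quant table above):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{"type": "function", "function": {"name": "get_weather",
      "parameters": {"type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]}}}]
  }'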

⚙️ -ngl choice on M3 Ultra (512 GiB): -ngl 0 (CPU mmap) is the only configuration that loads V4 Pro Q2_K-XL/Q4_K_M-XL/Q8_0 cleanly without Metal OOM. Partial Metal offload (-ngl 1..N) hits kIOGPUCommandBufferCallbackErrorOutOfMemory during graph compute — V4's compressor decode path allocates intermediate buffers Metal can't satisfy when most weights are also Metal-resident. Full Metal (-ngl 999) only fits if the quant is below ~480 GiB total. Hosts with multiple GPUs / split-tensor offload across machines should work as expected.

⚙️ -cmoe (CPU MoE) on CUDA hosts — stick with -ub 128. -cmoe overrides MoE weights to CPU but doesn't directly control where the op runs. CUDA's op_offload defaults to true, and the CUDA backend offloads host-weight ops to GPU when batch_size ≥ 32 (see ggml/src/ggml-cuda/ggml-cuda.cu). Compute buffers are sized peak-liveness for n_ubatch-token graphs, so doubling -ub roughly doubles the GPU compute buffer. Reported by @fairydreaming running V4 Pro Q4_K_M on an RTX PRO 6000 Max-Q (96 GB, ~35 GB post-load headroom): -ub 128 fits; -ub 512 OOMs at load. Two options:

  • Recommended: keep -ub 128 for -cmoe runs on V4 Pro — best perf in this configuration.
  • Or pass --op-offload false to keep MoE compute truly on CPU regardless of ubatch — smaller GPU compute buffer, but slower if your CPU memory bandwidth is the bottleneck.
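
A minimal sketch of the recommended option above (the shard path and context size are placeholders; the Q4_K_M-XL shard name is assumed to follow the Q2_K-XL pattern shown earlier):

./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/Q4_K_M-XL/DeepSeek-V4-Pro-Q4_K_M-XL-00001-of-00021.gguf \
  -ngl 999 -cmoe -ub 128 \
  --jinja --reasoning off \
  --ctx-size 65536 \
  --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0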

Sampling values match the model card recommendation (temperature=1.0, top_p=1.0); --reasoning off is the cleanest baseline for agent workloads.

Quirks worth knowing

  • --cache-type-k|v q8_0 is silently overridden to f16 on V4. Inherited V4 Flash quirk — V4's K is FP8-quantized at write time, breaking q8_0's per-block stationarity assumption.
  • --no-repack is required for V4 quants in CPU mode on hosts smaller than ~600 GiB RAM. Inherited V4 Flash quirk.
  • graph_max_nodes was bumped in this fork from 524288 → 2097152 to fit V4 Pro's wider compressor decode path. Older V4 builds will GGML_ASSERT on dsv4_build_compressor_decode_projected → ggml_set_rows when loading any Pro quant.
  • convert_hf_to_gguf.py --use-temp-file is required for V4 Pro. Without it, the in-memory tensor buffer exceeds 512 GiB RAM and the converter is killed by Jetsam on macOS.
  • Validation gates: tests/v4-port/run-all-gates.sh in the fork. Per-quant gate-tools runs were skipped on this release because every load is ~10 min on Pro at 512 GiB RAM; users with more RAM should re-run gates locally.
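
If you do have the RAM to spare, re-running the gates against a local build looks roughly like this (a sketch; the script's exact arguments and environment requirements are not documented here, so check its header first):

cd llama.cpp
bash tests/v4-port/run-all-gates.sh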

Provenance

  • Source: deepseek-ai/DeepSeek-V4-Pro HF safetensors (FP8 e4m3 weights, FP4 routed experts).
  • bf16-experts-Q8 staging GGUF (not published): built via convert_hf_to_gguf.py --outtype bf16 --deepseek4-expert-outtypes "w1=q8_0,w2=q8_0,w3=q8_0" --use-temp-file --deepseek4-expert-workers 16. Used as the source for Q2_K-XL and Q4_K_M-XL.
  • Q8_0: built via llama-quantize from the bf16-experts-Q8 staging GGUF (the runbook's safetensors → Q8_0 path was avoided for disk reasons; Q8 from BF16 has the same quant-hop count as Q8 from safetensors). No imatrix used (Q8 doesn't benefit).
  • Q4_K_M-XL / Q2_K-XL: produced via llama-quantize with the V4 fork's V4-tensor pin recipe (output_hc=q8_0, attn_compressor_*=q8_0, attn_q_a/b, attn_kv, attn_output_a/b, hc_attn=q8_0, hc_ffn=q8_0, indexer=q8_0, nextn=q8_0). No imatrix — all three K-quants here build cleanly without it (only IQ-class quants strictly require it).
  • Chat template: baked into shard 1 of every quant via gguf-py/gguf/scripts/gguf_new_metadata.py --chat-template "$(cat dsml.jinja)" after split.
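
For reference, an end-to-end sketch of the pipeline described above (paths and output names are placeholders; the -XL tensor-pin and shard-split invocations are fork/runbook specifics not reproduced here):

python convert_hf_to_gguf.py ~/models/DeepSeek-V4-Pro \
  --outtype bf16 \
  --deepseek4-expert-outtypes "w1=q8_0,w2=q8_0,w3=q8_0" \
  --use-temp-file --deepseek4-expert-workers 16 \
  --outfile DeepSeek-V4-Pro-bf16-expertsQ8.gguf

# Reference Q8_0 from the staging GGUF (no imatrix)
./build/bin/llama-quantize DeepSeek-V4-Pro-bf16-expertsQ8.gguf DeepSeek-V4-Pro-Q8_0.gguf Q8_0

# After splitting into shards, bake the chat template into shard 1
python gguf-py/gguf/scripts/gguf_new_metadata.py \
  DeepSeek-V4-Pro-Q8_0-00001-of-00030.gguf DeepSeek-V4-Pro-Q8_0-00001-of-00030.retpl.gguf \
  --chat-template "$(cat dsml.jinja)"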

License

MIT, matching the upstream DeepSeek V4 Pro license.
