DeepSeek V4 Pro · GGUF
GGUF quantizations of deepseek-ai/DeepSeek-V4-Pro for use with the V4-aware llama.cpp fork at cchuter/llama.cpp @ feat/v4-port-cuda.
⚠️ These quants will not load on upstream ggml-org/llama.cpp. The V4 architecture, the V4-specific Metal and CUDA kernels, and the f16 KV pin are only present in the fork above. Upstreaming is gated on the V3.2/DSA PR (#21149) landing first.
🖥️ Supported backends: Apple Silicon (Metal), NVIDIA CUDA (Ada/Blackwell), and CPU. All 5 V4 custom ops (`ggml_dsv4_rope_tail`, `ggml_dsv4_hc_split_sinkhorn`, `ggml_dsv4_hc_weighted_sum`, `ggml_dsv4_hc_expand`, `ggml_dsv4_fp8_kv_quantize`) have both Metal and CUDA kernels in this fork (validated 19/19 on RTX 5090, CUDA 12.8, SM_120 native). The CUDA FP8 path is gated behind `__CUDA_ARCH__ >= 890`; older NVIDIA hardware (Volta/Turing/Ampere) uses a software-emulated FP8 path that builds cleanly under `-DCMAKE_CUDA_ARCHITECTURES=70` but hasn't been runtime-validated yet. CUDA testers wanted — file issues at the fork if you hit problems. V4 Pro's size also means most quants need multi-GPU or CPU+GPU partial offload; see the size note below. ROCm / Vulkan / Metal-on-AMD have no V4 kernels and will fail at the first dsv4 op.
📐 V4 Pro is much larger than V4 Flash (61 layers × 384 routed experts; ~1.5 TiB BF16-experts-Q8 staging GGUF vs ~282 GiB Q8 for Flash). Even Q2_K-XL of Pro (~498 GiB, roughly 535 GB) exceeds the 512 GB unified RAM of a single Mac Studio — inference works via CPU mmap but pages heavily. Practical fit on a single Studio is limited to the smaller K-quants.
Available quants
| Quant | Size | BPW | Shards | Decode (M3 Ultra) | gate-tools | Notes |
|---|---|---|---|---|---|---|
| Q8_0 | ~1.46 TiB | 8.50 | 30 | build-validated only | not run | Reference. Exceeds 512 GiB unified RAM by ~3× — needs a host with 1.5 TiB RAM or heavy swap. |
| Q4_K_M-XL | ~828 GiB | 4.85 | 21 | build-validated only | not run | K-quant body, V4-specific tensors pinned at Q8_0. Recommended if you have ~1 TiB RAM; otherwise pages from disk. |
| Q2_K-XL | ~498 GiB | 2.90 | 13 | ~0.27 t/s prompt eval, ~0.18 t/s generation (CPU mmap, -ngl 0) | ✓ pass | XL-pinned K-quant. Tested: loads, runs, returns valid tool_calls for the V4 fork's tests/v4-port/tool-call-fixture.json ("What is the weather in Paris?" → get_weather({"city":"Paris"})). Fits CPU mmap path on 512 GiB Studio without OOM; recommended single-Studio variant. |
| imatrix/dsml.jinja | ~5 KiB | — | — | — | — | DSML chat template — pass via `--chat-template-file` for any quant whose shard 1 lacks the baked template. (All three quants here have it injected.) |
The `-XL` suffix means non-expert tensors (output_tensor, token_embd, attention projections, attention compressors, hyper-connection mixers, lightning indexer, NextN heads) are pinned at Q8_0; only the routed and shared experts use the named quant body. Same recipe as the V4 Flash fork's `-XL` variants.
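For orientation, here is a hedged sketch of how the -XL pin could map onto a llama-quantize invocation, assuming the fork keeps upstream llama-quantize's `--tensor-type PATTERN=TYPE` overrides; the tensor-name patterns and filenames are illustrative, and the authoritative recipe is the one listed under Provenance.

```bash
# Illustrative only — not the exact command used for this release.
# Assumes the fork's llama-quantize accepts upstream-style --tensor-type
# overrides and that these patterns match the V4 tensor names.
# Input/output filenames are placeholders (the staging GGUF is not published).
./build/bin/llama-quantize \
  --tensor-type output_hc=q8_0 \
  --tensor-type attn_compressor=q8_0 \
  --tensor-type attn_q_a=q8_0 --tensor-type attn_q_b=q8_0 \
  --tensor-type attn_kv=q8_0 \
  --tensor-type attn_output_a=q8_0 --tensor-type attn_output_b=q8_0 \
  --tensor-type hc_attn=q8_0 --tensor-type hc_ffn=q8_0 \
  --tensor-type indexer=q8_0 --tensor-type nextn=q8_0 \
  DeepSeek-V4-Pro-BF16-expertsQ8.gguf \
  DeepSeek-V4-Pro-Q2_K-XL.gguf \
  Q2_K
```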
Why no IQ-class quants in this release. V4 Pro's compressed-attention decode path generates a graph too large to fit Metal's `recommendedMaxWorkingSetSize` on M3 Ultra (487 GiB) when the model is also on Metal — both `-ngl 999` and `-ngl 25` partial offload OOM during the first command buffer. CPU-only `llama-imatrix` runs at ~0.79 t/s prompt eval, so a single 4096-token chunk takes ~85 minutes and 1000 chunks would take ~60 days. `--cpu-moe` (experts on CPU, rest on Metal) hangs at the load-tensors stage. Without a working imatrix, `IQ1_*`/`IQ2_*` quants cannot be built (the converter requires it for `output_hc_fn.weight`). On a host with ≥1.5 TiB unified RAM (or split-machine inference), the IQ-class ladder should be reachable; this release is the K-quant slice that builds end-to-end on a single 512 GiB Studio.
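If you do have the RAM to attempt the imatrix, a minimal sketch of the CPU-only run described above — the staging-GGUF filename and calibration file are placeholders, and the chunk count just matches the estimate in the note:

```bash
# CPU-only imatrix pass — ~0.79 t/s prompt eval on M3 Ultra makes this
# impractical there, but it should be viable on a >=1.5 TiB host.
# calibration.txt and the model filename are placeholders.
./build/bin/llama-imatrix \
  -m DeepSeek-V4-Pro-BF16-expertsQ8.gguf \
  -f calibration.txt \
  -o imatrix-v4-pro.dat \
  -ngl 0 --chunks 1000
```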
Loading
```bash
# Clone the V4-aware fork
git clone -b feat/v4-port-cuda https://github.com/cchuter/llama.cpp
cd llama.cpp

# Build for Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON && cmake --build build -j

# OR build for NVIDIA CUDA (Ada/Blackwell; CUDA toolkit >= 12.8 for native SM_120)
# cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="89;120" && cmake --build build -j
#
# V4 Pro almost always needs multi-GPU (sizes start at 498 GiB for Q2_K-XL).
# For multi-GPU CUDA, add `-DCMAKE_CXX_FLAGS=-DGGML_SCHED_MAX_SPLIT_INPUTS=128`
# to the cmake line — V4's dense per-layer inputs exceed the upstream
# scheduler default of 30 at multi-device split boundaries. Cost: ~200 MB
# extra scheduler memory.

# Download a quant that fits your RAM/disk budget
hf download teamblobfish/DeepSeek-V4-Pro-GGUF \
  --include "Q2_K-XL/*" \
  --local-dir ~/models/DeepSeek-V4-Pro-GGUF

# Run server (point at first shard; auto-loads the rest)
./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/Q2_K-XL/DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.gguf \
  --jinja \
  --reasoning off \
  --ctx-size 65536 \
  --n-gpu-layers 0 \
  --no-repack \
  --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0
```
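Once the server is up (default port 8080), a quick smoke test against the OpenAI-compatible endpoint. The `get_weather` schema below is an illustrative stand-in for the fork's `tests/v4-port/tool-call-fixture.json`, not a copy of it:

```bash
# Minimal tool-call smoke test; a healthy setup returns a tool_calls entry
# invoking get_weather with {"city":"Paris"}.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "temperature": 1.0
  }' | jq '.choices[0].message'
```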
⚙️ `-ngl` choice on M3 Ultra (512 GiB): `-ngl 0` (CPU mmap) is the only configuration that loads V4 Pro Q2_K-XL / Q4_K_M-XL / Q8_0 cleanly without Metal OOM. Partial Metal offload (`-ngl 1..N`) hits `kIOGPUCommandBufferCallbackErrorOutOfMemory` during graph compute — V4's compressor decode path allocates intermediate buffers Metal can't satisfy when most weights are also Metal-resident. Full Metal (`-ngl 999`) only fits if the quant is below ~480 GiB total. Hosts with multiple GPUs or split-tensor offload across machines should work as expected.
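For hosts that do have enough aggregate VRAM, a hedged sketch of a multi-GPU CUDA launch, assuming the fork keeps upstream's `-ngl` / `--tensor-split` semantics (not validated on Pro in this release):

```bash
# Hypothetical 4-GPU launch — only meaningful if combined VRAM covers the
# quant plus KV cache. Split ratios follow upstream --tensor-split semantics;
# remember the GGML_SCHED_MAX_SPLIT_INPUTS build flag from the build section.
./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/Q2_K-XL/DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.gguf \
  --jinja --reasoning off --ctx-size 65536 \
  -ngl 999 \
  --tensor-split 1,1,1,1
```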
⚙️ `-cmoe` (CPU MoE) on CUDA hosts — stick with `-ub 128`. `-cmoe` overrides MoE weights to CPU but doesn't directly control where the op runs. CUDA's `op_offload` defaults to `true`, and the CUDA backend offloads host-weight ops to GPU when `batch_size ≥ 32` (see `ggml/src/ggml-cuda/ggml-cuda.cu`). Compute buffers are sized for the peak liveness of `n_ubatch`-token graphs, so doubling `-ub` roughly doubles the GPU compute buffer. Reported by @fairydreaming running V4 Pro Q4_K_M on an RTX PRO 6000 Max-Q (96 GB, ~35 GB post-load headroom): `-ub 128` fits; `-ub 512` OOMs at load. Two options (example invocations after this list):

- Recommended: keep `-ub 128` for `-cmoe` runs on V4 Pro — best perf in this configuration.
- Or pass `--op-offload false` to keep MoE compute truly on CPU regardless of ubatch — smaller GPU compute buffer, but slower if your CPU memory bandwidth is the bottleneck.
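The two options as concrete invocations — the Q4_K_M-XL shard name is assumed to follow the Q2_K-XL naming pattern shown earlier, and the flags are taken from the note above rather than re-validated here:

```bash
# Option 1 (recommended): MoE weights on CPU, small ubatch so the CUDA
# compute buffer stays inside the post-load VRAM headroom.
./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/Q4_K_M-XL/DeepSeek-V4-Pro-Q4_K_M-XL-00001-of-00021.gguf \
  --jinja --reasoning off --ctx-size 65536 \
  -ngl 999 -cmoe -ub 128

# Option 2: keep MoE compute on the CPU regardless of ubatch size
# (smaller GPU compute buffer; slower if CPU memory bandwidth is the limit).
./build/bin/llama-server \
  --model ~/models/DeepSeek-V4-Pro-GGUF/Q4_K_M-XL/DeepSeek-V4-Pro-Q4_K_M-XL-00001-of-00021.gguf \
  --jinja --reasoning off --ctx-size 65536 \
  -ngl 999 -cmoe --op-offload false
```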
Sampling values match the model card recommendation (temperature=1.0, top_p=1.0); `--reasoning off` is the cleanest baseline for agent workloads.
Quirks worth knowing
- `--cache-type-k|v q8_0` is silently overridden to f16 on V4. Inherited V4 Flash quirk — V4's K is FP8-quantized at write time, breaking q8_0's per-block stationarity assumption.
- `--no-repack` is required for V4 quants in CPU mode on hosts smaller than ~600 GiB RAM. Inherited V4 Flash quirk.
- `graph_max_nodes` was bumped in this fork from 524288 → 2097152 to fit V4 Pro's wider compressor decode path. Older V4 builds will GGML_ASSERT on `dsv4_build_compressor_decode_projected → ggml_set_rows` when loading any Pro quant.
- `convert_hf_to_gguf.py --use-temp-file` is required for V4 Pro. Without it, the in-memory tensor buffer exceeds 512 GiB RAM and the converter is killed by Jetsam on macOS.
- Validation gates: `tests/v4-port/run-all-gates.sh` in the fork. Per-quant gate-tools runs were skipped on this release because every load takes ~10 min on Pro at 512 GiB RAM; users with more RAM should re-run the gates locally.
Provenance
- Source: `deepseek-ai/DeepSeek-V4-Pro` HF safetensors (FP8 e4m3 weights, FP4 routed experts).
- bf16-experts-Q8 staging GGUF (not published): built via `convert_hf_to_gguf.py --outtype bf16 --deepseek4-expert-outtypes "w1=q8_0,w2=q8_0,w3=q8_0" --use-temp-file --deepseek4-expert-workers 16`. Used as the source for Q2_K-XL and Q4_K_M-XL.
- Q8_0: built via `llama-quantize` from the bf16-experts-Q8 staging GGUF (the runbook's safetensors → Q8_0 path was avoided for disk reasons; Q8 from BF16 has the same quant-hop count as Q8 from safetensors). No imatrix used (Q8 doesn't benefit).
- Q4_K_M-XL / Q2_K-XL: produced via `llama-quantize` with the V4 fork's V4-tensor pin recipe (`output_hc=q8_0, attn_compressor_*=q8_0, attn_q_a/b, attn_kv, attn_output_a/b, hc_attn=q8_0, hc_ffn=q8_0, indexer=q8_0, nextn=q8_0`). No imatrix — all three K-quants here build cleanly without it (only IQ-class quants strictly require it).
- Chat template: baked into shard 1 of every quant via `gguf-py/gguf/scripts/gguf_new_metadata.py --chat-template "$(cat dsml.jinja)"` after split (sketch below).
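A sketch of that chat-template bake, assuming `gguf_new_metadata.py`'s usual positional input/output arguments; the shard name is the Q2_K-XL one from this repo and the output filename is a placeholder:

```bash
# Writes a copy of shard 1 with the DSML template injected into
# tokenizer.chat_template; the remaining shards are left untouched.
python gguf-py/gguf/scripts/gguf_new_metadata.py \
  DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.gguf \
  DeepSeek-V4-Pro-Q2_K-XL-00001-of-00013.template.gguf \
  --chat-template "$(cat imatrix/dsml.jinja)"
```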
License
MIT, matching the upstream DeepSeek V4 Pro license.