Question about the activation-function proposal in "Observations and Proposals": why does removing the gate projection relax the EP bandwidth requirement?

#126

by MarjorTom - opened 12 days ago

Discussion

MarjorTom

12 days ago

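Thanks to the team for sharing such a detailed technical report; I learned a lot from it. While rereading the activation-function proposal in Section 3.1, "Observations and Proposals", I got confused about the reasoning behind it, and I hope the authors can explain further.

Original text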

Activation Function. We propose replacing SwiGLU with a low-cost element-wise activation that involves no exponential or division operations. This lightens the post-GEMM processing directly, and under the same parameter budget, removing the gate projection enlarges the intermediate dimension d, further relaxing the bandwidth requirement.

My derivation

From the critical-ratio formula given earlier in the paper:

$\frac{C}{B} \leq \frac{V_{\text{comp}}}{V_{\text{comm}}} = \frac{6hd}{3h} = 2d$

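Scenario A (SwiGLU):

Parameters: 3hd (gate + up + down)
Compute per token-expert pair: 6hd FLOPs
Communication volume: 3h bytes
Critical ratio: 2d

Scenario B (plain FFN, same parameter budget):

Parameter budget constraint: 2h·d' = 3hd, so d' = 1.5d
Compute per token-expert pair: 2 × 2h·d' = 4h·d' = 6hd FLOPs (same as Scenario A)
Communication volume: 3h bytes (same as Scenario A)
Critical ratio: 6hd / 3h = 2d (same as Scenario A)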
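
The two scenarios can be checked numerically with a short sketch. The function name `expert_cost` and the sizes `h = 4096`, `d = 2048` are hypothetical, chosen only for illustration; the 2-FLOPs-per-parameter count and the 3h bytes of communication per token are taken from the derivation above, not from any measured kernel.

```python
from fractions import Fraction


def expert_cost(h, d, num_projections):
    """Per token-expert cost of an FFN expert with `num_projections` h-by-d matrices.

    Returns (params, flops, comm_bytes, critical_ratio), following the
    accounting used in the derivation above.
    """
    params = num_projections * h * d      # each projection holds h*d weights
    flops = 2 * params                    # one multiply and one add per weight
    comm_bytes = 3 * h                    # assumed per-token EP traffic (3h bytes)
    return params, flops, comm_bytes, Fraction(flops, comm_bytes)


# Hypothetical sizes, for illustration only.
h, d = 4096, 2048

# Scenario A: SwiGLU with gate, up, and down projections.
swiglu = expert_cost(h, d, num_projections=3)

# Scenario B: plain FFN (up and down only) under the same parameter budget,
# so 2*h*d' = 3*h*d and d' = 1.5*d.
d_prime = 3 * d // 2
plain_ffn = expert_cost(h, d_prime, num_projections=2)

print("SwiGLU   :", swiglu)
print("Plain FFN:", plain_ffn)
```

Under these assumptions both tuples come out identical, which is the observation the question is built on: at the whole-expert level, the critical ratio stays at 2d.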

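My confusion

Under the same-parameter-budget constraint, I find that compute, communication volume, and the critical ratio are all unchanged. In other words, even if the gate projection is removed and the saved parameters go into a larger d', the balance between EP communication and computation does not seem to change.

Is there a mistake somewhere in my derivation?

Thanks again to the team for open-sourcing and sharing this work!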

LyricZ

DeepSeek org 12 days ago

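The statement is about a single block. For one linear-2 block, the amount of communication pushed out by the epilogue is constant. (The epilogue time is not the same as the time for the data to reach the peer, but when other SMs and dispatch have saturated the NVLink queue, flushing data out of the SM also has a cost.) If the K dimension of the matrix multiply grows, however, the MMA takes longer, so the MMA has a better chance of hiding the epilogue. The wording in the report is not quite accurate; we will revise it. Thanks for pointing this out!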
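
The per-block argument can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual kernel: it assumes that a block's epilogue overlaps with the MMA of the next tile, that MMA time grows linearly with K, and that the epilogue cost is a constant. The function name `exposed_epilogue_time` and all timing constants are hypothetical and in arbitrary units.

```python
def exposed_epilogue_time(k, mma_time_per_k, epilogue_time):
    """Epilogue time left unhidden by the overlapping MMA (toy model).

    Assumes MMA time is k * mma_time_per_k and that the epilogue can be
    hidden only for as long as an MMA is running alongside it.
    """
    mma_time = k * mma_time_per_k
    return max(0.0, epilogue_time - mma_time)


# Hypothetical constants in arbitrary time units.
MMA_TIME_PER_K = 1.0
EPILOGUE_TIME = 2500.0

d = 2048                 # K of linear 2 with SwiGLU (intermediate dimension d)
d_prime = 3 * d // 2     # K of linear 2 without the gate projection (d' = 1.5d)

for k in (d, d_prime):
    exposed = exposed_epilogue_time(k, MMA_TIME_PER_K, EPILOGUE_TIME)
    print(f"K = {k}: exposed epilogue time = {exposed}")
```

In this model the total FLOPs and bytes per token are unchanged, as in the question's derivation, but each block of linear 2 runs a longer MMA against the same epilogue, so less of the epilogue remains exposed.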

MarjorTom

12 days ago

Thank you very much for the answer! Now I understand. I had not realized the consideration was this fine-grained.
