Technical Report Summary

#129
by mishig

DeepSeek-V4

Towards Highly Efficient Million-Token Context Intelligence

Preview Release · DeepSeek-AI · Model checkpoints

You can find the full technical report here.


TL;DR — Two new open MoE models (V4-Pro 1.6T/49B-active, V4-Flash 284B/13B-active) that run native 1M-token context at a fraction of the compute and memory of DeepSeek-V3.2. The big contributions are architectural: a hybrid CSA + HCA attention stack, Manifold-Constrained Hyper-Connections (mHC), and the Muon optimizer at 1.6T scale.

  • V4-Pro FLOPs vs V3.2 @ 1M ctx: 27%
  • V4-Pro KV cache vs V3.2: 10%
  • V4-Flash FLOPs / KV cache vs V3.2: 10% / 7%
  • Pre-training tokens: 32–33T

Why it matters

  • Million-token context at practical cost. The quadratic-attention wall is the real bottleneck on test-time scaling — dropping 1M-context inference to ~10% of prior KV-cache footprint makes long-horizon agentic and multi-document workloads economically routine rather than prohibitive.
  • Frontier-class open weights. V4-Pro-Max beats GPT-5.2 and Gemini-3.0-Pro on reasoning, trails GPT-5.4 / Gemini-3.1-Pro by ~3–6 months, and on internal agent evals surpasses Claude Sonnet 4.5 while approaching Opus 4.5.
  • Unlocks the next scaling regime. The authors frame this explicitly: efficient ultra-long sequences are the foundation for further test-time scaling and for future paradigms like online learning.


Figure 1 — Left: V4-Pro-Max vs Claude-Opus-4.6, GPT-5.4, Gemini-3.1 on knowledge, reasoning, and agentic benchmarks. Right: single-token inference FLOPs and accumulated KV-cache size vs DeepSeek-V3.2 out to 1M tokens.

What is actually new

1. Hybrid attention: CSA + HCA

The headline architectural change. Two complementary attention variants are interleaved:

  • Compressed Sparse Attention (CSA) — compresses every m KV tokens into one entry via learned compression weights, then applies DeepSeek Sparse Attention (top-k selection via a Lightning Indexer) plus a sliding window for local detail.
  • Heavily Compressed Attention (HCA) — same compression idea but far more aggressive (m′ ≫ m) with dense attention over the compressed stream. Interleaving CSA/HCA layers is what makes 1M context tractable.


Figure 3 — CSA compresses KV entries m-to-1, then uses a Lightning Indexer to top-k select compressed blocks; a sliding window preserves local fine-grained dependencies.
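For intuition, here is a minimal single-query PyTorch sketch of the CSA recipe described above. The block size, top-k count, and window are made-up numbers, mean pooling stands in for the learned compression weights, and a plain dot-product score stands in for the Lightning Indexer; this is a sketch of the idea, not the report's implementation.

```python
import torch
import torch.nn.functional as F

def csa_sketch(q, k, v, m=16, top_k=8, window=128):
    """Toy single-head CSA step for one query position (the last token).

    q: (d,)        current query
    k, v: (T, d)   full per-token keys/values
    m:             tokens per compressed KV entry
    top_k:         number of compressed blocks kept by the indexer stand-in
    window:        sliding-window size for local fine-grained attention
    """
    T, d = k.shape

    # 1) Compress every m KV tokens into one entry (mean pooling here,
    #    learned compression weights in the report).
    n_blocks = T // m
    k_c = k[: n_blocks * m].reshape(n_blocks, m, d).mean(dim=1)   # (n_blocks, d)
    v_c = v[: n_blocks * m].reshape(n_blocks, m, d).mean(dim=1)

    # 2) Indexer stand-in: score compressed blocks and keep the top-k.
    scores = k_c @ q                                   # (n_blocks,)
    idx = scores.topk(min(top_k, n_blocks)).indices

    # 3) Attend over the selected compressed entries plus a local sliding window.
    k_sel = torch.cat([k_c[idx], k[-window:]], dim=0)
    v_sel = torch.cat([v_c[idx], v[-window:]], dim=0)
    attn = F.softmax(k_sel @ q / d**0.5, dim=0)
    return attn @ v_sel

# usage
T, d = 4096, 64
out = csa_sketch(torch.randn(d), torch.randn(T, d), torch.randn(T, d))
```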

2. Manifold-Constrained Hyper-Connections (mHC)

Upgrades residual connections by projecting the residual mapping matrix B_l onto the Birkhoff polytope (doubly stochastic matrices) via Sinkhorn-Knopp iteration. This bounds its spectral norm to at most 1, keeping the transformation non-expansive, and fixes the numerical instability that prevented stacking plain Hyper-Connections deeply. A real contribution to residual-stream design.
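A minimal sketch of the Sinkhorn-Knopp projection, assuming B_l is parameterized by unconstrained logits (the names and iteration count are illustrative, not the report's recipe):

```python
import torch

def sinkhorn_birkhoff(logits, n_iters=20, eps=1e-8):
    """Map an (n, n) matrix of unconstrained logits toward the Birkhoff
    polytope by exponentiating and alternately normalizing rows and columns."""
    B = logits.exp()                                    # strictly positive entries
    for _ in range(n_iters):
        B = B / (B.sum(dim=1, keepdim=True) + eps)      # rows sum to 1
        B = B / (B.sum(dim=0, keepdim=True) + eps)      # columns sum to 1
    return B

B = sinkhorn_birkhoff(torch.randn(4, 4))
# spectral norm of a (near-)doubly-stochastic matrix is at or just below 1
print(torch.linalg.matrix_norm(B, ord=2))
```

Because a doubly stochastic mixing matrix cannot amplify the residual stream, each mHC layer stays non-expansive, which is exactly the blow-up that limited how deep plain Hyper-Connections could be stacked.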

3. Muon optimizer at 1.6T scale

First deployment of Muon on a trillion-plus MoE, paired with a custom hybrid ZeRO strategy. Reported faster convergence and better stability than AdamW-class baselines at this size.
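The report's distributed variant is not spelled out here, but for context, a hedged sketch of Muon's core step as found in the public open-source implementation: approximately orthogonalize the 2-D momentum matrix with a quintic Newton-Schulz iteration, then apply it as the update.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D gradient/momentum matrix (Muon's core).
    Coefficients follow the commonly used open-source implementation; the
    trillion-scale variant with hybrid ZeRO described in the report may differ."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # Frobenius scaling => singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                           # work with the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Muon update sketch with momentum buffer M and learning rate lr:
#   M = beta * M + grad
#   W -= lr * newton_schulz_orthogonalize(M)
```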

4. FP4 quantization-aware training

Not just inference quantization — FP4 QAT is applied to MoE expert weights and the indexer QK path during training itself. FP4×FP8 GEMMs can be up to ⅓ faster on future hardware.
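As a rough illustration of the QAT mechanism: generic fake quantization with a straight-through estimator. Real FP4 (e.g. E2M1) uses non-uniform levels and per-block scales, and the report's exact scheme is not reproduced here.

```python
import torch

def fake_quant_ste(w, levels=16):
    """Generic QAT fake-quantization: quantize weights in the forward pass but
    let gradients flow through unchanged (straight-through estimator). A stand-in
    for FP4 QAT, not the report's FP4 format or scaling."""
    scale = w.abs().max() / (levels // 2 - 1) + 1e-12
    w_q = (w / scale).round().clamp(-(levels // 2), levels // 2 - 1) * scale
    return w + (w_q - w).detach()     # forward: quantized value, backward: identity

# During training, expert weights (and the indexer QK path) would pass through
# fake quantization before the GEMM, so the model learns to tolerate 4-bit precision.
```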


Figure 2 — Overall V4 block. Hybrid CSA/HCA attention, DeepSeekMoE feed-forward, mHC-strengthened residual connections, and MTP heads at the output.

5. Infrastructure firsts

  • Single fused MoE kernel that overlaps compute, communication, and memory access.
  • TileLang — DSL balancing kernel productivity vs efficiency.
  • Batch-invariant deterministic kernel library — bitwise reproducibility across training and inference.
  • Two-stage contextual parallelism for compressed attention.
  • Heterogeneous KV-cache structure with on-disk storage for shared-prefix reuse.

6. Post-training: specialist-then-distill

Train independent SFT + GRPO experts per domain (math, code, agent, instruction-following), then consolidate into one model via on-policy distillation with reverse-KL loss. A cleaner formulation of the "many specialists → one generalist" recipe.
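A minimal sketch of the reverse-KL term, with sequences assumed to be sampled from the student (which is what makes it on-policy); sampling, masking, and any baselines the report may use are omitted, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def reverse_kl_distill_loss(student_logits, teacher_logits):
    """KL(student || teacher) averaged over positions of student-sampled
    sequences. Shapes: (batch, seq, vocab)."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    # E_{x ~ student}[log p_student(x) - log p_teacher(x)]
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)
    return kl.mean()
```

Reverse KL is mode-seeking: the consolidated generalist is pushed to stay inside the specialist teacher's high-probability behavior rather than spreading mass over everything the teacher might do.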

7. Architectural housekeeping worth noting

  • MoE affinity scoring changed from Sigmoid to Sqrt(Softplus).
  • Removed cap on routing target nodes for MoE.
  • Hash-routed MoE replaces dense FFNs in the earliest transformer blocks.
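For the last item, a toy illustration of what hash routing means: a fixed, non-learned mapping from token to expert, so the earliest blocks need no trained router (the report's actual hash and expert counts are not specified here).

```python
import torch

def hash_route(token_ids, n_experts):
    """Toy hash routing: each token id maps deterministically to one expert
    (a simple modulo stands in for whatever hash the report uses)."""
    return token_ids % n_experts

expert_ids = hash_route(torch.tensor([101, 2043, 7]), n_experts=4)
```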

Bottom line

The paper's significance lies in the efficiency-per-context-token curve, not raw capability: V4 makes 1M-token reasoning economically viable on open weights. The novelty concentrates in two places:

  1. The CSA/HCA hybrid attention scheme for ultra-long contexts.
  2. mHC's manifold constraint stabilizing deep residual architectures.

The remainder is strong engineering consolidation: Muon at scale, FP4-QAT, deterministic kernels, on-policy distillation. Capability-wise V4-Pro-Max appears to land ~3–6 months behind the leading closed frontier models, while leading the open-weights pack on agentic and long-context tasks.
