Unable to run on 2x RTX Pro 6000 (DEEP_GEMM problem)

#15
by stev236 - opened

At first glance this model seems like a perfect fit for a 192 GB VRAM setup thanks to the FP4 MoE.
I followed the vLLM recipe (https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash), but unfortunately DeepGEMM doesn't seem to support SM120 GPUs: the module vllm/utils/deep_gemm.py throws an assertion error, "Unsupported architecture".
I tried turning DeepGEMM off with VLLM_USE_DEEP_GEMM=0, but it doesn't help.
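For reference, here's roughly what I tried. This is only a sketch of the recipe's docker invocation; the image tag and serving flags are illustrative rather than exact:

# Sketch of the recipe's docker run with DeepGEMM disabled via the env var.
# Image tag and serving flags are illustrative, not exact.
docker run --gpus all --ipc=host -p 8000:8000 \
  -e VLLM_USE_DEEP_GEMM=0 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:deepseekv4-cu130 \
  deepseek-ai/DeepSeek-V4-Flash \
  --trust-remote-code \
  --tensor-parallel-size 2
# Startup still aborts in vllm/utils/deep_gemm.py with "Unsupported architecture".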

Does anybody know of any way to get it working on SM120?
Cheers!

It was teased a lot that Huawei support would come first, so probably just wait?

Same here, doesn't work with the recipe.

I wouldn't mind those Huawei 300i Duo cards. I've seriously considered buying a few more than once, but every time I can't find enough information to assure me I won't just end up with 10k in paperweights.

Same error here: (EngineCore_DP0 pid=256) ERROR 04-24 15:20:17 [core.py:1110] RuntimeError: Worker failed with error 'Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/hyperconnection.hpp:56): Unsupported architecture', please check the stack trace above for the root cause

Seems like a problem with the MHC arch. It handles attention signals fundamentally differently, so it's no wonder there's no support yet. Too bad, because it's always DeepSeek who's hindered by that (MoE, MTP, etc.). Though I heard they wrote custom kernels for MHC or the N-gram stuff, so maybe they have a fix. Idk though, you may just have to wait.

It doesn't support SM120. The only supported architectures are the B300 from Blackwell, plus the A100, H100, and H200.
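If you're not sure what your card reports, you can query the compute capability directly; this assumes a driver recent enough to support the compute_cap query field. The RTX PRO 6000 Blackwell shows up as 12.0 (SM120):

# Print each GPU's name and compute capability (SM120 reports as 12.0).
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader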

I had to get Claude to knock out a few Triton kernels to patch the vLLM cu130 docker image. It took a few hours and I ended up with this monstrosity:

#!/bin/bash
# Run DeepSeek V4 Flash on SM120 (RTX PRO 6000 Blackwell) GPUs.
#
# SM120 issues and workarounds:
# 1. DeepGEMM's C++ kernels reject SM120 at runtime → patched deep_gemm.py
#    dispatches to Triton kernels for lightning indexer (paged + non-paged)
#    and keeps PyTorch fallbacks for tf32_hc_prenorm_gemm / fp8_einsum.
# 2. CUTLASS block-scaled FP8 kernel can't handle e8m0fnu scales through
#    the stable C API → disabled so Triton kernel is used instead.
# 3. Triton doesn't know float8_e8m0fnu dtype → patched triton.py converts
#    e8m0fnu scales to bfloat16 on-the-fly (lossless, same exponent range).
# 4. fused_inv_rope_fp8_quant produces packed INT32 UE8M0 scales on SM100+ →
#    patched deepseek_v4_attention.py forces SM90-style FP32 scales so the
#    PyTorch fp8_einsum fallback can dequantize them.
# 5. FlashMLA sparse_decode_fwd / sparse_prefill_fwd reject SM120 → patched
#    flash_mla_interface.py dispatches to our Triton kernels.

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
PATCH_DIR="$SCRIPT_DIR/sm120-patches"

docker run --gpus all \
  --ipc=host -p 8000:8000 \
  -e VLLM_DISABLED_KERNELS=CutlassFp8BlockScaledMMKernel \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$PATCH_DIR/deep_gemm.py":/usr/local/lib/python3.12/dist-packages/vllm/utils/deep_gemm.py:ro \
  -v "$PATCH_DIR/triton.py":/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/scaled_mm/triton.py:ro \
  -v "$PATCH_DIR/deepseek_v4_attention.py":/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/deepseek_v4_attention.py:ro \
  -v "$PATCH_DIR/flash_mla_interface.py":/usr/local/lib/python3.12/dist-packages/vllm/third_party/flashmla/flash_mla_interface.py:ro \
  -v "$PATCH_DIR/triton_kernels":/usr/local/lib/python3.12/dist-packages/sm120_triton_kernels:ro \
  -v "$PATCH_DIR/chat_template.jinja":/patches/chat_template.jinja:ro \
  vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Flash \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --max-model-len 200000 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --compilation-config='{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1, 2, 4, 8, 16]}' \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"thinking": true, "reasoning_effort": "high"}' \
  --reasoning-parser deepseek_v4 \
  --chat-template /patches/chat_template.jinja \
  --attention-backend FLASHMLA_SPARSE
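
If you do get it running, a quick sanity check is an ordinary OpenAI-compatible request against port 8000; the model name below matches what the script serves, so adjust it if yours differs:

# Minimal smoke test against the OpenAI-compatible endpoint started above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V4-Flash",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'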

You're probably better off waiting for official support

Currently, DeepGEMM has no plan to support SM120. When will that become possible?

I'm crying as a 4090 48GB user; that card is SM89 and may never work with DeepGEMM.

@bash99 code it in CUDA! They accept community contributions. Someone already vibe-coded SM120 support all the way into vLLM lol

Was anyone able to make it work with 2x6000 Pros??

Patiently waiting with my 2x rtx 6000 pro

Same here
