How can I run DeepSeek on Ada GPUs? Mine is an L20.
Does the L20 card not support this model? I am using vLLM.
The L20 GPU has 48 GB of memory, so you don't have enough space to load the DeepSeek-V4 models. From my understanding, you need at least ~158 GB of memory for V4-Flash.
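A quick back-of-the-envelope check of whether the weights fit across multiple cards. This is a hedged sketch: the ~158 GB figure for V4-Flash comes from this thread, the 48 GB per L20 is the card's spec, and the 70% weight-budget fraction is an assumed rule of thumb to leave headroom for KV cache and activations.

```python
# Rough feasibility check: do the model weights fit across the available GPUs?
# ~158 GB for DeepSeek-V4-Flash (figure from this thread); 48 GB per L20.
# weight_fraction = 0.7 is an assumed budget, leaving room for KV cache.

def fits_on_gpus(model_mem_gb: float, num_gpus: int, gpu_mem_gb: float,
                 weight_fraction: float = 0.7) -> bool:
    """True if the weights fit when each GPU reserves `weight_fraction`
    of its memory for weights (the rest goes to KV cache, activations)."""
    usable = num_gpus * gpu_mem_gb * weight_fraction
    return model_mem_gb <= usable

print(fits_on_gpus(158, 1, 48))   # single L20 -> False
print(fits_on_gpus(158, 8, 48))   # 8x L20 -> True, memory-wise
```

So with 8×L20 the raw memory is there; as the replies below note, the blocker is the architecture, not capacity.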
I have 8*L20, so GPU memory is not the problem. The architecture simply doesn't support running it.
This PR may help you: https://github.com/vllm-project/vllm/pull/40906. I haven't tried it yet, but the decoding speed seems unsatisfactory.
Unfortunately, the L20 is SM89, so it will not be officially supported by vLLM. From https://github.com/vllm-project/vllm/issues/40902:
"We don't plan to support hardware below SM90 in the official repo, since that would introduce significant maintenance overhead."
The PR is your best bet. Alternatively, start from the inference code provided with DeepSeek-V4.
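To see why the L20 falls outside official support, you can compare compute capabilities directly. A minimal sketch: in practice you would read a device's capability with `torch.cuda.get_device_capability(i)`; the tuples below are hard-coded for illustration, and the (9, 0) cutoff is the SM90 floor quoted above.

```python
# The L20 reports compute capability (8, 9) -- SM89, Ada -- while vLLM's
# official support for this model starts at (9, 0) -- SM90, Hopper.
# In a real setup, obtain the tuple via torch.cuda.get_device_capability(i).

def meets_sm(capability: tuple[int, int], required: tuple[int, int] = (9, 0)) -> bool:
    """True if a device's compute capability meets the kernel requirement."""
    return capability >= required

print(meets_sm((8, 9)))  # L20 (Ada, SM89) -> False
print(meets_sm((9, 0)))  # H100 (Hopper, SM90) -> True
```

Tuple comparison handles the major/minor split correctly (e.g. SM89 sorts below SM90 even though 89 > 9.0 numerically as a float).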