Will there be a smaller model like Qwen3.5 122B or Nemotron 3 Super?

#9
by mayankiit04 - opened

Will there be a smaller model like Qwen3.5 122B or Nemotron 3 Super that would help average Joes?

This model is almost six times smaller than the big one. It's quite literally THE first time they've released a smaller model, and you should be grateful it even exists.
"Flash" is to the 1.6T model what the 80B Qwen is to the 400B Qwen (you get the idea: there's no Qwen Max available in open weights, and NVIDIA doesn't have any huge models available either).

Well, models of that size might be seen in the future, maybe in a year, two, three? I agree with him on one point: DeepSeek has consistently released official distillates that you can run at home. They took Qwen's base models and trained them with SFT on roughly 800k samples. It seems like they could release an 8B distillate now, but DeepSeek needs money, and it's more profitable for them if we use their models via API, because right now their API margin is huge, if we believe the article they published.
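
For context, that recipe is just plain supervised fine-tuning on teacher-generated completions. A minimal sketch of the idea (the JSONL file, its column names, and the choice of student checkpoint are my assumptions for illustration, not DeepSeek's actual pipeline):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

student = "Qwen/Qwen2.5-7B"  # a Qwen base model as the distillation student
tok = AutoTokenizer.from_pretrained(student)
tok.pad_token = tok.pad_token or tok.eos_token
model = AutoModelForCausalLM.from_pretrained(student)

# Hypothetical JSONL of teacher-generated (prompt, response) pairs.
ds = load_dataset("json", data_files="teacher_samples.jsonl")["train"]

def tokenize(ex):
    # Plain SFT: concatenate prompt and teacher response, train on next-token loss.
    return tok(ex["prompt"] + ex["response"] + tok.eos_token,
               truncation=True, max_length=2048)

ds = ds.map(tokenize, remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="distill-sft", per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # pads batches and sets labels
).train()
```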

122B is still way too big. We need a smaller MoE, like a Gemma 4 26B-A4B. The average Joe doesn't have 128 GB of RAM, or even 64 GB, but 32 GB.
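
The arithmetic behind that is simple: weight-only footprint is roughly params * bits per weight / 8, and KV cache plus runtime overhead come on top. A quick sketch (the bits-per-weight figures are typical GGUF quant averages, not exact):

```python
# Rough weight-only size in GB: params (in billions) * bits_per_weight / 8.
# Real GGUF quants mix bit widths, so treat these as ballpark lower bounds.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

for name, params in [("122B", 122), ("80B", 80), ("26B", 26)]:
    for quant, bits in [("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
        print(f"{name} @ {quant}: ~{weight_gb(params, bits):.0f} GB")

# 122B @ Q4_K_M ~= 73 GB: hopeless on a 32 GB machine.
# 26B  @ Q4_K_M ~= 16 GB: fits a 32 GB box with room left for context.
```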

@Dampfinchen the average Joe has 2x RTX 6000 Pro.

We need more 100-220B models, because they are really capable of doing things right and getting work done.
Everything below that is just playing around...

What are you talking about? You must live in a strange bubble if you think the average Joe has that much VRAM. An RTX 6000 Pro costs around 10K, lol. Most people have 32 GB of RAM and 8 GB of VRAM, as per the Steam Hardware Survey: https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam

What you're talking about are professional use cases and AI datacenters, not even prosumer setups (prosumers have around 48 GB of VRAM max).

Honestly, I don't have many resources, but I managed to gather 3x RTX 3090 (used) for a total of less than 1.5K over about two years, which gives me 72 GB of VRAM.
Right now I would say 80-100B is perfect, the Qwen 80B Coder Next one being surprisingly capable.
Qwen 122B is just a tad too big, but if I got my hands on one more RTX 3090, I too would argue that a 100-200B MoE is the perfect size.
The ~30B models don't feel like a big difference from 100B at first, but on longer runs and more niche tasks you start to notice that there's just some intelligence missing.
E.g., Qwen 30B-A3B makes grammatical mistakes in roughly every second sentence when writing German, while Qwen 80B does that too, but way, way less frequently.
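
For anyone piecing together a similar multi-3090 box: spreading a GGUF across the cards is a single parameter in llama-cpp-python. A sketch (the model filename is hypothetical, and the tensor_split ratios depend on your actual cards):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="qwen-80b-q4_k_m.gguf",  # hypothetical filename; any big GGUF works the same way
    n_gpu_layers=-1,                    # offload every layer to GPU
    tensor_split=[1.0, 1.0, 1.0],       # split weights evenly across 3x RTX 3090
    n_ctx=8192,
)

out = llm("Schreibe zwei Sätze auf Deutsch über MoE-Modelle.", max_tokens=96)
print(out["choices"][0]["text"])
```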

Wouldn't agree that the average Joe has 2x RTX 6000, but I'd also say it's kind of dumb to think you need the "newest", most expensive NVIDIA chips to work with.
You can always buy this stuff used.
The average enthusiast only needs a few RTX 3090s.

I'll just add that I too am an average Joe with 2x RTX 6000 Pro, and this is a good size to actually fit the model and do useful work with.

If only they enabled sm120 to work by default...

But still, thanks DeepSeek V4! Just waiting for sm120 support.
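
In the meantime, you can at least confirm what your card reports and force the target arch when compiling kernels from source. A sketch with PyTorch (assumption: the project you're building honors TORCH_CUDA_ARCH_LIST, as torch.utils.cpp_extension does):

```python
import os
import torch

# Blackwell workstation cards like the RTX 6000 Pro report
# compute capability (12, 0), i.e. sm_120.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 reports sm_{major}{minor}")

# Many projects compile CUDA kernels only for the arch list they ship with.
# Setting this before building from source forces sm_120 codegen.
os.environ["TORCH_CUDA_ARCH_LIST"] = "12.0"
```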
