Worse than (smaller) MiniMax M2.7??
Not a slight, since you can't win 'em all, but this is a larger, more recent model, and ArtificialAnalysis ranks it as worse than the (smaller, but also recent) MiniMax M2.7.
Is that truly the case, or is something wrong? Is there anything special in this model vs. MiniMax that the benchmarks aren't reflecting well?
> Anything special
A functioning 1M context length that shouldn't require a disgustingly big VRAM reserve (no idea how well Flash holds up against Pro, though).
> is something wrong?
Yes. You're letting the benchmarks cloud your vision. The model is as innovative as it gets. Just a year ago, something like this was unheard of; most users were sticking to 32K-64K context at the same VRAM usage.
The question is how long we'll have to wait until llama.cpp supports it properly and GGUFs start rolling out...
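For a rough sense of why 1M context doesn't have to mean a monstrous VRAM reserve, here's some back-of-the-envelope KV-cache math. The architecture numbers below are hypothetical placeholders (not this model's published config), and the hybrid-attention split is just one plausible way to cut the reserve:

```python
def kv_cache_gib(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache size for one sequence in GiB (2x for K and V; fp16 = 2 bytes)."""
    return 2 * ctx_len * n_layers * n_kv_heads * head_dim * bytes_per_elem / 1024**3

# Dense-attention baseline: every layer caches the full 1M context.
print(kv_cache_gib(1_000_000, n_layers=48, n_kv_heads=8, head_dim=128))  # ~183 GiB

# Hypothetical hybrid: only 6 layers keep full-context KV, the rest use a
# sliding window -- the kind of design that makes 1M context affordable.
full  = kv_cache_gib(1_000_000, n_layers=6,  n_kv_heads=8, head_dim=128)  # ~23 GiB
local = kv_cache_gib(8_192,     n_layers=42, n_kv_heads=8, head_dim=128)  # ~1.3 GiB
print(full + local)
```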
Keep it respectful, please; I didn't invite ad hominem, and there's no reason to assume it's acceptable.
I didn't mean to offend you, but it's a real issue these days: people focusing too much on benchmark results.
> ArtificialAnalysis ranks it as worse than the (smaller, but also recent) MiniMax M2.7. Is that truly the case, or is something wrong?
My dude, it's been out only a few hours. It's not even merged into mainline transformers, last I checked.
Just chill.
It's not that far from MiniMax M2.7 (AAI 50 vs. 47), and it could still be a winner if it takes aggressive quantization better than MM does.
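To put numbers on that: weight footprint scales linearly with bits per weight, so a bigger model that survives ~3 bpw can end up cheaper to run than a smaller one that only tolerates ~4.5 bpw. The parameter counts below are made up for illustration, not either model's real size:

```python
def weights_gib(n_params_b, bits_per_weight):
    # parameters (in billions) * bits per weight -> GiB of weights
    return n_params_b * 1e9 * bits_per_weight / 8 / 1024**3

for bpw in (8.0, 4.5, 3.0, 2.5):  # roughly Q8_0 down to ~IQ2-class GGUF quants
    print(f"{bpw:>4} bpw:  300B -> {weights_gib(300, bpw):6.1f} GiB   "
          f"200B -> {weights_gib(200, bpw):6.1f} GiB")
```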
> Yes. You're letting the benchmarks cloud your vision.
> Keep it respectful, please; I didn't invite ad hominem, and there's no reason to assume it's acceptable.
I thought the way he worded it was perfectly respectful. Benchmarks don't do the model justice.
That being said, I have 2x RTX Pro 6000s, and hopefully I can fit this Flash version with good context numbers and still have a functioning 1M-context model with some overflow to system RAM.
If KTransformers adapts to this, that will be beautiful.
It's a preview, my dear.
MiniMax M2.7 looks like it's overfitting benchmarks. I asked Claude to test it via 14 tasks generated by Sonnet; only one passed while 13 failed.
Just curious, did you run that same test on M2.5?
No, I don't have the resources or time to do that. However, I tested several gpt-oss-120B-related models, and they pass 10 to 14 of the 14 tasks on the same test. One of them can be found at https://huggingface.co/cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled; these tests were generated by Sonnet 4.6 and then verified by Gemini Pro.
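For reference, a harness along these lines can be tiny. This is just a sketch of the pattern (the endpoint, model name, and task schema are assumptions, not my actual setup), and a real version would also need to extract the code block from the model's markdown reply:

```python
import json
import subprocess
import tempfile

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def run_task(task):
    """task = {'prompt': str, 'test_code': str} -- hypothetical schema."""
    resp = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": task["prompt"]}],
    )
    answer = resp.choices[0].message.content
    # Append the task's own assertions and execute in a subprocess.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(answer + "\n" + task["test_code"])
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, timeout=120)
    return result.returncode == 0  # pass = the task's tests ran clean

tasks = json.load(open("tasks.json"))  # e.g. the 14 Sonnet-generated tasks
passed = sum(run_task(t) for t in tasks)
print(f"{passed}/{len(tasks)} passed")
```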
Thanks for elaborating! I assume that if those had been coding tests, MM 2.7 would have done pretty well... on par with Claude and other SOTA models, right?
Looking closer, there were 5 coding tests, and it failed there too... hmm.
code_01  Median of Two Sorted Arrays            ✅ PASS  100%  25.0s  35
code_02  Thread-Safe LRU Cache with TTL         ✅ PASS  100%  38.9s  36
code_03  Persistent Segment Tree - K-th Query   ✅ PASS  100%  45.9s  35
code_04  Multi-Head Attention + RoPE (NumPy)    ✅ PASS  100%  38.0s  33
code_05  Dijkstra vs A* on Large Random Graph   ✅ PASS  100%  26.1s  34
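For a sense of what these tasks involve, here's a standard O(log min(m, n)) solution to code_01, binary-searching the partition point. This is a generic reference implementation, not the harness's actual grader:

```python
def find_median_sorted_arrays(a, b):
    if len(a) > len(b):
        a, b = b, a                  # binary-search over the shorter array
    m, n = len(a), len(b)
    half = (m + n + 1) // 2          # size of the combined left partition
    lo, hi = 0, m
    while lo <= hi:
        i = (lo + hi) // 2           # elements taken from a's left side
        j = half - i                 # elements taken from b's left side
        a_left  = a[i - 1] if i > 0 else float("-inf")
        a_right = a[i]     if i < m else float("inf")
        b_left  = b[j - 1] if j > 0 else float("-inf")
        b_right = b[j]     if j < n else float("inf")
        if a_left <= b_right and b_left <= a_right:   # valid partition
            if (m + n) % 2:
                return max(a_left, b_left)
            return (max(a_left, b_left) + min(a_right, b_right)) / 2
        if a_left > b_right:
            hi = i - 1
        else:
            lo = i + 1

assert find_median_sorted_arrays([1, 3], [2]) == 2
assert find_median_sorted_arrays([1, 2], [3, 4]) == 2.5
```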
Here, I reported the issue, but you'll need to translate it: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/discussions/16#69de35f3c4cb3fe610074295
> Yes. You're letting the benchmarks cloud your vision.
> Keep it respectful, please; I didn't invite ad hominem, and there's no reason to assume it's acceptable.
You didn't respect DeepSeek first.
> Yes. You're letting the benchmarks cloud your vision.
> Keep it respectful, please; I didn't invite ad hominem, and there's no reason to assume it's acceptable.
That's not ad hominem; they're targeting your argument directly. You were using benchmarks as the sole indicator of quality and innovation, and they called you out, that's all.