Too big to run locally.
Please also release a smaller model so people with average hardware (32 GB RAM) can run it. See Gemma 4 26B A4B and Qwen 3.6 35B A3B.
Dude, I already explained earlier why DeepSeek won’t do that, but no — not only did you leave a message in the previous discussion, you also started a new one. Look at the Pro discussions — the people from DeepSeek read everything there. If you really want a small model, just rent two H200s or a B200, and distill and train your own model based on Qwen Base. I’d actually be glad if you did that, because Qwen is heavily censored and not suitable for creative writing.
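For anyone who does want to try that route, here's a minimal sketch of the standard soft-label distillation objective (the classic Hinton-style KD loss; this is an illustrative assumption about the approach, not DeepSeek's or anyone's actual training recipe). The student is trained to minimize the KL divergence between its output distribution and the teacher's temperature-softened one:

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    This is the per-position soft-label term; in a real pipeline you'd
    average it over all token positions and usually mix it with the
    ordinary next-token cross-entropy loss.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical logits -> zero loss; disagreement -> positive loss.
assert abs(distill_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])) < 1e-9
assert distill_loss([3.0, 1.0], [1.0, 3.0]) > 0.0
```

In practice the temperature softens both distributions so the student also learns the teacher's ranking of "wrong" tokens, which is where most of the transferred knowledge lives.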
We need more 120-200b models!!!!
Distills are terrible, though; we need the same DeepSeek V4 architecture, just scaled down to run on consumer hardware. Qwen doesn't have the attention mechanisms that make DeepSeek V4 so lightweight and good.
See DeepSeek V2 Lite: that was exactly this, so they have done it in the past and can do it again. We need a DeepSeek V4 Lite, just a little bigger, more like 30B-A5B.
This is that, though: it has around 258B parameters, so you already got what you want.
DeepSeek 3.2 was heavily censored in my experience, at least any time you mention China (much more so than 3.1 Terminus). I'm curious about V4, but the trend was not in our favor...
@NodeLinker Can you briefly re-summarize why DeepSeek won't release a smaller model? It seems to me that going through the process of making a smaller version of DS would provide them insights into how they can make their larger models better. For example, pushing things to the extreme is a reliable way to shine a light on weaknesses in a model's design.
Also, while you're right about Qwen being overly censored and bad at creative writing, the truth is that both Qwen bases and fine-tunes are filled with gaping holes and are notably inferior to other model families like Gemma 4, despite their suspiciously high test scores. That is why, across all measured categories on LMsys, Q3.5 ranks notably lower than G4.
For example, without thinking enabled they reliably make stupid errors, such as repeating entries in synonym lists, using the word as its own synonym, or, during grammar checking, listing corrections they never even made. Plus, in long outputs like stories, they will consistently contradict themselves and the user's prompt, and become highly repetitive in response to wildly different prompts. And with thinking enabled they burn through far more tokens and are more likely to get stuck in infinite loops (even with a presence penalty of 1.5).
Point being, in the real world Qwen models are far weaker than their test scores suggest, with the possible exception of coding, which is why it would be nice to have a Chinese company with greater integrity, like DeepSeek, throw their hat in the ring. I suspect DeepSeek, unlike Alibaba, is capable of making a model that fits in 32 GB of consumer hardware and genuinely competes with Gemma 4.
They never make their own small models; they always distill them into Qwen models. Why would they start making them now? And what exactly gives DeepSeek more integrity? Janus was the only model they made that I can run, plus some other ones I don't care about.
@curtis1969 DeepSeek initially released smaller models like DeepSeek 67b and DeepSeek 16b MOE.
They have integrity because, at least so far, they train democratically on all of humanity's popular data, versus focusing almost exclusively on the data that maximizes popular test scores, such as STEM and coding data. For example, they have high (and legitimate) English SimpleQA scores. Plus, their test scores accurately represent their performance, and they actively publish groundbreaking research ASAP (e.g. MoE and n-gram work).
Anyways, their distilled versions of Qwen are pointless because it's their architecture and balanced training that makes DeepSeek special, not their fine-tuning. So distilling DeepSeek into a Qwen base (which is a generally very weak and very lopsided base) just results in crap.
@phil111 Have you compared DeepSeek V4 series to V3.X ones on knowledge breadth?
@CHNtentes No, I haven't compared DSv4 to v3. I only fully test knowledge breadth on models that can fit within 64 GB of RAM @ Q4_K_M. But I spot check them online and the DeepSeek models appear to have the broad knowledge indicated by their English SimpleQA scores.
I appreciate DeepSeek's integrity because at least a couple of models (Ernie 4.5 21b-a3b and Qwen3-235B-A22B) flat-out and egregiously cheated on English SimpleQA. For example, Qwen3-235b scored 54.3, which shocked me because that score is higher than the much larger and English-focused Opus and GPT4x models, so I tested it in depth, and it only had the broad knowledge indicative of an English SimpleQA score of ~17. And Ernie 4.5 21b-a3b claimed an English SimpleQA of 24.2, which initially got me excited because the previous best for this model size was ~10, but it turned out to have unusually low broad English knowledge for its size, indicative of an English SimpleQA score of ~3.
DeepSeek is competing with the closed-source models, not the smaller ones that another Chinese company already dominates.
@Hypersniper There's no Chinese company dominating smaller models. I'm assuming you're referring to Alibaba with their smaller Qwen models, which I assure you are not good. None of their test scores reflect their real-world abilities; plus, they grossly overfit a handful of domains and are otherwise generally incompetent.
Alibaba didn't dominate the small OS LLM ecosystem. They destroyed it. They were actually a generation behind when making the Qwen2.5 and Qwen3 series and didn't have any secret sauce. What they did was artificially boost test scores and grossly overfit select domains, forcing other companies to either do the same and give up their integrity, or appear noncompetitive.
This wouldn't have worked if the early-adopter community weren't composed almost entirely of narrowly focused coders who praised Alibaba instead of calling them out on their test-maxing and overfitting. Qwen3.5 4b simply does not have a legitimate MMLU-Pro score of 79, and Qwen3 235b certainly does not have a SimpleQA of 20, let alone 54.3. Again, Alibaba is a low-integrity company that destroyed the small OS LLM ecosystem.
OK. Can you name several small models that are similar to or better than Qwen 3.6 at similar sizes (like 27B, 35B-A3B, 100B), please? I'll start: Gemma 4 is similar.
@naplam Because most companies stopped making small OS models (Mistral, Llama, Cohere...), Gemma 4 is the only competitive model, and it's notably stronger than Qwen3.5. However, both grossly overfit select domains (e.g. they have orders of magnitude more science knowledge than pop-culture knowledge), making them near useless to the general population.
Anyways, Qwen3.5 burns through a lot more thinking tokens than Gemma 4, and if you look at the thinking traces, it's basically just throwing random things at the wall hoping something will stick. Plus, it's more likely to start infinitely looping, even with their recommended patch (a presence penalty of 1.5), which increases hallucinations: in a lot of circumstances the only functional token is one that was already used, so forcing a presence penalty of 1.5 increases the odds that a nonsense token, or at least a less-than-ideal one, will be selected, leading Qwen down a bad path.
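For context, here's a minimal sketch of how a presence penalty works (this is the standard OpenAI-style formulation; I'm assuming Qwen's implementation behaves the same way). A flat penalty is subtracted from the logit of every token that has already appeared in the output, so when the only correct continuation is a token that was already used, a penalty of 1.5 can push a worse token above it:

```python
import math

def apply_presence_penalty(logits, generated_ids, penalty):
    """Subtract `penalty` from the logit of every token id already generated."""
    seen = set(generated_ids)
    return [l - penalty if i in seen else l for i, l in enumerate(logits)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-token vocabulary: token 0 is the "correct" next token,
# token 1 is a plausible-but-wrong one, token 2 is nonsense.
logits = [2.0, 1.0, -1.0]
history = [0]  # token 0 was already used earlier in the output

no_penalty = softmax(logits)
with_penalty = softmax(apply_presence_penalty(logits, history, 1.5))

# With a 1.5 penalty, token 0's logit drops from 2.0 to 0.5, below
# token 1's 1.0, so the wrong token becomes the greedy pick.
assert no_penalty.index(max(no_penalty)) == 0
assert with_penalty.index(max(with_penalty)) == 1
```

This is why a large presence penalty trades looping for hallucination: it doesn't know which repeated tokens were the loop and which were simply required.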
Also, without thinking (and even with thinking), Qwen3.5 makes far more stupid blunders than Gemma 4, or for that matter earlier models like Gemma 3 and Llama 3, such as including the word you asked for synonyms of in the synonym list, or repeating words within the list. Another example is misidentifying grammar corrections it made, or didn't make.
Also, Qwen3.5 is more repetitive and writes notably inferior stories and poems, riddled with more egregious contradictions, in response to complex prompts that use inclusions and exclusions to force originality rather than simply parroting stories and poems from the training data.
Anyways, Qwen 3.5 is weaker than Gemma 4 across the board, including on complex image analysis tasks. On top of this, they overfit coding even more in Qwen3.6, causing its broad performance to collapse even further relative to Qwen 3.5. And the smaller models, like Qwen3.5 9B and 4B, are simply horrendous despite their ridiculously high test scores (e.g. Qwen3.5 4B's bullshit MMLU-Pro score of 79). Alibaba is making broadly weak, test-maxed, overfit smaller models that nobody can compete with on paper, and the early-adopter community is too biased and oblivious to call them out on it. Instead, it started criticizing other models for having lower test scores, including forcing Qwen's nonsense test scores onto the model pages of other companies until they stopped releasing them.
This is completely wrong. They've definitely improved its ability to fetch information using tools; the whole point of Qwen3.6 is not to try to store all the information, but to be really efficient about grabbing that information with tools. For me, Gemma 4 wants to use internal knowledge. Qwen3.6 is an amazing set of models, far better than any other model I've used, in every test; I never try to use these models based on their internal knowledge alone. Qwen3.6 35B is a way better model than this trash. Go ahead, use this to write little stupid poems; poems written by AI are always going to be garbage and soulless. Gemma 4 is better in certain chat scenarios, but I think that's just because it was probably trained on more English than Qwen3.6.
@curtis1969 Yes, Qwen 3.6 has been improved for things like tool use, and it has better long-context management than Gemma 4. But you're wrong to belittle all other tasks as "garbage soulless poems." Qwen 3.6 makes stupid errors across the board, even on simple things like synonym lists.
An AI model, especially one that hopes to achieve AGI, is all about responding appropriately to a broad spectrum of prompts from the full range of people in society. Tool calling is nothing more than a function call; it has relatively little to do with AI. Valuing it above broad functionality by belittling everything else ("stupid poems") is at the core of the problem.
Tool calling is like giving a person a calculator. It has almost nothing to do with the person or AI model. I understand people have specific things they want out of AI, but don't allow this to blind you to fundamental truths. Qwen 3.6 is an overall inferior AI model to Qwen 3.5 across most tasks, and Qwen 3.5 is overall far behind Gemma 4, as well as most other AI models. Simply put, Alibaba is trading general AI capabilities for gains in specific high precision tasks like function calling, making it more akin to a program or algorithm than an AI model.
Give it a local server link and tell it to make you an inventory management system, a character management system, and a money management system; it's able to make all the proper systems and then also use those systems, and then you can fully customize it any way you see fit. I mean, seriously, if you're sitting here reading poems, I don't even want to argue with you over AGI or the storytelling abilities of these models; all of them I've used so far are decent enough storytellers if you give them enough context. I even like to have it make me a set of questions that I can just click on, or have the ability to type my own. See, I want it to be able to make all of these things and then be able to use them, so I need it to code and chat and use the web. Have you used this template with Qwen 3.6: https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates ? Also, its front-end design is just amazing, and you can plug in different models. Gemma 4 is a general-purpose model; Qwen 3.6, as you can read from what they posted, is more for coding.
By not releasing smaller models for local hosting, they are incentivizing users to use their cloud API (DeepSeek AI). If the local models are too large, users might prefer the company's optimized cloud services, where they can pay per token.
@curtis1969 It seems like we're in agreement when it comes to G4 being a more general purpose model, and Q3.6 being better at coding.
However, technically G4 is still comparable at coding, and it even has a higher coding score on LMsys. But coders (I'm not one myself) have repeatedly claimed that on complex projects with large code bases, Qwen's more accurate context window, precise tool calling, etc. make it perform better.
But even then Qwen burns through a lot more tokens, starts repeating/looping more, makes a higher frequency of boneheaded mistakes, and is generally weaker.
G4 is still a really good model, no doubt. For me, G4 is better at pure logic, but it seems less proactive than Q3.6. I also agree that in general chat it tends to overthink; it treats everything like a test. That being said, sometimes I get really quick responses. If I'm reading its output and I see it start to say "wait," I usually delete the output, reword my prompt a little, and resend it, and most of the time that fixes whatever it was confused about; I just provide the missing information, which can be pretty aggravating. Also, time is a problem for these models: they want to treat web searches as future searches that aren't real, which can be very aggravating, especially when you give it the date in the system prompt.