# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Homepage | Chat | Hugging Face | Discord | WeChat | Twitter | License | Technical Report

## Introduction

We present a preview of the DeepSeek-V4 series: two strong Mixture-of-Experts (MoE) language models, DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated), both supporting a context length of one million tokens.

The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization:

- **Hybrid Attention Architecture**: We design a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to dramatically improve long-context efficiency. At a 1M-token context, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2.
- **Manifold-Constrained Hyper-Connections (mHC)**: We incorporate mHC to strengthen conventional residual connections, stabilizing signal propagation across layers while preserving model expressivity.
- **Muon Optimizer**: We employ the Muon optimizer for faster convergence and greater training stability.

We pre-train both models on more than 32T diverse, high-quality tokens, followed by a comprehensive post-training pipeline. Post-training follows a two-stage paradigm: domain-specific experts are first cultivated independently (through SFT and RL with GRPO), then consolidated into a single model via on-policy distillation, integrating distinct proficiencies across diverse domains.

DeepSeek-V4-Pro-Max, the maximum reasoning-effort mode of DeepSeek-V4-Pro, significantly advances the knowledge capabilities of open-source models, firmly establishing itself as the best open-source model available today. It achieves top-tier performance on coding benchmarks and significantly narrows the gap with leading closed-source models on reasoning and agentic tasks. DeepSeek-V4-Flash-Max, meanwhile, matches the Pro version's reasoning performance when given a larger thinking budget, though its smaller parameter scale leaves it slightly behind on pure knowledge tasks and the most complex agentic workflows.
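The sparsity of the two models can be sanity-checked from the stated parameter counts alone. A quick sketch (the 27% FLOPs / 10% KV-cache figures come from the comparison with DeepSeek-V3.2 and are not derived here):

```python
# Fraction of parameters activated per token, from the stated model sizes.
pro_total, pro_active = 1.6e12, 49e9      # DeepSeek-V4-Pro
flash_total, flash_active = 284e9, 13e9   # DeepSeek-V4-Flash

pro_ratio = pro_active / pro_total
flash_ratio = flash_active / flash_total

print(f"V4-Pro activates   {pro_ratio:.1%} of its parameters per token")   # 3.1%
print(f"V4-Flash activates {flash_ratio:.1%} of its parameters per token") # 4.6%
```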
## Model Downloads

| Model | # Total Params | # Activated Params | Context Length | Precision | Download |
|---|---|---|---|---|---|
| DeepSeek-V4-Flash-Base | 284B | 13B | 1M | FP8 Mixed | HuggingFace \| ModelScope |
| DeepSeek-V4-Flash | 284B | 13B | 1M | FP4 + FP8 Mixed* | HuggingFace \| ModelScope |
| DeepSeek-V4-Pro-Base | 1.6T | 49B | 1M | FP8 Mixed | HuggingFace \| ModelScope |
| DeepSeek-V4-Pro | 1.6T | 49B | 1M | FP4 + FP8 Mixed* | HuggingFace \| ModelScope |

\*FP4 + FP8 Mixed: MoE expert parameters use FP4 precision; most other parameters use FP8.
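As a rough guide to download size, the mixed-precision note above can be turned into a back-of-the-envelope estimate. The expert fraction below is a placeholder, not a published figure; FP4 is taken as 0.5 bytes/param and FP8 as 1 byte/param, ignoring quantization scales and metadata:

```python
def approx_checkpoint_gb(total_params: float, expert_fraction: float) -> float:
    """Estimate FP4 + FP8 mixed checkpoint size in decimal GB.

    expert_fraction is hypothetical: the share of parameters stored as
    FP4 MoE experts; everything else is assumed to be FP8.
    """
    fp4_bytes = total_params * expert_fraction * 0.5    # FP4: 4 bits/param
    fp8_bytes = total_params * (1.0 - expert_fraction)  # FP8: 1 byte/param
    return (fp4_bytes + fp8_bytes) / 1e9

# DeepSeek-V4-Flash (284B total), assuming ~90% of params sit in experts
print(f"~{approx_checkpoint_gb(284e9, 0.90):.0f} GB")  # ~156 GB
```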
## Evaluation Results

### Base Model

| Benchmark (Metric) | # Shots | DeepSeek-V3.2-Base | DeepSeek-V4-Flash-Base | DeepSeek-V4-Pro-Base |
|---|---|---|---|---|
| Architecture | - | MoE | MoE | MoE |
| # Activated Params | - | 37B | 13B | 49B |
| # Total Params | - | 671B | 284B | 1.6T |
| **World Knowledge** | | | | |
| AGIEval (EM) | 0-shot | 80.1 | 82.6 | 83.1 |
| MMLU (EM) | 5-shot | 87.8 | 88.7 | 90.1 |
| MMLU-Redux (EM) | 5-shot | 87.5 | 89.4 | 90.8 |
| MMLU-Pro (EM) | 5-shot | 65.5 | 68.3 | 73.5 |
| MMMLU (EM) | 5-shot | 87.9 | 88.8 | 90.3 |
| C-Eval (EM) | 5-shot | 90.4 | 92.1 | 93.1 |
| CMMLU (EM) | 5-shot | 88.9 | 90.4 | 90.8 |
| MultiLoKo (EM) | 5-shot | 38.7 | 42.2 | 51.1 |
| SimpleQA-Verified (EM) | 25-shot | 28.3 | 30.1 | 55.2 |
| SuperGPQA (EM) | 5-shot | 45.0 | 46.5 | 53.9 |
| FACTS Parametric (EM) | 25-shot | 27.1 | 33.9 | 62.6 |
| TriviaQA (EM) | 5-shot | 83.3 | 82.8 | 85.6 |
| **Language & Reasoning** | | | | |
| BBH (EM) | 3-shot | 87.6 | 86.9 | 87.5 |
| DROP (F1) | 1-shot | 88.2 | 88.6 | 88.7 |
| HellaSwag (EM) | 0-shot | 86.4 | 85.7 | 88.0 |
| WinoGrande (EM) | 0-shot | 78.9 | 79.5 | 81.5 |
| CLUEWSC (EM) | 5-shot | 83.5 | 82.2 | 85.2 |
| **Code & Math** | | | | |
| BigCodeBench (Pass@1) | 3-shot | 63.9 | 56.8 | 59.2 |
| HumanEval (Pass@1) | 0-shot | 62.8 | 69.5 | 76.8 |
| GSM8K (EM) | 8-shot | 91.1 | 90.8 | 92.6 |
| MATH (EM) | 4-shot | 60.5 | 57.4 | 64.5 |
| MGSM (EM) | 8-shot | 81.3 | 85.7 | 84.4 |
| CMath (EM) | 3-shot | 92.6 | 93.6 | 90.9 |
| **Long Context** | | | | |
| LongBench-V2 (EM) | 1-shot | 40.2 | 44.7 | 51.5 |
### Instruct Model

DeepSeek-V4-Pro and DeepSeek-V4-Flash both support three reasoning-effort modes:

| Reasoning Mode | Characteristics | Typical Use Cases | Response Format |
|---|---|---|---|
| Non-think | Fast, intuitive responses | Routine daily tasks, low-risk decisions | `</think> summary` |
| Think High | Conscious logical analysis, slower but more accurate | Complex problem-solving, planning | `<think> thinking </think> summary` |
| Think Max | Push reasoning to its fullest extent | Exploring the boundary of model reasoning capability | Special system prompt + `<think> thinking </think> summary` |
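The response formats above share one shape: optional reasoning wrapped in `<think> ... </think>`, followed by the summary. The official parser lives in the repo's `encoding` folder; the splitter below is only an illustrative sketch of that shape:

```python
import re

def split_think(completion: str):
    """Split a completion into (thinking, summary).

    Illustrative only; handles both the Non-think format, where the model
    emits a bare closing `</think>`, and the full `<think> ... </think>`
    format of the Think modes.
    """
    m = re.match(r"(?:<think>(.*?))?</think>(.*)", completion, re.DOTALL)
    if m is None:
        return None, completion.strip()  # no think tags found
    thinking = m.group(1).strip() if m.group(1) else None
    return thinking, m.group(2).strip()

print(split_think("<think>2 is even</think>1+1=2"))  # ('2 is even', '1+1=2')
print(split_think("</think>Hello!"))                 # (None, 'Hello!')
```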
#### DeepSeek-V4-Pro-Max vs Frontier Models

| Benchmark (Metric) | Opus-4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High | K2.6 Thinking | GLM-5.1 Thinking | DS-V4-Pro Max |
|---|---|---|---|---|---|---|
| **Knowledge & Reasoning** | | | | | | |
| MMLU-Pro (EM) | 89.1 | 87.5 | 91.0 | 87.1 | 86.0 | 87.5 |
| SimpleQA-Verified (Pass@1) | 46.2 | 45.3 | 75.6 | 36.9 | 38.1 | 57.9 |
| Chinese-SimpleQA (Pass@1) | 76.4 | 76.8 | 85.9 | 75.9 | 75.0 | 84.4 |
| GPQA Diamond (Pass@1) | 91.3 | 93.0 | 94.3 | 90.5 | 86.2 | 90.1 |
| HLE (Pass@1) | 40.0 | 39.8 | 44.4 | 36.4 | 34.7 | 37.7 |
| LiveCodeBench (Pass@1) | 88.8 | - | 91.7 | 89.6 | - | 93.5 |
| Codeforces (Rating) | - | 3168 | 3052 | - | - | 3206 |
| HMMT 2026 Feb (Pass@1) | 96.2 | 97.7 | 94.7 | 92.7 | 89.4 | 95.2 |
| IMOAnswerBench (Pass@1) | 75.3 | 91.4 | 81.0 | 86.0 | 83.8 | 89.8 |
| Apex (Pass@1) | 34.5 | 54.1 | 60.9 | 24.0 | 11.5 | 38.3 |
| Apex Shortlist (Pass@1) | 85.9 | 78.1 | 89.1 | 75.5 | 72.4 | 90.2 |
| **Long Context** | | | | | | |
| MRCR 1M (MMR) | 92.9 | - | 76.3 | - | - | 83.5 |
| CorpusQA 1M (ACC) | 71.7 | - | 53.8 | - | - | 62.0 |
| **Agentic** | | | | | | |
| Terminal Bench 2.0 (Acc) | 65.4 | 75.1 | 68.5 | 66.7 | 63.5 | 67.9 |
| SWE Verified (Resolved) | 80.8 | - | 80.6 | 80.2 | - | 80.6 |
| SWE Pro (Resolved) | 57.3 | 57.7 | 54.2 | 58.6 | 58.4 | 55.4 |
| SWE Multilingual (Resolved) | 77.5 | - | - | 76.7 | 73.3 | 76.2 |
| BrowseComp (Pass@1) | 83.7 | 82.7 | 85.9 | 83.2 | 79.3 | 83.4 |
| HLE w/ tools (Pass@1) | 53.1 | 52.0 | 51.6 | 54.0 | 50.4 | 48.2 |
| GDPval-AA (Elo) | 1619 | 1674 | 1314 | 1482 | 1535 | 1554 |
| MCPAtlas Public (Pass@1) | 73.8 | 67.2 | 69.2 | 66.6 | 71.8 | 73.6 |
| Toolathlon (Pass@1) | 47.2 | 54.6 | 48.8 | 50.0 | 40.7 | 51.8 |
#### Comparison across Modes

| Benchmark (Metric) | V4-Flash Non-Think | V4-Flash High | V4-Flash Max | V4-Pro Non-Think | V4-Pro High | V4-Pro Max |
|---|---|---|---|---|---|---|
| **Knowledge & Reasoning** | | | | | | |
| MMLU-Pro (EM) | 83.0 | 86.4 | 86.2 | 82.9 | 87.1 | 87.5 |
| SimpleQA-Verified (Pass@1) | 23.1 | 28.9 | 34.1 | 45.0 | 46.2 | 57.9 |
| Chinese-SimpleQA (Pass@1) | 71.5 | 73.2 | 78.9 | 75.8 | 77.7 | 84.4 |
| GPQA Diamond (Pass@1) | 71.2 | 87.4 | 88.1 | 72.9 | 89.1 | 90.1 |
| HLE (Pass@1) | 8.1 | 29.4 | 34.8 | 7.7 | 34.5 | 37.7 |
| LiveCodeBench (Pass@1) | 55.2 | 88.4 | 91.6 | 56.8 | 89.8 | 93.5 |
| Codeforces (Rating) | - | 2816 | 3052 | - | 2919 | 3206 |
| HMMT 2026 Feb (Pass@1) | 40.8 | 91.9 | 94.8 | 31.7 | 94.0 | 95.2 |
| IMOAnswerBench (Pass@1) | 41.9 | 85.1 | 88.4 | 35.3 | 88.0 | 89.8 |
| Apex (Pass@1) | 1.0 | 19.1 | 33.0 | 0.4 | 27.4 | 38.3 |
| Apex Shortlist (Pass@1) | 9.3 | 72.1 | 85.7 | 9.2 | 85.5 | 90.2 |
| **Long Context** | | | | | | |
| MRCR 1M (MMR) | 37.5 | 76.9 | 78.7 | 44.7 | 83.3 | 83.5 |
| CorpusQA 1M (ACC) | 15.5 | 59.3 | 60.5 | 35.6 | 56.5 | 62.0 |
| **Agentic** | | | | | | |
| Terminal Bench 2.0 (Acc) | 49.1 | 56.6 | 56.9 | 59.1 | 63.3 | 67.9 |
| SWE Verified (Resolved) | 73.7 | 78.6 | 79.0 | 73.6 | 79.4 | 80.6 |
| SWE Pro (Resolved) | 49.1 | 52.3 | 52.6 | 52.1 | 54.4 | 55.4 |
| SWE Multilingual (Resolved) | 69.7 | 70.2 | 73.3 | 69.8 | 74.1 | 76.2 |
| BrowseComp (Pass@1) | - | 53.5 | 73.2 | - | 80.4 | 83.4 |
| HLE w/ tools (Pass@1) | - | 40.3 | 45.1 | - | 44.7 | 48.2 |
| MCPAtlas (Pass@1) | 64.0 | 67.4 | 69.0 | 69.4 | 74.2 | 73.6 |
| GDPval-AA (Elo) | - | - | 1395 | - | - | 1554 |
| Toolathlon (Pass@1) | 40.7 | 43.5 | 47.8 | 46.3 | 49.0 | 51.8 |
## Chat Template

This release does not include a Jinja chat template. Instead, we provide a dedicated `encoding` folder with Python scripts and test cases demonstrating how to encode messages in OpenAI-compatible format into input strings for the model, and how to parse the model's text output. Please refer to the `encoding` folder for full documentation.

A brief example:

```python
import transformers

from encoding_dsv4 import encode_messages, parse_message_from_completion_text

messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."},
    {"role": "user", "content": "1+1=?"},
]

# messages -> string
prompt = encode_messages(messages, thinking_mode="thinking")

# string -> tokens
tokenizer = transformers.AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Pro")
tokens = tokenizer.encode(prompt)
```
## How to Run Locally

Please refer to the `inference` folder for detailed instructions on running DeepSeek-V4 locally, including model weight conversion and interactive chat demos.

For local deployment, we recommend the sampling parameters `temperature = 1.0` and `top_p = 1.0`. For the Think Max reasoning mode, we recommend a context window of at least 384K tokens.
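When serving through an OpenAI-compatible endpoint (as most local inference engines expose), the recommended sampling parameters map directly onto the request body. A minimal sketch; the URL and model name are placeholders for whatever your deployment uses:

```python
import json
import urllib.request

# Request body with the recommended sampling parameters.
payload = {
    "model": "deepseek-v4-pro",   # placeholder: use your served model name
    "messages": [{"role": "user", "content": "1+1=?"}],
    "temperature": 1.0,           # recommended for local deployment
    "top_p": 1.0,                 # recommended for local deployment
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # placeholder endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# reply = json.load(urllib.request.urlopen(req))  # uncomment with a live server
```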
## License

This repository and the model weights are licensed under the MIT License.
## Citation

```bibtex
@misc{deepseekai2026deepseekv4,
      title={DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
      author={DeepSeek-AI},
      year={2026},
}
```
160
+ Contact
161
+ If you have any questions, please raise an issue or contact us at service@deepseek.com.