# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Homepage | Chat | Hugging Face | Discord | WeChat | Twitter | License | Technical Report

## Introduction

We present a preview version of the DeepSeek-V4 series, comprising two strong Mixture-of-Experts (MoE) language models: DeepSeek-V4-Pro with 1.6T total parameters (49B activated) and DeepSeek-V4-Flash with 284B total parameters (13B activated). Both support a context length of one million tokens.

The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization:

- **Hybrid Attention Architecture**: We design a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to dramatically improve long-context efficiency. In the 1M-token context setting, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared with DeepSeek-V3.2.
- **Manifold-Constrained Hyper-Connections (mHC)**: We incorporate mHC to strengthen conventional residual connections, enhancing the stability of signal propagation across layers while preserving model expressivity.
- **Muon Optimizer**: We employ the Muon optimizer for faster convergence and greater training stability.

We pre-train both models on more than 32T diverse, high-quality tokens, followed by a comprehensive post-training pipeline. Post-training follows a two-stage paradigm: domain-specific experts are first cultivated independently (through SFT and RL with GRPO), and then consolidated into a single model via on-policy distillation, integrating their distinct proficiencies across diverse domains.

DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, significantly advances the knowledge capabilities of open-source models, firmly establishing itself as the best open-source model available today. It achieves top-tier performance on coding benchmarks and substantially narrows the gap with leading closed-source models on reasoning and agentic tasks. Meanwhile, DeepSeek-V4-Flash-Max achieves reasoning performance comparable to the Pro version when given a larger thinking budget, though its smaller parameter scale naturally places it slightly behind on pure knowledge tasks and the most complex agentic workflows.

## Model Downloads

| Model | #Total Params | #Activated Params | Context Length | Precision | Download |
| :-- | :-: | :-: | :-: | :-: | :-- |
| DeepSeek-V4-Flash-Base | 284B | 13B | 1M | FP8 Mixed | HuggingFace \| ModelScope |
| DeepSeek-V4-Flash | 284B | 13B | 1M | FP4 + FP8 Mixed\* | HuggingFace \| ModelScope |
| DeepSeek-V4-Pro-Base | 1.6T | 49B | 1M | FP8 Mixed | HuggingFace \| ModelScope |
| DeepSeek-V4-Pro | 1.6T | 49B | 1M | FP4 + FP8 Mixed\* | HuggingFace \| ModelScope |

\*FP4 + FP8 Mixed: MoE expert parameters use FP4 precision; most other parameters use FP8.

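The checkpoints can be pulled programmatically from either hub. Below is a minimal download sketch using `huggingface_hub`; the repository name is an assumption based on the table above, so substitute the variant you need.

```python
# Minimal download sketch. The repo_id is an assumption following the table
# above; replace it with the Pro/Flash or Base variant you want.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="deepseek-ai/DeepSeek-V4-Flash")
print(f"Weights downloaded to {local_dir}")
```
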
## Evaluation Results

### Base Model

| Benchmark (Metric) | # Shots | DeepSeek-V3.2-Base | DeepSeek-V4-Flash-Base | DeepSeek-V4-Pro-Base |
| :-- | :-: | :-: | :-: | :-: |
| Architecture | - | MoE | MoE | MoE |
| # Activated Params | - | 37B | 13B | 49B |
| # Total Params | - | 671B | 284B | 1.6T |
| **World Knowledge** | | | | |
| AGIEval (EM) | 0-shot | 80.1 | 82.6 | 83.1 |
| MMLU (EM) | 5-shot | 87.8 | 88.7 | 90.1 |
| MMLU-Redux (EM) | 5-shot | 87.5 | 89.4 | 90.8 |
| MMLU-Pro (EM) | 5-shot | 65.5 | 68.3 | 73.5 |
| MMMLU (EM) | 5-shot | 87.9 | 88.8 | 90.3 |
| C-Eval (EM) | 5-shot | 90.4 | 92.1 | 93.1 |
| CMMLU (EM) | 5-shot | 88.9 | 90.4 | 90.8 |
| MultiLoKo (EM) | 5-shot | 38.7 | 42.2 | 51.1 |
| Simple-QA verified (EM) | 25-shot | 28.3 | 30.1 | 55.2 |
| SuperGPQA (EM) | 5-shot | 45.0 | 46.5 | 53.9 |
| FACTS Parametric (EM) | 25-shot | 27.1 | 33.9 | 62.6 |
| TriviaQA (EM) | 5-shot | 83.3 | 82.8 | 85.6 |
| **Language & Reasoning** | | | | |
| BBH (EM) | 3-shot | 87.6 | 86.9 | 87.5 |
| DROP (F1) | 1-shot | 88.2 | 88.6 | 88.7 |
| HellaSwag (EM) | 0-shot | 86.4 | 85.7 | 88.0 |
| WinoGrande (EM) | 0-shot | 78.9 | 79.5 | 81.5 |
| CLUEWSC (EM) | 5-shot | 83.5 | 82.2 | 85.2 |
| **Code & Math** | | | | |
| BigCodeBench (Pass@1) | 3-shot | 63.9 | 56.8 | 59.2 |
| HumanEval (Pass@1) | 0-shot | 62.8 | 69.5 | 76.8 |
| GSM8K (EM) | 8-shot | 91.1 | 90.8 | 92.6 |
| MATH (EM) | 4-shot | 60.5 | 57.4 | 64.5 |
| MGSM (EM) | 8-shot | 81.3 | 85.7 | 84.4 |
| CMath (EM) | 3-shot | 92.6 | 93.6 | 90.9 |
| **Long Context** | | | | |
| LongBench-V2 (EM) | 1-shot | 40.2 | 44.7 | 51.5 |

### Instruct Model

DeepSeek-V4-Pro and DeepSeek-V4-Flash both support three reasoning effort modes:

| Reasoning Mode | Characteristics | Typical Use Cases | Response Format |
| :-- | :-- | :-- | :-- |
| Non-think | Fast, intuitive responses | Routine daily tasks, low-risk decisions | `</think> summary` |
| Think High | Conscious logical analysis, slower but more accurate | Complex problem-solving, planning | `<think> thinking </think> summary` |
| Think Max | Push reasoning to its fullest extent | Exploring the boundary of model reasoning capability | Special system prompt + `<think> thinking </think> summary` |

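As a rough illustration of the response formats above, the snippet below splits a raw completion into its thinking and summary parts. It is a minimal sketch based only on this table; the official parsing utilities in the `encoding` folder (see the Chat Template section) should be preferred.

```python
def split_reasoning(completion: str) -> tuple[str, str]:
    """Illustrative split of a raw completion into (thinking, summary).

    Assumes the formats in the table above: thinking modes emit
    "<think> ... </think> summary"; non-think mode emits "</think> summary".
    """
    thinking, sep, summary = completion.partition("</think>")
    if not sep:  # no delimiter found; treat everything as summary
        return "", completion.strip()
    return thinking.replace("<think>", "", 1).strip(), summary.strip()

print(split_reasoning("</think> 2"))                 # ('', '2')       -- Non-think
print(split_reasoning("<think> 1+1=2 </think> 2"))   # ('1+1=2', '2')  -- Think modes
```
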
#### DeepSeek-V4-Pro-Max vs. Frontier Models

| Benchmark (Metric) | Opus-4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High | K2.6 Thinking | GLM-5.1 Thinking | DS-V4-Pro Max |
| :-- | :-: | :-: | :-: | :-: | :-: | :-: |
| **Knowledge & Reasoning** | | | | | | |
| MMLU-Pro (EM) | 89.1 | 87.5 | 91.0 | 87.1 | 86.0 | 87.5 |
| SimpleQA-Verified (Pass@1) | 46.2 | 45.3 | 75.6 | 36.9 | 38.1 | 57.9 |
| Chinese-SimpleQA (Pass@1) | 76.4 | 76.8 | 85.9 | 75.9 | 75.0 | 84.4 |
| GPQA Diamond (Pass@1) | 91.3 | 93.0 | 94.3 | 90.5 | 86.2 | 90.1 |
| HLE (Pass@1) | 40.0 | 39.8 | 44.4 | 36.4 | 34.7 | 37.7 |
| LiveCodeBench (Pass@1) | 88.8 | - | 91.7 | 89.6 | - | 93.5 |
| Codeforces (Rating) | - | 3168 | 3052 | - | - | 3206 |
| HMMT 2026 Feb (Pass@1) | 96.2 | 97.7 | 94.7 | 92.7 | 89.4 | 95.2 |
| IMOAnswerBench (Pass@1) | 75.3 | 91.4 | 81.0 | 86.0 | 83.8 | 89.8 |
| Apex (Pass@1) | 34.5 | 54.1 | 60.9 | 24.0 | 11.5 | 38.3 |
| Apex Shortlist (Pass@1) | 85.9 | 78.1 | 89.1 | 75.5 | 72.4 | 90.2 |
| **Long Context** | | | | | | |
| MRCR 1M (MMR) | 92.9 | - | 76.3 | - | - | 83.5 |
| CorpusQA 1M (ACC) | 71.7 | - | 53.8 | - | - | 62.0 |
| **Agentic** | | | | | | |
| Terminal Bench 2.0 (Acc) | 65.4 | 75.1 | 68.5 | 66.7 | 63.5 | 67.9 |
| SWE Verified (Resolved) | 80.8 | - | 80.6 | 80.2 | - | 80.6 |
| SWE Pro (Resolved) | 57.3 | 57.7 | 54.2 | 58.6 | 58.4 | 55.4 |
| SWE Multilingual (Resolved) | 77.5 | - | - | 76.7 | 73.3 | 76.2 |
| BrowseComp (Pass@1) | 83.7 | 82.7 | 85.9 | 83.2 | 79.3 | 83.4 |
| HLE w/ tools (Pass@1) | 53.1 | 52.0 | 51.6 | 54.0 | 50.4 | 48.2 |
| GDPval-AA (Elo) | 1619 | 1674 | 1314 | 1482 | 1535 | 1554 |
| MCPAtlas Public (Pass@1) | 73.8 | 67.2 | 69.2 | 66.6 | 71.8 | 73.6 |
| Toolathlon (Pass@1) | 47.2 | 54.6 | 48.8 | 50.0 | 40.7 | 51.8 |

#### Comparison across Modes

| Benchmark (Metric) | V4-Flash Non-Think | V4-Flash High | V4-Flash Max | V4-Pro Non-Think | V4-Pro High | V4-Pro Max |
| :-- | :-: | :-: | :-: | :-: | :-: | :-: |
| **Knowledge & Reasoning** | | | | | | |
| MMLU-Pro (EM) | 83.0 | 86.4 | 86.2 | 82.9 | 87.1 | 87.5 |
| SimpleQA-Verified (Pass@1) | 23.1 | 28.9 | 34.1 | 45.0 | 46.2 | 57.9 |
| Chinese-SimpleQA (Pass@1) | 71.5 | 73.2 | 78.9 | 75.8 | 77.7 | 84.4 |
| GPQA Diamond (Pass@1) | 71.2 | 87.4 | 88.1 | 72.9 | 89.1 | 90.1 |
| HLE (Pass@1) | 8.1 | 29.4 | 34.8 | 7.7 | 34.5 | 37.7 |
| LiveCodeBench (Pass@1) | 55.2 | 88.4 | 91.6 | 56.8 | 89.8 | 93.5 |
| Codeforces (Rating) | - | 2816 | 3052 | - | 2919 | 3206 |
| HMMT 2026 Feb (Pass@1) | 40.8 | 91.9 | 94.8 | 31.7 | 94.0 | 95.2 |
| IMOAnswerBench (Pass@1) | 41.9 | 85.1 | 88.4 | 35.3 | 88.0 | 89.8 |
| Apex (Pass@1) | 1.0 | 19.1 | 33.0 | 0.4 | 27.4 | 38.3 |
| Apex Shortlist (Pass@1) | 9.3 | 72.1 | 85.7 | 9.2 | 85.5 | 90.2 |
| **Long Context** | | | | | | |
| MRCR 1M (MMR) | 37.5 | 76.9 | 78.7 | 44.7 | 83.3 | 83.5 |
| CorpusQA 1M (ACC) | 15.5 | 59.3 | 60.5 | 35.6 | 56.5 | 62.0 |
| **Agentic** | | | | | | |
| Terminal Bench 2.0 (Acc) | 49.1 | 56.6 | 56.9 | 59.1 | 63.3 | 67.9 |
| SWE Verified (Resolved) | 73.7 | 78.6 | 79.0 | 73.6 | 79.4 | 80.6 |
| SWE Pro (Resolved) | 49.1 | 52.3 | 52.6 | 52.1 | 54.4 | 55.4 |
| SWE Multilingual (Resolved) | 69.7 | 70.2 | 73.3 | 69.8 | 74.1 | 76.2 |
| BrowseComp (Pass@1) | - | 53.5 | 73.2 | - | 80.4 | 83.4 |
| HLE w/ tools (Pass@1) | - | 40.3 | 45.1 | - | 44.7 | 48.2 |
| MCPAtlas (Pass@1) | 64.0 | 67.4 | 69.0 | 69.4 | 74.2 | 73.6 |
| GDPval-AA (Elo) | - | - | 1395 | - | - | 1554 |
| Toolathlon (Pass@1) | 40.7 | 43.5 | 47.8 | 46.3 | 49.0 | 51.8 |

## Chat Template

This release does not include a Jinja-format chat template. Instead, we provide a dedicated `encoding` folder with Python scripts and test cases demonstrating how to encode messages in the OpenAI-compatible format into input strings for the model, and how to parse the model's text output. Please refer to the `encoding` folder for full documentation.

A brief example:

```python
from encoding_dsv4 import encode_messages, parse_message_from_completion_text

messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."},
    {"role": "user", "content": "1+1=?"},
]

# messages -> string
prompt = encode_messages(messages, thinking_mode="thinking")

# string -> tokens
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Pro")
tokens = tokenizer.encode(prompt)
```

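The imported `parse_message_from_completion_text` covers the reverse direction, turning the model's raw completion text back into a structured message. Its exact signature and return format are documented in the `encoding` folder; the call below is only a sketch of the intended round trip.

```python
# Hypothetical round-trip sketch: the exact signature and return structure of
# parse_message_from_completion_text are defined in the encoding folder.
completion_text = "<think> 1 plus 1 equals 2. </think> 1+1 = 2"
message = parse_message_from_completion_text(completion_text)
# Expected to resemble:
# {"role": "assistant", "content": "1+1 = 2", "reasoning_content": "1 plus 1 equals 2."}
```
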
## How to Run Locally

Please refer to the `inference` folder for detailed instructions on running DeepSeek-V4 locally, including model weight conversion and interactive chat demos.

For local deployment, we recommend setting the sampling parameters to `temperature = 1.0` and `top_p = 1.0`. For the Think Max reasoning mode, we recommend setting the context window to at least 384K tokens.

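If the model is served behind an OpenAI-compatible endpoint (as most local inference servers provide), the recommended sampling settings can be passed per request. The sketch below assumes a placeholder local URL and model name.

```python
# Hedged example: base_url, api_key, and model name are placeholders for
# whatever your local inference server exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "1+1=?"}],
    temperature=1.0,  # recommended sampling temperature
    top_p=1.0,        # recommended top_p
)
print(response.choices[0].message.content)
```
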
## License

This repository and the model weights are licensed under the MIT License.

## Citation

```bibtex
@misc{deepseekai2026deepseekv4,
      title={DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
      author={DeepSeek-AI},
      year={2026},
}
```

## Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.