Technical Report Summary
DeepSeek-V4
Towards Highly Efficient Million-Token Context Intelligence
Preview Release · DeepSeek-AI · Model checkpoints
You can find the full technical report here.
TL;DR — Two new open MoE models (V4-Pro 1.6T/49B-active, V4-Flash 284B/13B-active) that run native 1M-token context at a fraction of the compute and memory of DeepSeek-V3.2. The big contributions are architectural: a hybrid CSA + HCA attention stack, Manifold-Constrained Hyper-Connections (mHC), and the Muon optimizer at 1.6T scale.
| Metric | Value |
|---|---|
| V4-Pro FLOPs vs V3.2 @ 1M ctx | 27% |
| V4-Pro KV cache vs V3.2 | 10% |
| V4-Flash FLOPs / KV cache vs V3.2 | 10% / 7% |
| Pre-training tokens | 32–33T |
Why it matters
- Million-token context at practical cost. The quadratic-attention wall is the real bottleneck on test-time scaling — dropping 1M-context inference to ~10% of prior KV-cache footprint makes long-horizon agentic and multi-document workloads economically routine rather than prohibitive.
- Frontier-class open weights. V4-Pro-Max beats GPT-5.2 and Gemini-3.0-Pro on reasoning, trails GPT-5.4 / Gemini-3.1-Pro by ~3–6 months, and on internal agent evals surpasses Claude Sonnet 4.5 while approaching Opus 4.5.
- Unlocks the next scaling regime. The authors frame this explicitly: efficient ultra-long sequences are the foundation for further test-time scaling and for future paradigms like online learning.
Figure 1 — Left: V4-Pro-Max vs Claude-Opus-4.6, GPT-5.4, Gemini-3.1 on knowledge, reasoning, and agentic benchmarks. Right: single-token inference FLOPs and accumulated KV-cache size vs DeepSeek-V3.2 out to 1M tokens.
What is actually new
1. Hybrid attention: CSA + HCA
The headline architectural change. Two complementary attention variants are interleaved:
- Compressed Sparse Attention (CSA) — compresses every m KV tokens into one entry via learned compression weights, then applies DeepSeek Sparse Attention (top-k selection via a Lightning Indexer) plus a sliding window for local detail (a minimal sketch follows Figure 3).
- Heavily Compressed Attention (HCA) — same compression idea but far more aggressive (m′ ≫ m) with dense attention over the compressed stream. Interleaving CSA/HCA layers is what makes 1M context tractable.
Figure 3 — CSA compresses KV entries m-to-1, then uses a Lightning Indexer to top-k select compressed blocks; a sliding window preserves local fine-grained dependencies.
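To make the shape of the computation concrete, here is a minimal single-head, single-decode-step sketch of the CSA path. Everything specific in it is an assumption made for illustration: the mean pool standing in for the learned compression weights, the plain dot product standing in for the Lightning Indexer, the additive combination of the sparse and sliding-window outputs, and all names and default values.

```python
import torch
import torch.nn.functional as F

def csa_decode_step(q, k, v, m=16, top_k=4, window=128):
    """Illustrative CSA for one decode step: q is the newest token's query [d],
    k and v are the cached keys/values [T, d]."""
    T, d = k.shape
    n_blocks = T // m                      # ragged tail ignored for simplicity

    # 1) Compress every m KV tokens into one entry (learned weights in the
    #    report; a mean pool here).
    k_c = k[: n_blocks * m].view(n_blocks, m, d).mean(dim=1)   # [T/m, d]
    v_c = v[: n_blocks * m].view(n_blocks, m, d).mean(dim=1)

    # 2) Indexer-style scoring selects the top-k compressed blocks for this query.
    sel = (k_c @ q).topk(min(top_k, n_blocks)).indices

    # 3) Sparse attention over only the selected compressed entries.
    att = F.softmax(k_c[sel] @ q / d**0.5, dim=-1)
    out_sparse = att @ v_c[sel]

    # 4) Sliding window over the last `window` raw tokens preserves local detail.
    att_w = F.softmax(k[-window:] @ q / d**0.5, dim=-1)
    out_local = att_w @ v[-window:]

    return out_sparse + out_local
```

In the same picture, an HCA layer would compress far more aggressively (m′ ≫ m) and attend densely over the whole compressed stream instead of steps 2–3; interleaving the two layer types is what makes 1M-token context tractable.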
2. Manifold-Constrained Hyper-Connections (mHC)
Upgrades residual connections by projecting the residual mapping matrix B_l onto the Birkhoff polytope (doubly stochastic matrices) via Sinkhorn-Knopp iteration. Because a doubly stochastic matrix is a convex combination of permutation matrices, its spectral norm is at most 1, so the residual mixing stays non-expansive. This fixes the numerical instability that prevented stacking plain Hyper-Connections deeply, and is a real contribution to residual-stream design.
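For intuition, a minimal sketch of the projection step, assuming the standard Sinkhorn-Knopp alternating normalization; the exponential parameterization, iteration count, and function names are illustrative rather than taken from the report.

```python
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Map an unconstrained square matrix toward the Birkhoff polytope
    (non-negative entries, every row and column summing to 1) by
    exponentiating and then alternately normalizing rows and columns."""
    b = torch.exp(logits)
    for _ in range(n_iters):
        b = b / b.sum(dim=-1, keepdim=True)   # rows sum to 1
        b = b / b.sum(dim=-2, keepdim=True)   # columns sum to 1
    return b

# Doubly stochastic matrices are convex combinations of permutation matrices,
# so the spectral norm of the projected B_l is at most 1 (up to iteration error).
B = sinkhorn_knopp(torch.randn(4, 4))
print(torch.linalg.matrix_norm(B, ord=2))
```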
3. Muon optimizer at 1.6T scale
First deployment of Muon on a trillion-plus-parameter MoE, paired with a custom hybrid ZeRO sharding strategy. The report claims faster convergence and better stability than AdamW-class baselines at this scale.
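The Muon update itself is public (the report's contribution is running it at this scale with hybrid ZeRO); for reference, a single-device sketch of the matrix-parameter update, with the Newton-Schulz coefficients taken from the open-source Muon reference implementation and everything else (names, defaults) illustrative:

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately replace a 2-D update matrix with its nearest semi-orthogonal
    matrix (the U V^T of its SVD) using a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the public Muon reference
    x = g / (g.norm() + 1e-7)
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

@torch.no_grad()
def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One Muon step for a matrix parameter: momentum, orthogonalize, descend."""
    momentum.mul_(beta).add_(grad)
    param.add_(newton_schulz_orthogonalize(momentum), alpha=-lr)
```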
4. FP4 quantization-aware training
Not just inference quantization — FP4 QAT is applied to MoE expert weights and the indexer QK path during training itself. FP4×FP8 GEMMs can be up to ⅓ faster on future hardware.
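The generic pattern behind QAT is fake quantization with a straight-through estimator: the forward pass sees FP4-snapped weights, the backward pass treats the rounding as identity. The sketch below assumes an E2M1 value grid and per-tensor absmax scaling; the report's actual scaling granularity and rounding scheme are not specified in this summary.

```python
import torch

# Representable magnitudes of an E2M1 (FP4) format, mirrored for sign.
FP4_POS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = torch.cat([-FP4_POS.flip(0), FP4_POS])

def fake_quant_fp4(w: torch.Tensor) -> torch.Tensor:
    """QAT-style fake quantization: snap to the FP4 grid in the forward pass,
    pass gradients straight through (STE) in the backward pass."""
    scale = w.abs().amax() / FP4_POS.max() + 1e-12          # per-tensor absmax scaling
    idx = ((w / scale).unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    w_q = FP4_GRID[idx] * scale
    return w + (w_q - w).detach()                            # STE: forward w_q, backward identity
```

In training, a layer would use the fake-quantized weights in its matmul (e.g. `x @ fake_quant_fp4(W_expert)`), so the model learns weights that survive the FP4 rounding it will see at inference.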
Figure 2 — Overall V4 block. Hybrid CSA/HCA attention, DeepSeekMoE feed-forward, mHC-strengthened residual connections, and MTP heads at the output.
5. Infrastructure firsts
- A single fused MoE kernel that overlaps compute, communication, and memory access.
- TileLang — a DSL that balances kernel-development productivity against efficiency.
- Batch-invariant deterministic kernel library — bitwise reproducibility across training and inference.
- Two-stage contextual parallelism for compressed attention.
- Heterogeneous KV-cache structure with on-disk storage for shared-prefix reuse.
6. Post-training: specialist-then-distill
Train independent SFT + GRPO experts per domain (math, code, agent, instruction-following), then consolidate into one model via on-policy distillation with reverse-KL loss. A cleaner formulation of the "many specialists → one generalist" recipe.
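The consolidation loss is easy to state: sample sequences from the student (on-policy), score them with both student and teacher, and minimize KL(student ‖ teacher) per token. A minimal sketch assuming pre-computed logits; the sampling loop, batching, and any sequence-level weighting are omitted and not taken from the report.

```python
import torch
import torch.nn.functional as F

def reverse_kl_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level reverse KL, KL(student || teacher), averaged over positions.
    Logits are [batch, seq, vocab], computed on student-sampled sequences."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    # sum_v p_s(v) * (log p_s(v) - log p_t(v)) at each position
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)
    return kl.mean()
```

Reverse KL is mode-seeking: the student is pushed to concentrate on behavior the teacher rates highly rather than to cover the teacher's full distribution, which is the usual argument for using it when merging specialists into one generalist.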
7. Architectural housekeeping worth noting
- MoE affinity scoring switched from Sigmoid to Sqrt(Softplus) (a routing sketch follows this list).
- Removed the cap on routing target nodes in the MoE.
- Hash-routed MoE replaces dense FFNs in the earliest transformer blocks.
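A tiny sketch of the affinity change; the top-k routing wrapper, the value of k, and the gate normalization are assumptions for illustration, not the report's router.

```python
import torch
import torch.nn.functional as F

def expert_affinity(router_logits: torch.Tensor) -> torch.Tensor:
    """V4-style affinity: sqrt(softplus(logit)) replaces the sigmoid used by
    earlier DeepSeekMoE routers; scores stay positive but are no longer
    bounded above by 1."""
    return torch.sqrt(F.softplus(router_logits))

def route(router_logits: torch.Tensor, k: int = 8):
    """Hypothetical top-k routing over the affinities ([tokens, experts])."""
    scores = expert_affinity(router_logits)
    gate, idx = scores.topk(k, dim=-1)
    return gate / gate.sum(dim=-1, keepdim=True), idx
```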
Bottom line
What matters most in this paper is the efficiency-per-context-token curve, not raw capability: V4 makes 1M-token reasoning economically viable on open weights. The novelty concentrates in two places:
- The CSA/HCA hybrid attention scheme for ultra-long contexts.
- mHC's manifold constraint stabilizing deep residual architectures.
The remainder is strong engineering consolidation: Muon at scale, FP4-QAT, deterministic kernels, on-policy distillation. Capability-wise V4-Pro-Max appears to land ~3–6 months behind the leading closed frontier models, while leading the open-weights pack on agentic and long-context tasks.


