Technical Report Summary

#129
by mishig

DeepSeek-V4

Towards Highly Efficient Million-Token Context Intelligence

Preview Release · DeepSeek-AI · Model checkpoints

You can find the full technical report here.


TL;DR — Two new open MoE models (V4-Pro 1.6T/49B-active, V4-Flash 284B/13B-active) that run native 1M-token context at a fraction of the compute and memory of DeepSeek-V3.2. The big contributions are architectural: a hybrid CSA + HCA attention stack, Manifold-Constrained Hyper-Connections (mHC), and the Muon optimizer at 1.6T scale.

  • V4-Pro FLOPs vs V3.2 @ 1M ctx: 27%
  • V4-Pro KV cache vs V3.2: 10%
  • V4-Flash FLOPs / KV cache vs V3.2: 10% / 7%
  • Pre-training tokens: 32–33T

Why it matters

  • Million-token context at practical cost. The quadratic-attention wall is the real bottleneck on test-time scaling — dropping 1M-context inference to ~10% of prior KV-cache footprint makes long-horizon agentic and multi-document workloads economically routine rather than prohibitive.
  • Frontier-class open weights. V4-Pro-Max beats GPT-5.2 and Gemini-3.0-Pro on reasoning, trails GPT-5.4 / Gemini-3.1-Pro by ~3–6 months, and on internal agent evals surpasses Claude Sonnet 4.5 while approaching Opus 4.5.
  • Unlocks the next scaling regime. The authors frame this explicitly: efficient ultra-long sequences are the foundation for further test-time scaling and for future paradigms like online learning.


Figure 1 — Left: V4-Pro-Max vs Claude-Opus-4.6, GPT-5.4, Gemini-3.1 on knowledge, reasoning, and agentic benchmarks. Right: single-token inference FLOPs and accumulated KV-cache size vs DeepSeek-V3.2 out to 1M tokens.

What is actually new

1. Hybrid attention: CSA + HCA

The headline architectural change. Two complementary attention variants are interleaved:

  • Compressed Sparse Attention (CSA) — compresses every m KV tokens into one entry via learned compression weights, then applies DeepSeek Sparse Attention (top-k selection via a Lightning Indexer) plus a sliding window for local detail.
  • Heavily Compressed Attention (HCA) — same compression idea but far more aggressive (m′ ≫ m) with dense attention over the compressed stream. Interleaving CSA/HCA layers is what makes 1M context tractable.


Figure 3 — CSA compresses KV entries m-to-1, then uses a Lightning Indexer to top-k select compressed blocks; a sliding window preserves local fine-grained dependencies.
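For intuition, here is a minimal single-query PyTorch sketch of the CSA recipe described above. The block size, top-k count, and window are made-up numbers, mean pooling stands in for the learned compression weights, and a plain dot-product score stands in for the Lightning Indexer; this is a sketch of the idea, not the report's implementation.

```python
import torch
import torch.nn.functional as F

def csa_sketch(q, k, v, m=16, top_k=8, window=128):
    """Toy single-head CSA step for one query position (the last token).

    q: (d,)        current query
    k, v: (T, d)   full per-token keys/values
    m:             tokens per compressed KV entry
    top_k:         number of compressed blocks kept by the indexer stand-in
    window:        sliding-window size for local fine-grained attention
    """
    T, d = k.shape

    # 1) Compress every m KV tokens into one entry (mean pooling here,
    #    learned compression weights in the report).
    n_blocks = T // m
    k_c = k[: n_blocks * m].reshape(n_blocks, m, d).mean(dim=1)   # (n_blocks, d)
    v_c = v[: n_blocks * m].reshape(n_blocks, m, d).mean(dim=1)

    # 2) Indexer stand-in: score compressed blocks and keep the top-k.
    scores = k_c @ q                                   # (n_blocks,)
    idx = scores.topk(min(top_k, n_blocks)).indices

    # 3) Attend over the selected compressed entries plus a local sliding window.
    k_sel = torch.cat([k_c[idx], k[-window:]], dim=0)
    v_sel = torch.cat([v_c[idx], v[-window:]], dim=0)
    attn = F.softmax(k_sel @ q / d**0.5, dim=0)
    return attn @ v_sel

# usage
T, d = 4096, 64
out = csa_sketch(torch.randn(d), torch.randn(T, d), torch.randn(T, d))
```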

2. Manifold-Constrained Hyper-Connections (mHC)

Upgrades residual connections by projecting the residual mapping matrix B_l onto the Birkhoff polytope (doubly stochastic matrices) via Sinkhorn-Knopp iteration. This bounds its spectral norm to at most 1, keeping the transformation non-expansive, and fixes the numerical instability that prevented stacking plain Hyper-Connections deeply. A real contribution to residual-stream design.
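A minimal sketch of the Sinkhorn-Knopp projection, assuming B_l is parameterized by unconstrained logits (the names and iteration count are illustrative, not the report's recipe):

```python
import torch

def sinkhorn_birkhoff(logits, n_iters=20, eps=1e-8):
    """Map an (n, n) matrix of unconstrained logits toward the Birkhoff
    polytope by exponentiating and alternately normalizing rows and columns."""
    B = logits.exp()                                    # strictly positive entries
    for _ in range(n_iters):
        B = B / (B.sum(dim=1, keepdim=True) + eps)      # rows sum to 1
        B = B / (B.sum(dim=0, keepdim=True) + eps)      # columns sum to 1
    return B

B = sinkhorn_birkhoff(torch.randn(4, 4))
# spectral norm of a (near-)doubly-stochastic matrix is at or just below 1
print(torch.linalg.matrix_norm(B, ord=2))
```

Because a doubly stochastic mixing matrix cannot amplify the residual stream, each mHC layer stays non-expansive, which is exactly the blow-up that limited how deep plain Hyper-Connections could be stacked.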

3. Muon optimizer at 1.6T scale

First deployment of Muon on a trillion-plus MoE, paired with a custom hybrid ZeRO strategy. Reported faster convergence and better stability than AdamW-class baselines at this size.
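The report's distributed variant is not spelled out here, but for context, a hedged sketch of Muon's core step as found in the public open-source implementation: approximately orthogonalize the 2-D momentum matrix with a quintic Newton-Schulz iteration, then apply it as the update.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D gradient/momentum matrix (Muon's core).
    Coefficients follow the commonly used open-source implementation; the
    trillion-scale variant with hybrid ZeRO described in the report may differ."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # Frobenius scaling => singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                           # work with the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Muon update sketch with momentum buffer M and learning rate lr:
#   M = beta * M + grad
#   W -= lr * newton_schulz_orthogonalize(M)
```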

4. FP4 quantization-aware training

Not just inference quantization — FP4 QAT is applied to MoE expert weights and the indexer QK path during training itself. FP4×FP8 GEMMs can be up to ⅓ faster on future hardware.
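As a rough illustration of the QAT mechanism: generic fake quantization with a straight-through estimator. Real FP4 (e.g. E2M1) uses non-uniform levels and per-block scales, and the report's exact scheme is not reproduced here.

```python
import torch

def fake_quant_ste(w, levels=16):
    """Generic QAT fake-quantization: quantize weights in the forward pass but
    let gradients flow through unchanged (straight-through estimator). A stand-in
    for FP4 QAT, not the report's FP4 format or scaling."""
    scale = w.abs().max() / (levels // 2 - 1) + 1e-12
    w_q = (w / scale).round().clamp(-(levels // 2), levels // 2 - 1) * scale
    return w + (w_q - w).detach()     # forward: quantized value, backward: identity

# During training, expert weights (and the indexer QK path) would pass through
# fake quantization before the GEMM, so the model learns to tolerate 4-bit precision.
```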


Figure 2 — Overall V4 block. Hybrid CSA/HCA attention, DeepSeekMoE feed-forward, mHC-strengthened residual connections, and MTP heads at the output.

5. Infrastructure firsts

  • Single fused MoE kernel that overlaps compute, communication, and memory access.
  • TileLang — DSL balancing kernel productivity vs efficiency.
  • Batch-invariant deterministic kernel library — bitwise reproducibility across training and inference.
  • Two-stage contextual parallelism for compressed attention.
  • Heterogeneous KV-cache structure with on-disk storage for shared-prefix reuse.

6. Post-training: specialist-then-distill

Train independent SFT + GRPO experts per domain (math, code, agent, instruction-following), then consolidate into one model via on-policy distillation with reverse-KL loss. A cleaner formulation of the "many specialists → one generalist" recipe.
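A minimal sketch of the reverse-KL term, with sequences assumed to be sampled from the student (which is what makes it on-policy); sampling, masking, and any baselines the report may use are omitted, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def reverse_kl_distill_loss(student_logits, teacher_logits):
    """KL(student || teacher) averaged over positions of student-sampled
    sequences. Shapes: (batch, seq, vocab)."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    # E_{x ~ student}[log p_student(x) - log p_teacher(x)]
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)
    return kl.mean()
```

Reverse KL is mode-seeking: the consolidated generalist is pushed to stay inside the specialist teacher's high-probability behavior rather than spreading mass over everything the teacher might do.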

7. Architectural housekeeping worth noting

  • MoE affinity scoring changed from Sigmoid to Sqrt(Softplus).
  • Removed cap on routing target nodes for MoE.
  • Hash-routed MoE replaces dense FFNs in the earliest transformer blocks.
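For the last item, a toy illustration of what hash routing means: a fixed, non-learned mapping from token to expert, so the earliest blocks need no trained router (the report's actual hash and expert counts are not specified here).

```python
import torch

def hash_route(token_ids, n_experts):
    """Toy hash routing: each token id maps deterministically to one expert
    (a simple modulo stands in for whatever hash the report uses)."""
    return token_ids % n_experts

expert_ids = hash_route(torch.tensor([101, 2043, 7]), n_experts=4)
```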

Bottom line

The paper's significance lies in the efficiency-per-context-token curve, not raw capability: V4 makes 1M-token reasoning economically viable on open weights. The novelty concentrates in two places:

  1. The CSA/HCA hybrid attention scheme for ultra-long contexts.
  2. mHC's manifold constraint stabilizing deep residual architectures.

The remainder is strong engineering consolidation: Muon at scale, FP4-QAT, deterministic kernels, on-policy distillation. Capability-wise V4-Pro-Max appears to land ~3–6 months behind the leading closed frontier models, while leading the open-weights pack on agentic and long-context tasks.
