DeepSeek-V4-Flash — W4A16 + FP8 + BF16 MTP (Option Y)
The first DeepSeek-V4-Flash quantization that preserves the Multi-Token Prediction (MTP) draft head at BF16 precision while quantizing the main routed-experts to W4A16 and the attention path to FP8_BLOCK. The MTP draft tower delivers a measured 1.49× decode speedup at single-user concurrency versus the same artifact served without speculative decoding — with zero quality regression on standard knowledge benchmarks vs the predecessor W4A16 quant.
| Component | Precision | Quantization recipe |
|---|---|---|
| Main routed experts (43 MoE layers × 256 experts × 3 projections) | W4A16 INT4 g=128 sym | GPTQ via llm-compressor, 768 calibration samples |
Attention path (wq_a, wq_b, wkv, wo_a, wo_b, indexer, compressor) |
FP8_BLOCK 128×128 | Dynamic scales, scale_fmt=ue8m0 |
MTP block (mtp.0.*) |
BF16 | Excluded from quantization (Option Y) |
HC plumbing (hc_attn_*, hc_ffn_*, hc_head_*), attn_sink, ffn.gate.bias, indexer/compressor ape |
FP32 | Restored post-save from BF16 source (see C13) |
head.weight (LM head) |
FP32 | Upcast from BF16 to match sibling artifact |
Vocab embedding (embed.weight, mtp.0.emb.tok_emb.weight) |
BF16 | Source dtype preserved |
Total artifact size: 159 GB (4 shards).
Why this exists
The predecessor canada-quant/DeepSeek-V4-Flash-W4A16-FP8 and the RedHat
RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 artifacts both drop the MTP block
because transformers 5.8.1's DeepseekV4PreTrainedModel declares:
_keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"]
which silently filters every mtp.* tensor at from_pretrained time —
without warning, without error. Quantization pipelines that go through
from_pretrained therefore produce W4A16/NVFP4 main weights paired with
an absent MTP block — and serving falls back to plain decode, losing the
~1.5–2× spec-decode speedup that DeepSeek-V4-Flash's architecture
provides.
This artifact bypasses that silent drop, runs the full 8-rank GPTQ calibration on a 768-sample dataset against the main routed experts, preserves the MTP block unquantized in BF16, and produces a serving artifact where speculative decoding actually fires.
Benchmarks (H200, vLLM, TP=2)
All numbers from the same Phase 2 artifact, same hardware (1 of 4 TP=2 pairs
on a p5en.48xlarge), same vLLM build (HEAD 50d9dd902 with PRs
#43248+#43288+#43290+#43319 cherry-picked).
Quality (standard benchmarks)
| Benchmark | Phase 2 (this repo) | Predecessor (W4A16-FP8, no MTP, HF card) | RedHat (NVFP4-FP8, no MTP) | Delta vs predecessor |
|---|---|---|---|---|
| GSM8K 8-shot strict-match | 93.71% ± 0.67 | 94.99% (phase4e, 8-shot strict) | 91.0% | -1.28 pts (within SE) |
| GSM8K 8-shot flexible-extract | 93.63% ± 0.67 | — | — | predecessor HF card cites 5-shot flex at 92.87%; not directly comparable to our 8-shot |
| MMLU 5-shot | 86.88% ± 0.27 | 87.27% | — | -0.39 pts (within SE) |
| MMLU-Pro 5-shot (12k prompts, custom-extract) | 71.28% ± 0.40 | — | — | sibling NVFP4-FP8-MTP scored 81.13% (B300) → -9.85 pts vs sibling; expected given W4A16 has more quantization noise than NVFP4 on a knowledge-heavy harder benchmark |
| HumanEval pass@1 0-shot instruct | 84.76% ± 2.82 | 54.27% (strict-regex artifact) | — | predecessor's 54.27% reflects a strict-regex extraction artifact (per predecessor notes); our 84.76% uses lm-eval-harness's default flexible code-block extraction, which is what evaluate-metric/code_eval actually executes |
| AIME 24 (math competition, 30 problems) | 30.0% exact_match ± 8.51 | — | — | sibling published tier1_aime24_2026_05_21.md; competition math at high difficulty |
| Chat-smoke quick (4 deterministic) | 4/4 | 4/4 | — | match |
| Chat-smoke quality (4 writing/translation) | 4/4 | 4/4 | — | match |
| Chat-smoke coding (2 HTML/code) | 2/2 | 2/2 | — | match |
| toolcall15 | 24/30 (80%) | 26/30 (87%) | — | -2 pts |
MTP-specific (Option Y differentiator)
| Metric | Phase 2 | Sibling NVFP4-FP8-MTP (published) | Notes |
|---|---|---|---|
| MTP draft-token acceptance (random 256-token prompts, c=1, k=1, 200 samples) | 69.94% (21024 / 30058) | 67.29% (HumanEval raw code, c=1, k=2) | Direct comparison — different prompt distribution but same metric class |
| MTP acceptance by workload — code (15 raw-completion prompts) | 92.91% (1847 / 1988) | 67.29% (sibling raw HumanEval) | Sibling's HumanEval prompts are full multi-line function bodies; ours are short signature+docstring prompts — predictable continuation pattern bumps acceptance |
| MTP acceptance by workload — chat-templated prose (15 prompts) | 81.90% (1946 / 2376) | 85.04% (sibling EvalPlus HumanEval c=16) | Both numbers fall in the chat-templated 80-85% band sibling documented |
| MTP acceptance by workload — raw natural language (15 continuation prompts) | 83.65% (1745 / 2086) | — | New measurement |
| Decode TPOT median, bs=1, k=1, MTP-spec | 6.02 ms | — | Single-user decode latency per output token |
| Decode TPOT median, bs=1, no spec-decode | 8.93 ms | — | Same artifact, spec-decode disabled |
| Decode speedup bs=1 (k=1, vs no-spec) | 1.49× | 2.03× (sibling at k=2) | Sibling used k=2; we hit DeepGemm kernel ceiling at k=1 (see C15) |
Spec-decode wins at low concurrency (single-user interactive workload).
At bs=4/16, the verifier model is already filling its batch lane, so the
extra verifier passes add overhead without saving wall-clock — matching
sibling's published methodology framing of c=1 as the headline
operating point.
toolcall15 -2 pts explained honestly
The two regressions vs predecessor are model-routing decisions, not parser/quant artifacts:
- TC-06 "Multi-Value Extraction": asked to translate one phrase to
two languages; predecessor issued two
translatetool calls, Phase 2 returned both translations as plain content text without routing to tools. Net effect: same task completed, different execution path. Not a parser failure (confirmed by replaying through--tool-call-parser deepseek_v4). - TC-07 "Search Read Act": Phase 2 issued the first two tool calls
(
search_files,read_file) correctly but stopped mid-chain to ask the user a clarifying question instead of carrying the result forward into a third call. Predecessor completed the chain end-to-end.
Both regressions are conservatism in chain-completion + tool-selection heuristics. Quality-wise the model still completed the underlying user intent; the harness scores tool-call-protocol fidelity, not task completion. No evidence of a parser config issue.
Three upstream contributions surfaced during this work
These bugs were diagnosed during the build and are filed in
FINDINGS_FOR_SIBLING.md for upstream PRs:
C13 — transformers.save_pretrained silently downcasts FP32 to BF16
417 tensors specified as FP32 in DeepSeek's release spec (HC plumbing,
gate bias, attn_sink, indexer/compressor ape) are silently written as
BF16 by transformers.save_pretrained when the model's torch_dtype is
BF16. No warning, no error. Workaround: postprocess restore from the
BF16 source. The
scripts/fixup_artifact.py pipeline does
this in one atomic per-shard pass. Worth filing upstream against
transformers — the save path should preserve per-tensor dtype.
C14 — vLLM MTP loader silently skips top-level head.weight + embed.weight
vllm.models.deepseek_v4.nvidia.mtp.DeepSeekV4MTP.load_weights calls
name.replace("mtp.0.", "") which no-ops on non-mtp.0.* keys, then
get_spec_layer_idx returns None → the loop hits continue and the
weight is skipped. Top-level head.weight and embed.weight never
reach the MTP layer's shared_head.head / embed_tokens, leaving
those parameters uninitialized → garbage logits → 0% MTP draft
acceptance with no load-time error.
Workaround: postprocess injects mtp.0.head.weight (FP32 copy of
top-level head, matching sibling artifact's pattern) and
mtp.0.emb.tok_emb.weight (BF16 copy of top-level embed) as full
duplicates. Worth filing upstream against vLLM — the loader should
either route top-level keys to the MTP slot or raise at construction
time when shared_head.head is uninitialized.
C15 — DeepGemm paged_mqa_logits kernel asserts on num_speculative_tokens > 1
vllm serve --speculative-config method=mtp,num_speculative_tokens=2
crashes during profile_cudagraph_memory with:
Assertion error (smxx_fp8_fp4_paged_mqa_logits.hpp:233):
next_n == 1 or next_n == 2
vLLM passes next_n = num_speculative_tokens + 1 into the DeepGemm
kernel (k draft + 1 main verifier in the lookahead window). The
assertion enforces num_speculative_tokens <= 1 in practice. The
FLASHINFER_MLA_SPARSE attention backend hits the same assertion
(kernel is logits-side, not attention-backend-specific).
This caps our practical k at 1, leaving headroom on the speedup. With k=2 unlocked the bs=1 decode speedup should land closer to sibling's published 2.03×.
Reproducing this artifact
The full pipeline is committed in this repo. From a fresh 8× H200 box:
# Phase 0 — bootstrap (venv-calib + venv-serve + vendor + apply patches)
bash scripts/bootstrap_p5en_h200.sh
# Phase 1 — download upstream + dequant to BF16-MTP source
# (writes /scratch/weights/bf16-mtp/, ~660 GB, ~30 min)
bash scripts/phase1_dequant.sh # (called from bootstrap)
# Phase 2 — GPTQ calibration (8 ranks, ~15h wall)
bash scripts/run_phase2.sh
# (Equivalently: torchrun --nproc-per-node=8 scripts/quantize_v4_w4a16_mtp.py
# --input /scratch/weights/bf16-mtp --output /scratch/weights/w4a16-fp8-mtp-gptq
# --samples 768 --batch-size 4 --max-seq-len 512
# --checkpoint-dir /scratch/weights/checkpoints-phase2)
# Phase 3 — postprocess (rename + config patch + FP32 restore + MTP aliases)
bash scripts/postprocess_phase2.sh
# (Runs: rename_to_upstream.py → postprocess_for_vllm.py
# → pass2_rename.py (indexer/compressor nesting fix)
# → fixup_artifact.py (FP32 restore + MTP head/embed aliases))
# Phase 4 — verify
python scripts/verify_option_y.py /scratch/weights/w4a16-fp8-mtp-gptq
# Phase 5 — serve
vllm serve /scratch/weights/w4a16-fp8-mtp-gptq \
--tensor-parallel-size 2 \
--kv-cache-dtype fp8 --block-size 256 \
--max-model-len 4096 \
--gpu-memory-utilization 0.80 \
--no-enable-prefix-caching \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 --enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--trust-remote-code
See SUPERVISION_RULES.md for the discipline
practices that emerged during the build (atomic safetensors writes,
verify-before-acting on summaries, subagent briefing standards).
Hardware
Built on AWS p5en.48xlarge (8× H200 SXM5, Hopper SM 9.0a, 143 GB
HBM3e per GPU). vLLM serve uses TP=2 (2 GPUs per replica) per the
sibling artifact's published guidance — TP=4 underutilizes Marlin
tensor cores on per-rank expert shards for our W4A16 layout.
- Phase 2 GPTQ calibration: 15.09h oneshot + ~16 min save = ~15.4h end-to-end on 8× H200. Per-subgraph cadence stabilized at ~20 min/subgraph across 44 subgraphs (43 MoE main + 1 MTP — MTP subgraph is a no-op per Option Y).
- Phase 2 GPTQ output: 4 shards, 159 GB total.
- Postprocess: ~6 min wall (single-process, 4 shards rewritten atomically with FP32 restore + alias injection).
Honest limitations
- k=1 cap on spec-decode due to C15 — current vLLM build limits
num_speculative_tokensto 1 on H200/Hopper. With C15 fixed, expect speedup to rise from 1.49× to ~1.85× at bs=1. - toolcall15 -2 pts vs predecessor — model-routing regressions on chain-completion + multi-tool extraction. See breakdown above. Not a parser issue.
- GSM8K -1.3 pts vs predecessor's 8-shot strict-match — within
one standard-error band, but technically below. Predecessor's
calibration ran on Spark; ours ran on H200 with the same recipe
(
scripts/quantize_v4_w4a16_mtp.pymatches their Phase 2 invocation modulo hardware). Likely calibration-set sensitivity; not a recipe bug.
Credits + reproducibility
- DeepSeek for the base model + inference reference.
- jasl (
jasl/vllmandjasl/vllm-ds4-sm120-harness) for the working vLLM build pin (ds4-sm120-experimentalbranch) and the benchmark harness. canada-quant/DeepSeek-V4-Flash-W4A16-FP8(predecessor) for the proven recipe topology that this artifact extends with MTP.canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP(sibling) for the alias-injection pattern + MTP acceptance methodology.
Source repo, scripts, recipe, every patch: https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp
License
MIT. See LICENSE in the source repo. Base model is licensed per
DeepSeek's terms — review at the upstream deepseek-ai/DeepSeek-V4-Flash
repo before commercial deployment.
- Downloads last month
- 13
Model tree for canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP
Base model
deepseek-ai/DeepSeek-V4-Flash