README.md · canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP at main

DeepSeek-V4-Flash-NVFP4-FP8-MTP / README.md

pastapaul

Upload README.md with huggingface_hub

3af55f1 verified about 9 hours ago

preview code

raw

history blame contribute delete

21.8 kB

	---
	license: mit
	base_model: deepseek-ai/DeepSeek-V4-Flash
	base_model_relation: quantized
	language:
	- en
	- zh
	library_name: vllm
	pipeline_tag: text-generation
	tags:
	- deepseek
	- deepseek_v4
	- compressed-tensors
	- nvfp4
	- fp8
	- mtp
	- speculative-decoding
	- mixture-of-experts
	- moe
	- vllm
	---

	# canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP

	NVFP4 routed experts + FP8 block 128×128 attention + BF16 Multi-Token Prediction (MTP) draft head retained — same quantization math as [`RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8`](https://huggingface.co/RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8) but with the MTP block preserved in the saved weights so vLLM can load it with `--speculative-config method=mtp`.

	## TL;DR

	\| \| \|
	\|---\|---\|
	\| Recommended hardware \| 4× B300 TP=4 · or RTX PRO 6000 Blackwell at TP=2 (2 GPUs/replica) or TP=4 (4 GPUs/replica) — both validated \|
	\| Quality \| GSM8K 91.81% strict (8-shot); MMLU-Pro 81.13%; HumanEval pass@1 0.915 (EvalPlus) \|
	\| Throughput \| 278.68 output tok/s @ bs=1 chat-code on B300 TP=4 (2.13× vs RedHat NVFP4); RTX PRO 6000 94.6 @ TP=2 / 101.0 @ TP=4 at bs=1 \|
	\| MTP acceptance \| 87.96% on chat-code at bs=1 / k=2 — flat across bs=1 to bs=16 \|
	\| Spec-decode speedup \| 1.8–2.1× decode vs RedHat NVFP4 (workload-dependent) \|
	\| Differentiator \| Only V4-Flash NVFP4 quant where `--speculative-config method=mtp` actually fires — RedHat's artifact dropped MTP during calibration load \|

	## Family / related artifacts

	\| Repo \| Role \| Relation to this artifact \|
	\|---\|---\|---\|
	\| [`canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP`](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP) \| sibling \| W4A16 routed experts (Hopper-compatible), MTP retained — same MTP-preservation pattern. Note: on RTX PRO 6000 (SM 12.0) the W4A16 sibling's Marlin MoE decode path corrupts ~50% of generations under concurrent thinking-mode load. For batched thinking-mode workloads on SM 12.0, this NVFP4 artifact is the recommended choice. See [Card D's Honest limitations](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP#honest-limitations) and the [debug log](https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp/blob/main/docs/findings/sm12x_token_corruption_2026_05_24.md). \|
	\| [`canada-quant/DeepSeek-V4-Flash-W4A16-FP8`](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8) \| predecessor (no-MTP baseline) \| W4A16 + FP8 without MTP — broadest hardware compatibility \|
	\| [`canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP`](https://huggingface.co/canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP) \| larger sibling \| Same NVFP4 + MTP recipe applied to V4-Pro; B300-only deployment \|
	\| [`RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8`](https://huggingface.co/RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8) \| upstream reference \| Same quant math; MTP block dropped by `transformers` silent-strip (the bug this artifact fixes) \|

	## Why this exists

	The HF transformers DSV4 modeling class declares `_keys_to_ignore_on_load_unexpected = [r"(^\|\.)mtp\.."]`, which silently strips MTP keys during the calibration load path. RedHat's NVFP4-FP8 artifact ran through that path, so their saved weights don't include MTP — and serving cannot use V4-Flash's spec-decode head. This artifact patches the modeling class during calibration so MTP keys (`mtp.0.`, 799 tensors) survive at BF16. The result: an NVFP4 artifact that's structurally identical to RedHat's on the math, but loadable with `--speculative-config method=mtp` for ~2× decode speedup.

	## Architecture & precision

	### Base model

	\| Property \| Value \|
	\|---\|---\|
	\| Total parameters \| ~284 B (~13 B active per token) \|
	\| Decoder layers \| 43 \|
	\| Routed experts / layer \| 256 (top-K = 6) \|
	\| Hidden size \| 4096 \|
	\| Base BF16 size \| ~600 GB \|
	\| Quantized size \| 172 GB across 35 safetensors shards \|

	### Component precisions

	\| Component \| Format \| Method \|
	\|---\|---\|---\|
	\| Routed FFN experts (`w1`, `w2`, `w3` per expert) \| NVFP4 group=16 \| weight static + input dynamic "local" FP4 group=16, `nvfp4-pack-quantized` \|
	\| Attention path (`wq_a`, `wq_b`, `wkv`, `wo_a`, `wo_b` and fused) \| FP8_BLOCK 128×128 \| weight static + input dynamic FP8 group=128, `float-quantized` \|
	\| *MTP block (`mtp.0.`) \| BF16 \| Preserved verbatim (799 tensors)** \|
	\| `lm_head`, `embed_tokens`, norms, `ffn.gate`, `ffn.shared_experts`, attn `compressor`, attn `indexer`, `attn_sink`, `hc_*` \| BF16 \| Unquantized \|

	## Hardware validated

	\| Platform \| SM \| HBM/GPU \| Interconnect \| TP \| Role \|
	\|---\|---\|---\|---\|---\|---\|
	\| 4× NVIDIA B300 SXM6 AC \| 10.3, sm_103a \| 288 GB HBM3e \| NVLink \| 4 (TP=8 for BF16 reference) \| Primary — all accuracy + throughput numbers \|
	\| 4× NVIDIA RTX PRO 6000 Blackwell Server Edition \| 12.0, sm_120 \| 96 GB HBM \| PCIe \| TP=2 (2 GPUs, 2 replicas on a 4-GPU box) or TP=4 (4 GPUs, 1 replica) \| Also validated — both TP configs + GSM8K-50 cross-check, 3 extra patches \|

	Both platforms serve cuda graphs ON. Same artifact, no weight changes between SKUs.

	## Benchmarks

	### Quality (hardware-invariant — measured on B300)

	Measured 2026-05-21 on 4× B300 SXM6 AC (TP=4 for quant configs, TP=8 for BF16 reference which doesn't fit at TP=4). Greedy, temperature 0. The same artifact serves on RTX PRO 6000 Blackwell with no weight changes; GSM8K-50 cross-check: 88% strict TP=2 / 90% strict TP=4 on RTX PRO 6000 vs 91.81% strict full-set on B300 (within noise).

	\| Benchmark \| Setting \| This artifact \| BF16 + MTP reference \| `RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8` (no MTP) \|
	\|---\|---\|---\|---\|---\|
	\| AIME 2024 \| raw pass@1, thinking=high, max_tokens=65536 \| 25/30 = 83.33% \| 25/30 = 83.33% \| 27/30 = 90.00% \|
	\| AIME 2024 \| non-truncated pass@1 \| 24/25 = 96.00% \| 25/26 = 96.15% \| 27/28 = 96.43% \|
	\| AIME 2024 \| wall-clock for 30 problems @ bs=8 \| 476 s \| 490 s \| 1405 s \|
	\| GSM8K \| 8-shot, strict-match \| 0.9181 \| 0.9484 / 0.9522 (no-MTP / MTP) \| 0.910 (self-reported) \|
	\| GSM8K \| 8-shot, flexible-extract \| 0.9515 \| 0.9477 / 0.9515 \| not reported \|
	\| MMLU-Pro \| 5-shot, custom-extract \| 0.8113 \| not measured \| not reported \|
	\| HumanEval \| pass@1 (EvalPlus) \| 0.915 \| not measured \| 0.896 \|
	\| HumanEval+ \| pass@1 (EvalPlus) \| 0.848 \| not measured \| 0.860 \|
	\| IFEval \| prompt-strict (B300) \| 0.8540 \| not measured \| 0.8207 \|
	\| IFEval \| prompt-strict (RTX PRO 6000 TP=4, 2026-05-24, [JSON evidence](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/benchmarks/rtxpro6000/ifeval_2026_05_24.json)) \| 0.8429 (-1.1pp vs B300) \| — \| — \|

	On raw AIME pass@1, RedHat scores higher (27/30 vs ours 25/30) — but the gap is entirely truncation rate at the 65K max_tokens cap (RedHat truncated 2/30, ours 5/30). On non-truncated pass@1, all three configs are within 0.4 pt of each other (96.0–96.4%). Quantization quality is equivalent on AIME 2024; the differentiator is wall-clock.

	### Throughput

	#### 4× B300 SXM6 (sm_103a, NVLink, TP=4)

	Same hardware, same TP=4, same prompts as the quality table.

	\| Workload \| Operating point \| This artifact \| RedHat NVFP4 (no MTP) \| Ratio \|
	\|---\|---\|---\|---\|---\|
	\| AIME 2024 reasoning (thinking=high, bs=8) \| wall-clock for 30 problems \| 476 s \| 1405 s \| 2.95× \|
	\| AIME 2024 reasoning \| per-request median output tok/s \| 182.9 \| 99.6 \| 1.84× \|
	\| Coding (HumanEval chat, bs=1) \| output tok/s \| 278.68 \| 131.06 \| 2.13× \|
	\| Coding (HumanEval chat, bs=4) \| output tok/s \| 649.35 \| 417.87 \| 1.55× \|
	\| Coding (HumanEval chat, bs=8) \| output tok/s \| 1104.89 \| 673.12 \| 1.64× \|
	\| Coding (HumanEval chat, bs=16) \| output tok/s \| 1577.20 \| 1007.78 \| 1.56× \|

	Two ratios to disambiguate:
	- Pure decode throughput: at bs=1 chat coding, 2.13× faster. On AIME reasoning at bs=8, per-request median is 182.9 vs 99.6 tok/s — 1.84×. The decode ratio is workload-dependent (acceptance % varies) but lands in the 1.8–2.1× range across measured workloads.
	- AIME batch wall-clock: 1405 s / 476 s = 2.95×. This includes the truncation-rate differential at 65K — 5/30 of our responses truncated vs 2/30 of RedHat's, and truncated responses run to the cap, inflating RedHat's total wall-clock. The 2.95× number is "time to run AIME 2024 end-to-end," not "raw decode speed."

	#### 4× RTX PRO 6000 Blackwell (sm_120, PCIe, TP=2 and TP=4)

	Validated 2026-05-23 on a Brev `familiar-teal-worm` instance. Per-replica `vllm bench serve` random 256-in/256-out, `num_speculative_tokens=1` (SM 12.0 caps spec at k=1). MTP-on for all rows.

	\| Config \| bs=1 output tok/s \| bs=4 output tok/s \| bs=16 output tok/s \| bs=1 TPOT median \| MTP acceptance \| GSM8K-50 strict \|
	\|---\|---\|---\|---\|---\|---\|---\|
	\| TP=2 \| 94.6 \| 218.5 \| 360.5 \| 9.05 ms \| 70–73% \| 88% \|
	\| TP=4 \| 101.0 \| 254.0 \| 440.1 \| 8.20 ms \| 67–75% \| 90% \|

	At bs=16, TP=4 is 1.22× faster per-replica than TP=2 on this hardware — opposite of B300, where TP=4 beats TP=8 due to NVFP4 tensor-core underutilization. RTX PRO 6000's slower PCIe interconnect plus lower per-GPU compute means extra parallelism still pays off at all batch sizes measured.

	For context on the same RTX PRO 6000 box, the [W4A16-FP8-MTP sibling](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP) measured 98.83 tok/s at TP=2 bs=1 — equivalent decode throughput, with NVFP4 trading ~4% per-replica throughput for ~10% smaller on-disk footprint (172 GB vs 159 GB).

	##### AIME-2024 deep thinking-mode concurrency sweep (2026-05-25, TP=4)

	cuda graphs ON (capture sizes [1,2,4,8]), MTP `num_speculative_tokens=1`, `max-model-len=16384`. Bench JSONs at [`canada-quant/dsv4-flash-nvfp4-fp8-mtp/benchmarks/rtxpro6000/`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/tree/main/benchmarks/rtxpro6000).

	\| Concurrency \| Correct/30 \| Stop / Length \| Errors \| Wall (s) \| Problems/min \| MTP accept \| Speedup vs c=1 \|
	\|---\|---\|---\|---\|---\|---\|---\|---\|
	\| c=1 (sequential) \| 24/30 (80.0%) \| 22 / 8 \| 0 \| 1453.9 \| 1.24 \| 90.61% \| 1.0× \|
	\| c=2 \| 23/30 (76.7%) \| 23 / 7 \| 0 \| 787.6 \| 2.29 \| 90.75% \| 1.85× \|
	\| c=4 \| 21/30 (70.0%) \| 20 / 10 \| 0 \| 386.6 \| 4.66 \| 90.93% \| 3.76× \|
	\| c=8 \| (terminated) \| — \| — \| — \| — \| — \| — \|

	Findings:
	- 0 errors and 0 stopped-but-wrong at c=1/2/4. Every wrong answer is length-truncated at `max_tokens`, not a quality issue — non-truncated pass@1 is essentially 100%.
	- MTP acceptance stable at 90.6–90.9% across c=1/c=2/c=4. The NVFP4 `flashinfer_trtllm` MoE backend on SM 12.0 is rock-solid under all tested concurrencies (unlike the W4A16 sibling's Marlin MoE path — see Card D for that story).
	- c=8 throughput collapse: TP=4 with no NVLink (PCIe-only) drops combined throughput from 450 t/s @ c=4 to ~38 t/s @ c=8 — a 12× per-request slowdown. MTP itself stayed healthy; the bottleneck is TP-allreduce communication over PCIe at high concurrency. Recommendation for higher aggregate throughput on RTX PRO 6000: run 2 replicas at TP=2 instead of 1 replica at TP=4 c=8.

	### MTP draft-token acceptance per workload (B300, bs=1, k=2)

	\| Workload \| Acceptance \|
	\|---\|---\|
	\| Random prompts (1024 in / 512 out) \| 10.75% \|
	\| Code, raw completion (HumanEval `/v1/completions`) \| 67.29% \|
	\| Code, chat-templated (HumanEval `/v1/chat/completions`, bs=1) \| 87.96% \|
	\| Code, chat-templated, bs=4 / bs=8 / bs=16 \| 88.27% / 87.92% / 88.19% \|
	\| Instruction following (IFEval) \| ~58.5% \|
	\| AIME 2024 reasoning (thinking=high) \| 81.60% \|

	Acceptance does not degrade under batching — flat at 88.0% ± 0.4% across bs=1 to bs=16 on chat-templated coding. RTX PRO 6000 acceptance lands in 67–75% on the random-prompt workload (256-in/256-out, not directly comparable to the workload-specific rows above).

	## Quick start

	One-line installer (applies all common patches):

	```bash
	curl -sL https://raw.githubusercontent.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/main/scripts/install_vllm_with_patches.sh \| bash
	```

	Serve with MTP spec-decode (B300):

	```bash
	CUDA_HOME=/usr/local/cuda VLLM_TEST_FORCE_FP8_MARLIN=1 \
	vllm serve canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP \
	--tensor-parallel-size 4 \
	--kv-cache-dtype fp8 \
	--speculative-config '{"method":"mtp","num_speculative_tokens":2}'
	```

	Without spec-decode:

	```bash
	CUDA_HOME=/usr/local/cuda VLLM_TEST_FORCE_FP8_MARLIN=1 \
	vllm serve canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP \
	--tensor-parallel-size 4 \
	--kv-cache-dtype fp8
	```

	Recommended TP:
	- B300: TP=4. TP=8 is slower than TP=4 at bs≥4 by up to 21.6% — per-rank MoE expert shards at TP=8 underutilize NVFP4 tensor-core kernels.
	- RTX PRO 6000: TP=4 with reduced cudagraph captures + `--max-num-seqs 8 --max-num-batched-tokens 2048` to fit memory. TP=2 also works; expect 1.22× lower per-replica throughput at bs=16.

	## Quantization recipe

	\| Property \| Value \|
	\|---\|---\|
	\| Dataset \| `HuggingFaceH4/ultrachat_200k` train_sft (V4 chat template) \|
	\| Samples \| 64 × max_seq_len 512 × batch_size 1, seed 42 \|
	\| Modifier class \| `QuantizationModifier` (not GPTQ — Hessian-reduce path hangs on multi-rank B300) \|
	\| Hardware \| calibration on B300 \|

	Calibration corpus is 12× smaller than RedHat's reference recipe (64 vs 768 samples). On the benchmarks measured, GSM8K / HumanEval / IFEval / MMLU-Pro / AIME-non-truncated all land within noise of the reference. The visible cost of reduced coverage is AIME truncation rate (5/30 vs RedHat's 2/30 at the 65K max_tokens cap), consistent with looser calibration scales producing less-converging reasoning trajectories. A v0.2 recipe with 768 samples is planned.

	\| Group \| Modules \| Scheme \| Format \|
	\|---\|---\|---\|---\|
	\| attention \| `wq_a, wq_b, wkv, wo_a, wo_b` (and fused variants) \| FP8_BLOCK 128×128, weight static + input dynamic FP8 group=128 \| `float-quantized` \|
	\| experts \| `w1, w2, w3` per expert \| NVFP4 group=16, weight static + input dynamic "local" FP4 group=16 \| `nvfp4-pack-quantized` \|
	\| ignored \| `lm_head`, `embed_tokens`, norms, `ffn.gate`, `ffn.shared_experts`, attn `compressor`, attn `indexer`, `attn_sink`, `hc_*` \| unquantized (BF16) \| n/a \|
	\| MTP block (`mtp.0.*`) \| all 799 keys \| unquantized (BF16, preserved verbatim) \| n/a \|

	## vLLM build

	### Common patches (all platforms)

	\| PR \| Purpose \| Status \|
	\|---\|---\|---\|
	\| [`vllm-project/vllm#43248`](https://github.com/vllm-project/vllm/pull/43248) \| `bool()` wrap on `is_static_input_scheme` \| open \|
	\| [`vllm-project/vllm#43288`](https://github.com/vllm-project/vllm/pull/43288) \| `.get("scale_fmt", "ue8m0")` on missing key + BF16 `getattr` follow-up \| open \|
	\| [`vllm-project/vllm#43290`](https://github.com/vllm-project/vllm/pull/43290) \| `weight_scale_inv`-or-`weight_scale` fallback \| open \|
	\| [`vllm-project/vllm#43319`](https://github.com/vllm-project/vllm/pull/43319) \| MTP-quant-detect from safetensors header + BF16 `wo_a` fallback path \| open \|

	The one-line installer applies all four automatically.

	### RTX PRO 6000 Blackwell (SM 12.0) only

	Three SM 12.0-specific patches required on top of the four common patches. Diffs in [`patches/sm120_*.diff`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/patches) in the source repo. Full rationale at [`docs/RECIPE_RTX6000PRO.md`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/docs/RECIPE_RTX6000PRO.md).

	1. `VLLM_TEST_FORCE_FP8_MARLIN=1` env var — bypasses the NVFP4 MoE backend selector's `swiglu_limit` filter (no `FLASHINFER_TRTLLM` NVFP4 kernel auto-selects on SM 12.0).
	2. `weight_scale_inv`-or-`weight_scale` fallback in Marlin's `scaled_mm/marlin.py` (PR #43290 covers `attention.py` only; SM 12.0 also hits Marlin's pre-process site).
	3. Skip Marlin pre-processing for layers tagged `is_bmm=True` — DSV4 `wo_a`/`wo_b`/`compressor.wkv` use the SM 12.0 Triton `fp8_einsum` kernel directly; Marlin's tile-layout repack breaks the original `(N, K)` layout the einsum expects.

	B300 deployments can skip all three.

	## Honest limitations

	1. AIME truncation rate at 65K — 5/30 of responses hit the cap on long reasoning traces vs RedHat's 2/30. Consistent with the 12×-smaller calibration corpus producing less-converging reasoning trajectories. Non-truncated pass@1 is at parity with RedHat. v0.2 with 768 samples planned.
	2. NVFP4 MoE backend selector on SM 12.0 — no `FLASHINFER_TRTLLM` kernel auto-selects, requires the `VLLM_TEST_FORCE_FP8_MARLIN=1` env var to route through Marlin. Native NVFP4 SM 12.0 kernels exist in upstream vLLM (`csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu`) but aren't picked by the backend selector ([`vllm-project/vllm#31085`](https://github.com/vllm-project/vllm/issues/31085)).
	3. k=1 cap on RTX PRO 6000 — SM 12.0 caps spec-decode at `num_speculative_tokens=1`; B300 supports k=2.
	4. AIME thinking acceptance @ 81.60% is lower than the chat-code 87.96% headline — workload-dependent, expected, called out for transparency.
	5. IFEval re-bench 2026-05-24 (RTX PRO 6000 TP=4) — close to published B300 numbers but slightly lower. A fresh `lm_eval ifeval --apply_chat_template num_concurrent=16` measurement on RTX PRO 6000 TP=4 (post-PR-#40923-rebuild) returned prompt_strict 0.8429, prompt_loose 0.8780, inst_strict 0.8945, inst_loose 0.9185 — within 0.6–1.5 pp of the published markdown numbers (0.8540 / 0.8928 / 0.9005 / 0.9293). The published numbers likely came from B300 (the primary benchmark platform); RTX PRO 6000 measurements are slightly lower but consistent. Raw JSON evidence now committed at [`benchmarks/rtxpro6000/ifeval_2026_05_24.json`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/benchmarks/rtxpro6000/ifeval_2026_05_24.json). Originally flagged as "no on-disk JSON evidence"; that gap is now closed.

	## Reproduction

	Full replication recipe at [`docs/recipes/nvfp4_fp8_mtp_replication.md`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/docs/recipes/nvfp4_fp8_mtp_replication.md) — covers the 14 gotchas (sm_103a vs sm_100a, calibration recipe, postprocess pipeline, vLLM build flags).

	## Upstream contributions filed during this work

	\| PR / Issue \| Description \| Status \|
	\|---\|---\|---\|
	\| [`vllm-project/vllm#43248`](https://github.com/vllm-project/vllm/pull/43248) \| `bool()` wrap on `is_static_input_scheme` \| open \|
	\| [`vllm-project/vllm#43288`](https://github.com/vllm-project/vllm/pull/43288) \| `.get("scale_fmt", "ue8m0")` defensive + BF16 follow-up \| open \|
	\| [`vllm-project/vllm#43290`](https://github.com/vllm-project/vllm/pull/43290) \| `weight_scale_inv`-or-`weight_scale` fallback \| open \|
	\| [`vllm-project/vllm#43319`](https://github.com/vllm-project/vllm/pull/43319) \| MTP-quant-detect from safetensors + BF16 `wo_a` fallback \| open \|
	\| [`vllm-project/vllm#43297`](https://github.com/vllm-project/vllm/issues/43297) \| `(1,)`-shape `global_scale` loader broadcast (issue) \| open \|
	\| [`vllm-project/vllm#43304`](https://github.com/vllm-project/vllm/issues/43304) \| MTP draft inherits main quant scheme (issue) \| partially addressed by #43319 \|
	\| [`vllm-project/llm-compressor#2745`](https://github.com/vllm-project/llm-compressor/issues/2745) \| MTP inference-mode crash \| open \|
	\| [`vllm-project/compressed-tensors#711`](https://github.com/vllm-project/compressed-tensors/issues/711) \| sharded-module load path \| open \|

	PR [`vllm-project/vllm#42209`](https://github.com/vllm-project/vllm/pull/42209) (sychen52, xinli-sw, pavanimajety, zyongye — NVIDIA) which added the DSV4 NVFP4 MoE kernel merged 2026-05-22; this artifact serves on top of that.

	## Changes

	\| Date \| Change \|
	\|---\|---\|
	\| 2026-05-21 \| Initial release on B300 — GSM8K 0.9181, HumanEval 0.915, IFEval 0.8540, MTP acceptance 87.96% on chat-code \|
	\| 2026-05-23 \| RTX PRO 6000 Blackwell (SM 12.0) validation added. TP=2 and TP=4 confirmed, MTP acceptance 67–75%, GSM8K-50 within noise of B300 \|
	\| 2026-05-24 \| Cross-card finding: AIME c=4 thinking-mode on RTX PRO 6000 shows this NVFP4 artifact produces 1/30 token-corrupted generations vs the [W4A16-MTP sibling](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP)'s 14/30 corrupted on the same hardware + vLLM build. The W4A16 sibling has a Marlin MoE decode race on SM 12.0; this NVFP4 artifact via `flashinfer_trtllm` MoE is the recommended deployment for batched thinking-mode on RTX PRO 6000. Filed upstream: [`jasl/vllm#12`](https://github.com/jasl/vllm/issues/12). \|

	## Files in the artifact

	- 35 sharded `model-*.safetensors` files + `model.safetensors.index.json` (172 GB total)
	- `config.json` — vLLM-compatible quantization_config with fused targets + W8A8 input_activations
	- `tokenizer.json`, `tokenizer_config.json`, `generation_config.json` — upstream DSV4-Flash
	- `chat_template.jinja` — upstream DSV4-Flash (unchanged)
	- `recipe.yaml` — the llm-compressor calibration recipe
	- `README.md` — this file

	## Citation

	```bibtex
	@misc{canada-quant-dsv4-flash-nvfp4-fp8-mtp-2026,
	title = {DeepSeek-V4-Flash NVFP4-FP8 with MTP preserved for vLLM speculative decoding},
	author = {Canada Quant},
	year = {2026},
	publisher = {Hugging Face},
	url = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP}
	}
	```

	## License

	MIT, inherited from upstream `deepseek-ai/DeepSeek-V4-Flash`.

	## Acknowledgments

	- DeepSeek for V4-Flash and the MTP architecture.
	- RedHat AI for the NVFP4-FP8 reference recipe.
	- PR [`#42209`](https://github.com/vllm-project/vllm/pull/42209) contributors (sychen52, xinli-sw, pavanimajety, zyongye) for the DSV4 NVFP4 MoE kernel work that made serving possible.
	- [`canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP`](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP) (W4A16 sibling) for the alias-injection pattern and MTP acceptance methodology.
	- vLLM, llm-compressor, compressed-tensors, FlashInfer maintainers.