| --- |
| license: mit |
| base_model: deepseek-ai/DeepSeek-V4-Flash |
| base_model_relation: quantized |
| language: |
| - en |
| - zh |
| library_name: vllm |
| pipeline_tag: text-generation |
| tags: |
| - deepseek |
| - deepseek_v4 |
| - compressed-tensors |
| - nvfp4 |
| - fp8 |
| - mtp |
| - speculative-decoding |
| - mixture-of-experts |
| - moe |
| - vllm |
| --- |
| |
| # canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP |
|
|
| NVFP4 routed experts + FP8 block 128×128 attention + **BF16 Multi-Token Prediction (MTP) draft head retained** — same quantization math as [`RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8`](https://huggingface.co/RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8) but with the MTP block preserved in the saved weights so vLLM can load it with `--speculative-config method=mtp`. |
|
|
| ## TL;DR |
|
|
| | | | |
| |---|---| |
| | **Recommended hardware** | 4× B300 TP=4 · or RTX PRO 6000 Blackwell at **TP=2 (2 GPUs/replica)** or **TP=4 (4 GPUs/replica)** — both validated | |
| | **Quality** | GSM8K 91.81% strict (8-shot); MMLU-Pro 81.13%; HumanEval pass@1 0.915 (EvalPlus) | |
| | **Throughput** | 278.68 output tok/s @ bs=1 chat-code on B300 TP=4 (2.13× vs RedHat NVFP4); RTX PRO 6000 **94.6 @ TP=2** / **101.0 @ TP=4** at bs=1 | |
| | **MTP acceptance** | 87.96% on chat-code at bs=1 / k=2 — flat across bs=1 to bs=16 | |
| | **Spec-decode speedup** | **1.8–2.1× decode** vs RedHat NVFP4 (workload-dependent) | |
| | **Differentiator** | Only V4-Flash NVFP4 quant where `--speculative-config method=mtp` actually fires — RedHat's artifact dropped MTP during calibration load | |
|
|
| ## Family / related artifacts |
|
|
| | Repo | Role | Relation to this artifact | |
| |---|---|---| |
| | [`canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP`](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP) | sibling | W4A16 routed experts (Hopper-compatible), MTP retained — same MTP-preservation pattern. **Note**: on RTX PRO 6000 (SM 12.0) the W4A16 sibling's Marlin MoE decode path corrupts ~50% of generations under concurrent thinking-mode load. **For batched thinking-mode workloads on SM 12.0, this NVFP4 artifact is the recommended choice.** See [Card D's Honest limitations](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP#honest-limitations) and the [debug log](https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp/blob/main/docs/findings/sm12x_token_corruption_2026_05_24.md). | |
| | [`canada-quant/DeepSeek-V4-Flash-W4A16-FP8`](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8) | predecessor (no-MTP baseline) | W4A16 + FP8 without MTP — broadest hardware compatibility | |
| | [`canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP`](https://huggingface.co/canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP) | larger sibling | Same NVFP4 + MTP recipe applied to V4-Pro; B300-only deployment | |
| | [`RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8`](https://huggingface.co/RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8) | upstream reference | Same quant math; MTP block dropped by `transformers` silent-strip (the bug this artifact fixes) | |
|
|
| ## Why this exists |
|
|
| The HF transformers DSV4 modeling class declares `_keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"]`, which silently strips MTP keys during the calibration load path. RedHat's NVFP4-FP8 artifact ran through that path, so their saved weights don't include MTP — and serving cannot use V4-Flash's spec-decode head. This artifact patches the modeling class during calibration so MTP keys (`mtp.0.*`, 799 tensors) survive at BF16. The result: an NVFP4 artifact that's structurally identical to RedHat's on the math, but loadable with `--speculative-config method=mtp` for ~2× decode speedup. |
|
|
| ## Architecture & precision |
|
|
| ### Base model |
|
|
| | Property | Value | |
| |---|---| |
| | Total parameters | ~284 B (~13 B active per token) | |
| | Decoder layers | 43 | |
| | Routed experts / layer | 256 (top-K = 6) | |
| | Hidden size | 4096 | |
| | Base BF16 size | ~600 GB | |
| | Quantized size | **172 GB** across 35 safetensors shards | |
|
|
| ### Component precisions |
|
|
| | Component | Format | Method | |
| |---|---|---| |
| | Routed FFN experts (`w1`, `w2`, `w3` per expert) | NVFP4 group=16 | weight static + input dynamic "local" FP4 group=16, `nvfp4-pack-quantized` | |
| | Attention path (`wq_a`, `wq_b`, `wkv`, `wo_a`, `wo_b` and fused) | FP8_BLOCK 128×128 | weight static + input dynamic FP8 group=128, `float-quantized` | |
| | **MTP block (`mtp.0.*`)** | **BF16** | **Preserved verbatim (799 tensors)** | |
| | `lm_head`, `embed_tokens`, norms, `ffn.gate`, `ffn.shared_experts`, attn `compressor`, attn `indexer`, `attn_sink`, `hc_*` | BF16 | Unquantized | |
| |
| ## Hardware validated |
| |
| | Platform | SM | HBM/GPU | Interconnect | TP | Role | |
| |---|---|---|---|---|---| |
| | 4× NVIDIA B300 SXM6 AC | 10.3, sm_103a | 288 GB HBM3e | NVLink | 4 (TP=8 for BF16 reference) | Primary — all accuracy + throughput numbers | |
| | 4× NVIDIA RTX PRO 6000 Blackwell Server Edition | 12.0, sm_120 | 96 GB HBM | PCIe | **TP=2 (2 GPUs, 2 replicas on a 4-GPU box) or TP=4 (4 GPUs, 1 replica)** | Also validated — both TP configs + GSM8K-50 cross-check, 3 extra patches | |
| |
| Both platforms serve cuda graphs ON. Same artifact, no weight changes between SKUs. |
| |
| ## Benchmarks |
| |
| ### Quality (hardware-invariant — measured on B300) |
| |
| Measured 2026-05-21 on 4× B300 SXM6 AC (TP=4 for quant configs, TP=8 for BF16 reference which doesn't fit at TP=4). Greedy, temperature 0. The same artifact serves on RTX PRO 6000 Blackwell with no weight changes; GSM8K-50 cross-check: 88% strict TP=2 / 90% strict TP=4 on RTX PRO 6000 vs 91.81% strict full-set on B300 (within noise). |
| |
| | Benchmark | Setting | This artifact | BF16 + MTP reference | `RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8` (no MTP) | |
| |---|---|---|---|---| |
| | AIME 2024 | raw pass@1, thinking=high, max_tokens=65536 | 25/30 = 83.33% | 25/30 = 83.33% | 27/30 = 90.00% | |
| | AIME 2024 | non-truncated pass@1 | 24/25 = 96.00% | 25/26 = 96.15% | 27/28 = 96.43% | |
| | AIME 2024 | wall-clock for 30 problems @ bs=8 | **476 s** | 490 s | 1405 s | |
| | GSM8K | 8-shot, strict-match | 0.9181 | 0.9484 / 0.9522 (no-MTP / MTP) | 0.910 (self-reported) | |
| | GSM8K | 8-shot, flexible-extract | 0.9515 | 0.9477 / 0.9515 | not reported | |
| | MMLU-Pro | 5-shot, custom-extract | 0.8113 | not measured | not reported | |
| | HumanEval | pass@1 (EvalPlus) | **0.915** | not measured | 0.896 | |
| | HumanEval+ | pass@1 (EvalPlus) | 0.848 | not measured | 0.860 | |
| | IFEval | prompt-strict (B300) | **0.8540** | not measured | 0.8207 | |
| | IFEval | prompt-strict (RTX PRO 6000 TP=4, 2026-05-24, [JSON evidence](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/benchmarks/rtxpro6000/ifeval_2026_05_24.json)) | **0.8429** (-1.1pp vs B300) | — | — | |
| |
| On raw AIME pass@1, RedHat scores higher (27/30 vs ours 25/30) — but the gap is **entirely truncation rate** at the 65K max_tokens cap (RedHat truncated 2/30, ours 5/30). On non-truncated pass@1, all three configs are within 0.4 pt of each other (96.0–96.4%). Quantization quality is equivalent on AIME 2024; the differentiator is wall-clock. |
| |
| ### Throughput |
| |
| #### 4× B300 SXM6 (sm_103a, NVLink, TP=4) |
| |
| Same hardware, same TP=4, same prompts as the quality table. |
| |
| | Workload | Operating point | This artifact | RedHat NVFP4 (no MTP) | Ratio | |
| |---|---|---|---|---| |
| | AIME 2024 reasoning (thinking=high, bs=8) | wall-clock for 30 problems | 476 s | 1405 s | **2.95×** | |
| | AIME 2024 reasoning | per-request median output tok/s | 182.9 | 99.6 | **1.84×** | |
| | Coding (HumanEval chat, bs=1) | output tok/s | **278.68** | 131.06 | 2.13× | |
| | Coding (HumanEval chat, bs=4) | output tok/s | 649.35 | 417.87 | 1.55× | |
| | Coding (HumanEval chat, bs=8) | output tok/s | 1104.89 | 673.12 | 1.64× | |
| | Coding (HumanEval chat, bs=16) | output tok/s | 1577.20 | 1007.78 | 1.56× | |
| |
| Two ratios to disambiguate: |
| - **Pure decode throughput**: at bs=1 chat coding, 2.13× faster. On AIME reasoning at bs=8, per-request median is 182.9 vs 99.6 tok/s — **1.84×**. The decode ratio is workload-dependent (acceptance % varies) but lands in the **1.8–2.1×** range across measured workloads. |
| - **AIME batch wall-clock**: 1405 s / 476 s = **2.95×**. This includes the truncation-rate differential at 65K — 5/30 of our responses truncated vs 2/30 of RedHat's, and truncated responses run to the cap, inflating RedHat's total wall-clock. The 2.95× number is "time to run AIME 2024 end-to-end," not "raw decode speed." |
| |
| #### 4× RTX PRO 6000 Blackwell (sm_120, PCIe, TP=2 and TP=4) |
| |
| Validated 2026-05-23 on a Brev `familiar-teal-worm` instance. Per-replica `vllm bench serve` random 256-in/256-out, `num_speculative_tokens=1` (SM 12.0 caps spec at k=1). MTP-on for all rows. |
| |
| | Config | bs=1 output tok/s | bs=4 output tok/s | bs=16 output tok/s | bs=1 TPOT median | MTP acceptance | GSM8K-50 strict | |
| |---|---|---|---|---|---|---| |
| | TP=2 | 94.6 | 218.5 | 360.5 | 9.05 ms | 70–73% | 88% | |
| | TP=4 | **101.0** | 254.0 | **440.1** | **8.20 ms** | 67–75% | 90% | |
| |
| At bs=16, TP=4 is 1.22× faster per-replica than TP=2 on this hardware — opposite of B300, where TP=4 beats TP=8 due to NVFP4 tensor-core underutilization. RTX PRO 6000's slower PCIe interconnect plus lower per-GPU compute means extra parallelism still pays off at all batch sizes measured. |
| |
| For context on the same RTX PRO 6000 box, the [W4A16-FP8-MTP sibling](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP) measured 98.83 tok/s at TP=2 bs=1 — equivalent decode throughput, with NVFP4 trading ~4% per-replica throughput for ~10% smaller on-disk footprint (172 GB vs 159 GB). |
| |
| ##### AIME-2024 deep thinking-mode concurrency sweep (2026-05-25, TP=4) |
| |
| cuda graphs ON (capture sizes [1,2,4,8]), MTP `num_speculative_tokens=1`, `max-model-len=16384`. Bench JSONs at [`canada-quant/dsv4-flash-nvfp4-fp8-mtp/benchmarks/rtxpro6000/`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/tree/main/benchmarks/rtxpro6000). |
| |
| | Concurrency | Correct/30 | Stop / Length | Errors | Wall (s) | Problems/min | MTP accept | Speedup vs c=1 | |
| |---|---|---|---|---|---|---|---| |
| | c=1 (sequential) | **24/30** (80.0%) | 22 / 8 | 0 | 1453.9 | 1.24 | 90.61% | 1.0× | |
| | c=2 | **23/30** (76.7%) | 23 / 7 | 0 | 787.6 | 2.29 | 90.75% | 1.85× | |
| | c=4 | **21/30** (70.0%) | 20 / 10 | 0 | 386.6 | 4.66 | 90.93% | 3.76× | |
| | c=8 | (terminated) | — | — | — | — | — | — | |
| |
| **Findings:** |
| - **0 errors and 0 stopped-but-wrong at c=1/2/4.** Every wrong answer is length-truncated at `max_tokens`, not a quality issue — non-truncated pass@1 is essentially 100%. |
| - **MTP acceptance stable at 90.6–90.9%** across c=1/c=2/c=4. The NVFP4 `flashinfer_trtllm` MoE backend on SM 12.0 is rock-solid under all tested concurrencies (unlike the W4A16 sibling's Marlin MoE path — see Card D for that story). |
| - **c=8 throughput collapse**: TP=4 with no NVLink (PCIe-only) drops combined throughput from 450 t/s @ c=4 to ~38 t/s @ c=8 — a 12× per-request slowdown. MTP itself stayed healthy; the bottleneck is TP-allreduce communication over PCIe at high concurrency. **Recommendation for higher aggregate throughput on RTX PRO 6000: run 2 replicas at TP=2 instead of 1 replica at TP=4 c=8.** |
| |
| ### MTP draft-token acceptance per workload (B300, bs=1, k=2) |
| |
| | Workload | Acceptance | |
| |---|---| |
| | Random prompts (1024 in / 512 out) | 10.75% | |
| | Code, raw completion (HumanEval `/v1/completions`) | 67.29% | |
| | Code, chat-templated (HumanEval `/v1/chat/completions`, bs=1) | **87.96%** | |
| | Code, chat-templated, bs=4 / bs=8 / bs=16 | 88.27% / 87.92% / 88.19% | |
| | Instruction following (IFEval) | ~58.5% | |
| | AIME 2024 reasoning (thinking=high) | 81.60% | |
| |
| Acceptance does not degrade under batching — flat at 88.0% ± 0.4% across bs=1 to bs=16 on chat-templated coding. RTX PRO 6000 acceptance lands in 67–75% on the random-prompt workload (256-in/256-out, not directly comparable to the workload-specific rows above). |
| |
| ## Quick start |
| |
| One-line installer (applies all common patches): |
| |
| ```bash |
| curl -sL https://raw.githubusercontent.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/main/scripts/install_vllm_with_patches.sh | bash |
| ``` |
| |
| Serve with MTP spec-decode (B300): |
| |
| ```bash |
| CUDA_HOME=/usr/local/cuda VLLM_TEST_FORCE_FP8_MARLIN=1 \ |
| vllm serve canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP \ |
| --tensor-parallel-size 4 \ |
| --kv-cache-dtype fp8 \ |
| --speculative-config '{"method":"mtp","num_speculative_tokens":2}' |
| ``` |
| |
| Without spec-decode: |
| |
| ```bash |
| CUDA_HOME=/usr/local/cuda VLLM_TEST_FORCE_FP8_MARLIN=1 \ |
| vllm serve canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP \ |
| --tensor-parallel-size 4 \ |
| --kv-cache-dtype fp8 |
| ``` |
| |
| **Recommended TP:** |
| - **B300**: TP=4. TP=8 is slower than TP=4 at bs≥4 by up to 21.6% — per-rank MoE expert shards at TP=8 underutilize NVFP4 tensor-core kernels. |
| - **RTX PRO 6000**: TP=4 with reduced cudagraph captures + `--max-num-seqs 8 --max-num-batched-tokens 2048` to fit memory. TP=2 also works; expect 1.22× lower per-replica throughput at bs=16. |
| |
| ## Quantization recipe |
| |
| | Property | Value | |
| |---|---| |
| | Dataset | `HuggingFaceH4/ultrachat_200k` train_sft (V4 chat template) | |
| | Samples | 64 × max_seq_len 512 × batch_size 1, seed 42 | |
| | Modifier class | `QuantizationModifier` (not GPTQ — Hessian-reduce path hangs on multi-rank B300) | |
| | Hardware | calibration on B300 | |
| |
| Calibration corpus is **12× smaller than RedHat's reference recipe** (64 vs 768 samples). On the benchmarks measured, GSM8K / HumanEval / IFEval / MMLU-Pro / AIME-non-truncated all land within noise of the reference. The visible cost of reduced coverage is AIME truncation rate (5/30 vs RedHat's 2/30 at the 65K max_tokens cap), consistent with looser calibration scales producing less-converging reasoning trajectories. A v0.2 recipe with 768 samples is planned. |
| |
| | Group | Modules | Scheme | Format | |
| |---|---|---|---| |
| | attention | `wq_a, wq_b, wkv, wo_a, wo_b` (and fused variants) | FP8_BLOCK 128×128, weight static + input dynamic FP8 group=128 | `float-quantized` | |
| | experts | `w1, w2, w3` per expert | NVFP4 group=16, weight static + input dynamic "local" FP4 group=16 | `nvfp4-pack-quantized` | |
| | ignored | `lm_head`, `embed_tokens`, norms, `ffn.gate`, `ffn.shared_experts`, attn `compressor`, attn `indexer`, `attn_sink`, `hc_*` | unquantized (BF16) | n/a | |
| | MTP block (`mtp.0.*`) | all 799 keys | unquantized (BF16, preserved verbatim) | n/a | |
|
|
| ## vLLM build |
|
|
| ### Common patches (all platforms) |
|
|
| | PR | Purpose | Status | |
| |---|---|---| |
| | [`vllm-project/vllm#43248`](https://github.com/vllm-project/vllm/pull/43248) | `bool()` wrap on `is_static_input_scheme` | open | |
| | [`vllm-project/vllm#43288`](https://github.com/vllm-project/vllm/pull/43288) | `.get("scale_fmt", "ue8m0")` on missing key + BF16 `getattr` follow-up | open | |
| | [`vllm-project/vllm#43290`](https://github.com/vllm-project/vllm/pull/43290) | `weight_scale_inv`-or-`weight_scale` fallback | open | |
| | [`vllm-project/vllm#43319`](https://github.com/vllm-project/vllm/pull/43319) | MTP-quant-detect from safetensors header + BF16 `wo_a` fallback path | open | |
|
|
| The one-line installer applies all four automatically. |
|
|
| ### RTX PRO 6000 Blackwell (SM 12.0) only |
|
|
| Three SM 12.0-specific patches required on top of the four common patches. Diffs in [`patches/sm120_*.diff`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/patches) in the source repo. Full rationale at [`docs/RECIPE_RTX6000PRO.md`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/docs/RECIPE_RTX6000PRO.md). |
|
|
| 1. `VLLM_TEST_FORCE_FP8_MARLIN=1` env var — bypasses the NVFP4 MoE backend selector's `swiglu_limit` filter (no `FLASHINFER_TRTLLM` NVFP4 kernel auto-selects on SM 12.0). |
| 2. `weight_scale_inv`-or-`weight_scale` fallback in Marlin's `scaled_mm/marlin.py` (PR #43290 covers `attention.py` only; SM 12.0 also hits Marlin's pre-process site). |
| 3. Skip Marlin pre-processing for layers tagged `is_bmm=True` — DSV4 `wo_a`/`wo_b`/`compressor.wkv` use the SM 12.0 Triton `fp8_einsum` kernel directly; Marlin's tile-layout repack breaks the original `(N, K)` layout the einsum expects. |
|
|
| B300 deployments can skip all three. |
|
|
| ## Honest limitations |
|
|
| 1. **AIME truncation rate at 65K** — 5/30 of responses hit the cap on long reasoning traces vs RedHat's 2/30. Consistent with the 12×-smaller calibration corpus producing less-converging reasoning trajectories. Non-truncated pass@1 is at parity with RedHat. v0.2 with 768 samples planned. |
| 2. **NVFP4 MoE backend selector on SM 12.0** — no `FLASHINFER_TRTLLM` kernel auto-selects, requires the `VLLM_TEST_FORCE_FP8_MARLIN=1` env var to route through Marlin. Native NVFP4 SM 12.0 kernels exist in upstream vLLM (`csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu`) but aren't picked by the backend selector ([`vllm-project/vllm#31085`](https://github.com/vllm-project/vllm/issues/31085)). |
| 3. **k=1 cap on RTX PRO 6000** — SM 12.0 caps spec-decode at `num_speculative_tokens=1`; B300 supports k=2. |
| 4. **AIME thinking acceptance @ 81.60%** is lower than the chat-code 87.96% headline — workload-dependent, expected, called out for transparency. |
| 5. **IFEval re-bench 2026-05-24 (RTX PRO 6000 TP=4) — close to published B300 numbers but slightly lower.** A fresh `lm_eval ifeval --apply_chat_template num_concurrent=16` measurement on RTX PRO 6000 TP=4 (post-PR-#40923-rebuild) returned **prompt_strict 0.8429, prompt_loose 0.8780, inst_strict 0.8945, inst_loose 0.9185** — within 0.6–1.5 pp of the published markdown numbers (0.8540 / 0.8928 / 0.9005 / 0.9293). The published numbers likely came from B300 (the primary benchmark platform); RTX PRO 6000 measurements are slightly lower but consistent. Raw JSON evidence now committed at [`benchmarks/rtxpro6000/ifeval_2026_05_24.json`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/benchmarks/rtxpro6000/ifeval_2026_05_24.json). Originally flagged as "no on-disk JSON evidence"; that gap is now closed. |
|
|
| ## Reproduction |
|
|
| Full replication recipe at [`docs/recipes/nvfp4_fp8_mtp_replication.md`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/docs/recipes/nvfp4_fp8_mtp_replication.md) — covers the 14 gotchas (sm_103a vs sm_100a, calibration recipe, postprocess pipeline, vLLM build flags). |
|
|
| ## Upstream contributions filed during this work |
|
|
| | PR / Issue | Description | Status | |
| |---|---|---| |
| | [`vllm-project/vllm#43248`](https://github.com/vllm-project/vllm/pull/43248) | `bool()` wrap on `is_static_input_scheme` | open | |
| | [`vllm-project/vllm#43288`](https://github.com/vllm-project/vllm/pull/43288) | `.get("scale_fmt", "ue8m0")` defensive + BF16 follow-up | open | |
| | [`vllm-project/vllm#43290`](https://github.com/vllm-project/vllm/pull/43290) | `weight_scale_inv`-or-`weight_scale` fallback | open | |
| | [`vllm-project/vllm#43319`](https://github.com/vllm-project/vllm/pull/43319) | MTP-quant-detect from safetensors + BF16 `wo_a` fallback | open | |
| | [`vllm-project/vllm#43297`](https://github.com/vllm-project/vllm/issues/43297) | `(1,)`-shape `global_scale` loader broadcast (issue) | open | |
| | [`vllm-project/vllm#43304`](https://github.com/vllm-project/vllm/issues/43304) | MTP draft inherits main quant scheme (issue) | partially addressed by #43319 | |
| | [`vllm-project/llm-compressor#2745`](https://github.com/vllm-project/llm-compressor/issues/2745) | MTP inference-mode crash | open | |
| | [`vllm-project/compressed-tensors#711`](https://github.com/vllm-project/compressed-tensors/issues/711) | sharded-module load path | open | |
|
|
| PR [`vllm-project/vllm#42209`](https://github.com/vllm-project/vllm/pull/42209) (sychen52, xinli-sw, pavanimajety, zyongye — NVIDIA) which added the DSV4 NVFP4 MoE kernel merged 2026-05-22; this artifact serves on top of that. |
|
|
| ## Changes |
|
|
| | Date | Change | |
| |---|---| |
| | 2026-05-21 | Initial release on B300 — GSM8K 0.9181, HumanEval 0.915, IFEval 0.8540, MTP acceptance 87.96% on chat-code | |
| | 2026-05-23 | RTX PRO 6000 Blackwell (SM 12.0) validation added. TP=2 and TP=4 confirmed, MTP acceptance 67–75%, GSM8K-50 within noise of B300 | |
| | 2026-05-24 | **Cross-card finding**: AIME c=4 thinking-mode on RTX PRO 6000 shows this NVFP4 artifact produces **1/30 token-corrupted** generations vs the [W4A16-MTP sibling](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP)'s **14/30 corrupted** on the same hardware + vLLM build. The W4A16 sibling has a Marlin MoE decode race on SM 12.0; this NVFP4 artifact via `flashinfer_trtllm` MoE is the recommended deployment for batched thinking-mode on RTX PRO 6000. Filed upstream: [`jasl/vllm#12`](https://github.com/jasl/vllm/issues/12). | |
|
|
| ## Files in the artifact |
|
|
| - 35 sharded `model-*.safetensors` files + `model.safetensors.index.json` (172 GB total) |
| - `config.json` — vLLM-compatible quantization_config with fused targets + W8A8 input_activations |
| - `tokenizer.json`, `tokenizer_config.json`, `generation_config.json` — upstream DSV4-Flash |
| - `chat_template.jinja` — upstream DSV4-Flash (unchanged) |
| - `recipe.yaml` — the llm-compressor calibration recipe |
| - `README.md` — this file |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{canada-quant-dsv4-flash-nvfp4-fp8-mtp-2026, |
| title = {DeepSeek-V4-Flash NVFP4-FP8 with MTP preserved for vLLM speculative decoding}, |
| author = {Canada Quant}, |
| year = {2026}, |
| publisher = {Hugging Face}, |
| url = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP} |
| } |
| ``` |
|
|
| ## License |
|
|
| MIT, inherited from upstream `deepseek-ai/DeepSeek-V4-Flash`. |
|
|
| ## Acknowledgments |
|
|
| - **DeepSeek** for V4-Flash and the MTP architecture. |
| - **RedHat AI** for the NVFP4-FP8 reference recipe. |
| - **PR [`#42209`](https://github.com/vllm-project/vllm/pull/42209) contributors** (sychen52, xinli-sw, pavanimajety, zyongye) for the DSV4 NVFP4 MoE kernel work that made serving possible. |
| - **[`canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP`](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP)** (W4A16 sibling) for the alias-injection pattern and MTP acceptance methodology. |
| - vLLM, llm-compressor, compressed-tensors, FlashInfer maintainers. |
|
|