pastapaul's picture
Upload README.md with huggingface_hub
3af55f1 verified
---
license: mit
base_model: deepseek-ai/DeepSeek-V4-Flash
base_model_relation: quantized
language:
- en
- zh
library_name: vllm
pipeline_tag: text-generation
tags:
- deepseek
- deepseek_v4
- compressed-tensors
- nvfp4
- fp8
- mtp
- speculative-decoding
- mixture-of-experts
- moe
- vllm
---
# canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP
NVFP4 routed experts + FP8 block 128×128 attention + **BF16 Multi-Token Prediction (MTP) draft head retained** — same quantization math as [`RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8`](https://huggingface.co/RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8) but with the MTP block preserved in the saved weights so vLLM can load it with `--speculative-config method=mtp`.
## TL;DR
| | |
|---|---|
| **Recommended hardware** | 4× B300 TP=4 · or RTX PRO 6000 Blackwell at **TP=2 (2 GPUs/replica)** or **TP=4 (4 GPUs/replica)** — both validated |
| **Quality** | GSM8K 91.81% strict (8-shot); MMLU-Pro 81.13%; HumanEval pass@1 0.915 (EvalPlus) |
| **Throughput** | 278.68 output tok/s @ bs=1 chat-code on B300 TP=4 (2.13× vs RedHat NVFP4); RTX PRO 6000 **94.6 @ TP=2** / **101.0 @ TP=4** at bs=1 |
| **MTP acceptance** | 87.96% on chat-code at bs=1 / k=2 — flat across bs=1 to bs=16 |
| **Spec-decode speedup** | **1.8–2.1× decode** vs RedHat NVFP4 (workload-dependent) |
| **Differentiator** | Only V4-Flash NVFP4 quant where `--speculative-config method=mtp` actually fires — RedHat's artifact dropped MTP during calibration load |
## Family / related artifacts
| Repo | Role | Relation to this artifact |
|---|---|---|
| [`canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP`](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP) | sibling | W4A16 routed experts (Hopper-compatible), MTP retained — same MTP-preservation pattern. **Note**: on RTX PRO 6000 (SM 12.0) the W4A16 sibling's Marlin MoE decode path corrupts ~50% of generations under concurrent thinking-mode load. **For batched thinking-mode workloads on SM 12.0, this NVFP4 artifact is the recommended choice.** See [Card D's Honest limitations](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP#honest-limitations) and the [debug log](https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp/blob/main/docs/findings/sm12x_token_corruption_2026_05_24.md). |
| [`canada-quant/DeepSeek-V4-Flash-W4A16-FP8`](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8) | predecessor (no-MTP baseline) | W4A16 + FP8 without MTP — broadest hardware compatibility |
| [`canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP`](https://huggingface.co/canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP) | larger sibling | Same NVFP4 + MTP recipe applied to V4-Pro; B300-only deployment |
| [`RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8`](https://huggingface.co/RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8) | upstream reference | Same quant math; MTP block dropped by `transformers` silent-strip (the bug this artifact fixes) |
## Why this exists
The HF transformers DSV4 modeling class declares `_keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"]`, which silently strips MTP keys during the calibration load path. RedHat's NVFP4-FP8 artifact ran through that path, so their saved weights don't include MTP — and serving cannot use V4-Flash's spec-decode head. This artifact patches the modeling class during calibration so MTP keys (`mtp.0.*`, 799 tensors) survive at BF16. The result: an NVFP4 artifact that's structurally identical to RedHat's on the math, but loadable with `--speculative-config method=mtp` for ~2× decode speedup.
## Architecture & precision
### Base model
| Property | Value |
|---|---|
| Total parameters | ~284 B (~13 B active per token) |
| Decoder layers | 43 |
| Routed experts / layer | 256 (top-K = 6) |
| Hidden size | 4096 |
| Base BF16 size | ~600 GB |
| Quantized size | **172 GB** across 35 safetensors shards |
### Component precisions
| Component | Format | Method |
|---|---|---|
| Routed FFN experts (`w1`, `w2`, `w3` per expert) | NVFP4 group=16 | weight static + input dynamic "local" FP4 group=16, `nvfp4-pack-quantized` |
| Attention path (`wq_a`, `wq_b`, `wkv`, `wo_a`, `wo_b` and fused) | FP8_BLOCK 128×128 | weight static + input dynamic FP8 group=128, `float-quantized` |
| **MTP block (`mtp.0.*`)** | **BF16** | **Preserved verbatim (799 tensors)** |
| `lm_head`, `embed_tokens`, norms, `ffn.gate`, `ffn.shared_experts`, attn `compressor`, attn `indexer`, `attn_sink`, `hc_*` | BF16 | Unquantized |
## Hardware validated
| Platform | SM | HBM/GPU | Interconnect | TP | Role |
|---|---|---|---|---|---|
| 4× NVIDIA B300 SXM6 AC | 10.3, sm_103a | 288 GB HBM3e | NVLink | 4 (TP=8 for BF16 reference) | Primary — all accuracy + throughput numbers |
| 4× NVIDIA RTX PRO 6000 Blackwell Server Edition | 12.0, sm_120 | 96 GB HBM | PCIe | **TP=2 (2 GPUs, 2 replicas on a 4-GPU box) or TP=4 (4 GPUs, 1 replica)** | Also validated — both TP configs + GSM8K-50 cross-check, 3 extra patches |
Both platforms serve cuda graphs ON. Same artifact, no weight changes between SKUs.
## Benchmarks
### Quality (hardware-invariant — measured on B300)
Measured 2026-05-21 on 4× B300 SXM6 AC (TP=4 for quant configs, TP=8 for BF16 reference which doesn't fit at TP=4). Greedy, temperature 0. The same artifact serves on RTX PRO 6000 Blackwell with no weight changes; GSM8K-50 cross-check: 88% strict TP=2 / 90% strict TP=4 on RTX PRO 6000 vs 91.81% strict full-set on B300 (within noise).
| Benchmark | Setting | This artifact | BF16 + MTP reference | `RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8` (no MTP) |
|---|---|---|---|---|
| AIME 2024 | raw pass@1, thinking=high, max_tokens=65536 | 25/30 = 83.33% | 25/30 = 83.33% | 27/30 = 90.00% |
| AIME 2024 | non-truncated pass@1 | 24/25 = 96.00% | 25/26 = 96.15% | 27/28 = 96.43% |
| AIME 2024 | wall-clock for 30 problems @ bs=8 | **476 s** | 490 s | 1405 s |
| GSM8K | 8-shot, strict-match | 0.9181 | 0.9484 / 0.9522 (no-MTP / MTP) | 0.910 (self-reported) |
| GSM8K | 8-shot, flexible-extract | 0.9515 | 0.9477 / 0.9515 | not reported |
| MMLU-Pro | 5-shot, custom-extract | 0.8113 | not measured | not reported |
| HumanEval | pass@1 (EvalPlus) | **0.915** | not measured | 0.896 |
| HumanEval+ | pass@1 (EvalPlus) | 0.848 | not measured | 0.860 |
| IFEval | prompt-strict (B300) | **0.8540** | not measured | 0.8207 |
| IFEval | prompt-strict (RTX PRO 6000 TP=4, 2026-05-24, [JSON evidence](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/benchmarks/rtxpro6000/ifeval_2026_05_24.json)) | **0.8429** (-1.1pp vs B300) | — | — |
On raw AIME pass@1, RedHat scores higher (27/30 vs ours 25/30) — but the gap is **entirely truncation rate** at the 65K max_tokens cap (RedHat truncated 2/30, ours 5/30). On non-truncated pass@1, all three configs are within 0.4 pt of each other (96.0–96.4%). Quantization quality is equivalent on AIME 2024; the differentiator is wall-clock.
### Throughput
#### 4× B300 SXM6 (sm_103a, NVLink, TP=4)
Same hardware, same TP=4, same prompts as the quality table.
| Workload | Operating point | This artifact | RedHat NVFP4 (no MTP) | Ratio |
|---|---|---|---|---|
| AIME 2024 reasoning (thinking=high, bs=8) | wall-clock for 30 problems | 476 s | 1405 s | **2.95×** |
| AIME 2024 reasoning | per-request median output tok/s | 182.9 | 99.6 | **1.84×** |
| Coding (HumanEval chat, bs=1) | output tok/s | **278.68** | 131.06 | 2.13× |
| Coding (HumanEval chat, bs=4) | output tok/s | 649.35 | 417.87 | 1.55× |
| Coding (HumanEval chat, bs=8) | output tok/s | 1104.89 | 673.12 | 1.64× |
| Coding (HumanEval chat, bs=16) | output tok/s | 1577.20 | 1007.78 | 1.56× |
Two ratios to disambiguate:
- **Pure decode throughput**: at bs=1 chat coding, 2.13× faster. On AIME reasoning at bs=8, per-request median is 182.9 vs 99.6 tok/s — **1.84×**. The decode ratio is workload-dependent (acceptance % varies) but lands in the **1.8–2.1×** range across measured workloads.
- **AIME batch wall-clock**: 1405 s / 476 s = **2.95×**. This includes the truncation-rate differential at 65K — 5/30 of our responses truncated vs 2/30 of RedHat's, and truncated responses run to the cap, inflating RedHat's total wall-clock. The 2.95× number is "time to run AIME 2024 end-to-end," not "raw decode speed."
#### 4× RTX PRO 6000 Blackwell (sm_120, PCIe, TP=2 and TP=4)
Validated 2026-05-23 on a Brev `familiar-teal-worm` instance. Per-replica `vllm bench serve` random 256-in/256-out, `num_speculative_tokens=1` (SM 12.0 caps spec at k=1). MTP-on for all rows.
| Config | bs=1 output tok/s | bs=4 output tok/s | bs=16 output tok/s | bs=1 TPOT median | MTP acceptance | GSM8K-50 strict |
|---|---|---|---|---|---|---|
| TP=2 | 94.6 | 218.5 | 360.5 | 9.05 ms | 70–73% | 88% |
| TP=4 | **101.0** | 254.0 | **440.1** | **8.20 ms** | 67–75% | 90% |
At bs=16, TP=4 is 1.22× faster per-replica than TP=2 on this hardware — opposite of B300, where TP=4 beats TP=8 due to NVFP4 tensor-core underutilization. RTX PRO 6000's slower PCIe interconnect plus lower per-GPU compute means extra parallelism still pays off at all batch sizes measured.
For context on the same RTX PRO 6000 box, the [W4A16-FP8-MTP sibling](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP) measured 98.83 tok/s at TP=2 bs=1 — equivalent decode throughput, with NVFP4 trading ~4% per-replica throughput for ~10% smaller on-disk footprint (172 GB vs 159 GB).
##### AIME-2024 deep thinking-mode concurrency sweep (2026-05-25, TP=4)
cuda graphs ON (capture sizes [1,2,4,8]), MTP `num_speculative_tokens=1`, `max-model-len=16384`. Bench JSONs at [`canada-quant/dsv4-flash-nvfp4-fp8-mtp/benchmarks/rtxpro6000/`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/tree/main/benchmarks/rtxpro6000).
| Concurrency | Correct/30 | Stop / Length | Errors | Wall (s) | Problems/min | MTP accept | Speedup vs c=1 |
|---|---|---|---|---|---|---|---|
| c=1 (sequential) | **24/30** (80.0%) | 22 / 8 | 0 | 1453.9 | 1.24 | 90.61% | 1.0× |
| c=2 | **23/30** (76.7%) | 23 / 7 | 0 | 787.6 | 2.29 | 90.75% | 1.85× |
| c=4 | **21/30** (70.0%) | 20 / 10 | 0 | 386.6 | 4.66 | 90.93% | 3.76× |
| c=8 | (terminated) | — | — | — | — | — | — |
**Findings:**
- **0 errors and 0 stopped-but-wrong at c=1/2/4.** Every wrong answer is length-truncated at `max_tokens`, not a quality issue — non-truncated pass@1 is essentially 100%.
- **MTP acceptance stable at 90.6–90.9%** across c=1/c=2/c=4. The NVFP4 `flashinfer_trtllm` MoE backend on SM 12.0 is rock-solid under all tested concurrencies (unlike the W4A16 sibling's Marlin MoE path — see Card D for that story).
- **c=8 throughput collapse**: TP=4 with no NVLink (PCIe-only) drops combined throughput from 450 t/s @ c=4 to ~38 t/s @ c=8 — a 12× per-request slowdown. MTP itself stayed healthy; the bottleneck is TP-allreduce communication over PCIe at high concurrency. **Recommendation for higher aggregate throughput on RTX PRO 6000: run 2 replicas at TP=2 instead of 1 replica at TP=4 c=8.**
### MTP draft-token acceptance per workload (B300, bs=1, k=2)
| Workload | Acceptance |
|---|---|
| Random prompts (1024 in / 512 out) | 10.75% |
| Code, raw completion (HumanEval `/v1/completions`) | 67.29% |
| Code, chat-templated (HumanEval `/v1/chat/completions`, bs=1) | **87.96%** |
| Code, chat-templated, bs=4 / bs=8 / bs=16 | 88.27% / 87.92% / 88.19% |
| Instruction following (IFEval) | ~58.5% |
| AIME 2024 reasoning (thinking=high) | 81.60% |
Acceptance does not degrade under batching — flat at 88.0% ± 0.4% across bs=1 to bs=16 on chat-templated coding. RTX PRO 6000 acceptance lands in 67–75% on the random-prompt workload (256-in/256-out, not directly comparable to the workload-specific rows above).
## Quick start
One-line installer (applies all common patches):
```bash
curl -sL https://raw.githubusercontent.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/main/scripts/install_vllm_with_patches.sh | bash
```
Serve with MTP spec-decode (B300):
```bash
CUDA_HOME=/usr/local/cuda VLLM_TEST_FORCE_FP8_MARLIN=1 \
vllm serve canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP \
--tensor-parallel-size 4 \
--kv-cache-dtype fp8 \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'
```
Without spec-decode:
```bash
CUDA_HOME=/usr/local/cuda VLLM_TEST_FORCE_FP8_MARLIN=1 \
vllm serve canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP \
--tensor-parallel-size 4 \
--kv-cache-dtype fp8
```
**Recommended TP:**
- **B300**: TP=4. TP=8 is slower than TP=4 at bs≥4 by up to 21.6% — per-rank MoE expert shards at TP=8 underutilize NVFP4 tensor-core kernels.
- **RTX PRO 6000**: TP=4 with reduced cudagraph captures + `--max-num-seqs 8 --max-num-batched-tokens 2048` to fit memory. TP=2 also works; expect 1.22× lower per-replica throughput at bs=16.
## Quantization recipe
| Property | Value |
|---|---|
| Dataset | `HuggingFaceH4/ultrachat_200k` train_sft (V4 chat template) |
| Samples | 64 × max_seq_len 512 × batch_size 1, seed 42 |
| Modifier class | `QuantizationModifier` (not GPTQ — Hessian-reduce path hangs on multi-rank B300) |
| Hardware | calibration on B300 |
Calibration corpus is **12× smaller than RedHat's reference recipe** (64 vs 768 samples). On the benchmarks measured, GSM8K / HumanEval / IFEval / MMLU-Pro / AIME-non-truncated all land within noise of the reference. The visible cost of reduced coverage is AIME truncation rate (5/30 vs RedHat's 2/30 at the 65K max_tokens cap), consistent with looser calibration scales producing less-converging reasoning trajectories. A v0.2 recipe with 768 samples is planned.
| Group | Modules | Scheme | Format |
|---|---|---|---|
| attention | `wq_a, wq_b, wkv, wo_a, wo_b` (and fused variants) | FP8_BLOCK 128×128, weight static + input dynamic FP8 group=128 | `float-quantized` |
| experts | `w1, w2, w3` per expert | NVFP4 group=16, weight static + input dynamic "local" FP4 group=16 | `nvfp4-pack-quantized` |
| ignored | `lm_head`, `embed_tokens`, norms, `ffn.gate`, `ffn.shared_experts`, attn `compressor`, attn `indexer`, `attn_sink`, `hc_*` | unquantized (BF16) | n/a |
| MTP block (`mtp.0.*`) | all 799 keys | unquantized (BF16, preserved verbatim) | n/a |
## vLLM build
### Common patches (all platforms)
| PR | Purpose | Status |
|---|---|---|
| [`vllm-project/vllm#43248`](https://github.com/vllm-project/vllm/pull/43248) | `bool()` wrap on `is_static_input_scheme` | open |
| [`vllm-project/vllm#43288`](https://github.com/vllm-project/vllm/pull/43288) | `.get("scale_fmt", "ue8m0")` on missing key + BF16 `getattr` follow-up | open |
| [`vllm-project/vllm#43290`](https://github.com/vllm-project/vllm/pull/43290) | `weight_scale_inv`-or-`weight_scale` fallback | open |
| [`vllm-project/vllm#43319`](https://github.com/vllm-project/vllm/pull/43319) | MTP-quant-detect from safetensors header + BF16 `wo_a` fallback path | open |
The one-line installer applies all four automatically.
### RTX PRO 6000 Blackwell (SM 12.0) only
Three SM 12.0-specific patches required on top of the four common patches. Diffs in [`patches/sm120_*.diff`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/patches) in the source repo. Full rationale at [`docs/RECIPE_RTX6000PRO.md`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/docs/RECIPE_RTX6000PRO.md).
1. `VLLM_TEST_FORCE_FP8_MARLIN=1` env var — bypasses the NVFP4 MoE backend selector's `swiglu_limit` filter (no `FLASHINFER_TRTLLM` NVFP4 kernel auto-selects on SM 12.0).
2. `weight_scale_inv`-or-`weight_scale` fallback in Marlin's `scaled_mm/marlin.py` (PR #43290 covers `attention.py` only; SM 12.0 also hits Marlin's pre-process site).
3. Skip Marlin pre-processing for layers tagged `is_bmm=True` — DSV4 `wo_a`/`wo_b`/`compressor.wkv` use the SM 12.0 Triton `fp8_einsum` kernel directly; Marlin's tile-layout repack breaks the original `(N, K)` layout the einsum expects.
B300 deployments can skip all three.
## Honest limitations
1. **AIME truncation rate at 65K** — 5/30 of responses hit the cap on long reasoning traces vs RedHat's 2/30. Consistent with the 12×-smaller calibration corpus producing less-converging reasoning trajectories. Non-truncated pass@1 is at parity with RedHat. v0.2 with 768 samples planned.
2. **NVFP4 MoE backend selector on SM 12.0** — no `FLASHINFER_TRTLLM` kernel auto-selects, requires the `VLLM_TEST_FORCE_FP8_MARLIN=1` env var to route through Marlin. Native NVFP4 SM 12.0 kernels exist in upstream vLLM (`csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu`) but aren't picked by the backend selector ([`vllm-project/vllm#31085`](https://github.com/vllm-project/vllm/issues/31085)).
3. **k=1 cap on RTX PRO 6000** — SM 12.0 caps spec-decode at `num_speculative_tokens=1`; B300 supports k=2.
4. **AIME thinking acceptance @ 81.60%** is lower than the chat-code 87.96% headline — workload-dependent, expected, called out for transparency.
5. **IFEval re-bench 2026-05-24 (RTX PRO 6000 TP=4) — close to published B300 numbers but slightly lower.** A fresh `lm_eval ifeval --apply_chat_template num_concurrent=16` measurement on RTX PRO 6000 TP=4 (post-PR-#40923-rebuild) returned **prompt_strict 0.8429, prompt_loose 0.8780, inst_strict 0.8945, inst_loose 0.9185** — within 0.6–1.5 pp of the published markdown numbers (0.8540 / 0.8928 / 0.9005 / 0.9293). The published numbers likely came from B300 (the primary benchmark platform); RTX PRO 6000 measurements are slightly lower but consistent. Raw JSON evidence now committed at [`benchmarks/rtxpro6000/ifeval_2026_05_24.json`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/benchmarks/rtxpro6000/ifeval_2026_05_24.json). Originally flagged as "no on-disk JSON evidence"; that gap is now closed.
## Reproduction
Full replication recipe at [`docs/recipes/nvfp4_fp8_mtp_replication.md`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/docs/recipes/nvfp4_fp8_mtp_replication.md) — covers the 14 gotchas (sm_103a vs sm_100a, calibration recipe, postprocess pipeline, vLLM build flags).
## Upstream contributions filed during this work
| PR / Issue | Description | Status |
|---|---|---|
| [`vllm-project/vllm#43248`](https://github.com/vllm-project/vllm/pull/43248) | `bool()` wrap on `is_static_input_scheme` | open |
| [`vllm-project/vllm#43288`](https://github.com/vllm-project/vllm/pull/43288) | `.get("scale_fmt", "ue8m0")` defensive + BF16 follow-up | open |
| [`vllm-project/vllm#43290`](https://github.com/vllm-project/vllm/pull/43290) | `weight_scale_inv`-or-`weight_scale` fallback | open |
| [`vllm-project/vllm#43319`](https://github.com/vllm-project/vllm/pull/43319) | MTP-quant-detect from safetensors + BF16 `wo_a` fallback | open |
| [`vllm-project/vllm#43297`](https://github.com/vllm-project/vllm/issues/43297) | `(1,)`-shape `global_scale` loader broadcast (issue) | open |
| [`vllm-project/vllm#43304`](https://github.com/vllm-project/vllm/issues/43304) | MTP draft inherits main quant scheme (issue) | partially addressed by #43319 |
| [`vllm-project/llm-compressor#2745`](https://github.com/vllm-project/llm-compressor/issues/2745) | MTP inference-mode crash | open |
| [`vllm-project/compressed-tensors#711`](https://github.com/vllm-project/compressed-tensors/issues/711) | sharded-module load path | open |
PR [`vllm-project/vllm#42209`](https://github.com/vllm-project/vllm/pull/42209) (sychen52, xinli-sw, pavanimajety, zyongye — NVIDIA) which added the DSV4 NVFP4 MoE kernel merged 2026-05-22; this artifact serves on top of that.
## Changes
| Date | Change |
|---|---|
| 2026-05-21 | Initial release on B300 — GSM8K 0.9181, HumanEval 0.915, IFEval 0.8540, MTP acceptance 87.96% on chat-code |
| 2026-05-23 | RTX PRO 6000 Blackwell (SM 12.0) validation added. TP=2 and TP=4 confirmed, MTP acceptance 67–75%, GSM8K-50 within noise of B300 |
| 2026-05-24 | **Cross-card finding**: AIME c=4 thinking-mode on RTX PRO 6000 shows this NVFP4 artifact produces **1/30 token-corrupted** generations vs the [W4A16-MTP sibling](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP)'s **14/30 corrupted** on the same hardware + vLLM build. The W4A16 sibling has a Marlin MoE decode race on SM 12.0; this NVFP4 artifact via `flashinfer_trtllm` MoE is the recommended deployment for batched thinking-mode on RTX PRO 6000. Filed upstream: [`jasl/vllm#12`](https://github.com/jasl/vllm/issues/12). |
## Files in the artifact
- 35 sharded `model-*.safetensors` files + `model.safetensors.index.json` (172 GB total)
- `config.json` — vLLM-compatible quantization_config with fused targets + W8A8 input_activations
- `tokenizer.json`, `tokenizer_config.json`, `generation_config.json` — upstream DSV4-Flash
- `chat_template.jinja` — upstream DSV4-Flash (unchanged)
- `recipe.yaml` — the llm-compressor calibration recipe
- `README.md` — this file
## Citation
```bibtex
@misc{canada-quant-dsv4-flash-nvfp4-fp8-mtp-2026,
title = {DeepSeek-V4-Flash NVFP4-FP8 with MTP preserved for vLLM speculative decoding},
author = {Canada Quant},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP}
}
```
## License
MIT, inherited from upstream `deepseek-ai/DeepSeek-V4-Flash`.
## Acknowledgments
- **DeepSeek** for V4-Flash and the MTP architecture.
- **RedHat AI** for the NVFP4-FP8 reference recipe.
- **PR [`#42209`](https://github.com/vllm-project/vllm/pull/42209) contributors** (sychen52, xinli-sw, pavanimajety, zyongye) for the DSV4 NVFP4 MoE kernel work that made serving possible.
- **[`canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP`](https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP)** (W4A16 sibling) for the alias-injection pattern and MTP acceptance methodology.
- vLLM, llm-compressor, compressed-tensors, FlashInfer maintainers.