README: license=mit, honest calibration disclosure, AIME wall-clock vs decode disambiguated, absolute doc links
README.md
CHANGED
|
@@ -51,12 +51,16 @@ Same 4× B300 hardware, same TP=4, same prompts:
|
|
| 51 |
| Workload | Operating point | This artifact | RedHat |
|
| 52 |
|---|---|---|---|
|
| 53 |
| AIME 2024 reasoning (thinking=high, c=8) | wall-clock for 30 problems | 476s | 1405s |
|
|
|
|
| 54 |
| Coding (HumanEval chat, c=1) | output tok/s | 278.68 | 131.06 |
|
| 55 |
| Coding (HumanEval chat, c=4) | output tok/s | 649.35 | 417.87 |
|
| 56 |
| Coding (HumanEval chat, c=8) | output tok/s | 1104.89 | 673.12 |
|
| 57 |
| Coding (HumanEval chat, c=16) | output tok/s | 1577.20 | 1007.78 |
|
| 58 |
|
| 59 |
-
|
|
|
|
|
|
|
|
|
|
| 60 |
|
| 61 |
## MTP draft acceptance per workload
|
| 62 |
|
|
@@ -77,7 +81,7 @@ TP=4 on 4× B300 (or equivalent Blackwell SXM6 with ≥250 GB HBM each). On this
|
|
| 77 |
|
| 78 |
## Quick start
|
| 79 |
|
| 80 |
-
See [`docs/QUICKSTART.md`](docs/QUICKSTART.md) in the source repo for the full build recipe, or use the one-line installer:
|
| 81 |
|
| 82 |
```bash
|
| 83 |
curl -sL https://raw.githubusercontent.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/main/scripts/install_vllm_with_patches.sh | bash
|
|
@@ -122,11 +126,13 @@ The math of the quantization is the same. The architectural difference is MTP re
|
|
| 122 |
| ignored | `lm_head`, `embed_tokens`, norms, `ffn.gate`, `ffn.shared_experts`, attn `compressor`, attn `indexer`, `attn_sink`, `hc_*` | unquantized (BF16) | n/a |
|
| 123 |
| MTP block (`mtp.0.*`) | all 799 keys | unquantized (BF16, preserved verbatim) | n/a |
|
| 124 |
|
| 125 |
-
Calibration corpus: HuggingFaceH4/ultrachat_200k train_sft, 64 samples × max_seq_len 512 × batch_size 1, seed 42. RedHat
|
|
|
|
|
|
|
| 126 |
|
| 127 |
## vLLM patches required
|
| 128 |
|
| 129 |
-
The artifact loads on vLLM mainline + these 5 patches. They're filed upstream and waiting on review. See [`docs/VLLM_SETUP_ISSUES.md`](docs/VLLM_SETUP_ISSUES.md) for the exact diffs.
|
| 130 |
|
| 131 |
1. PR [#43248](https://github.com/vllm-project/vllm/pull/43248) — `bool()` wrap on `is_static_input_scheme`
|
| 132 |
2. PR [#43288](https://github.com/vllm-project/vllm/pull/43288) — `.get("scale_fmt", "ue8m0")` on missing key + BF16 `getattr` follow-up
|
|
@@ -145,7 +151,7 @@ The one-line installer applies all four automatically.
|
|
| 145 |
|
| 146 |
## Reproduction
|
| 147 |
|
| 148 |
-
Full replication recipe in [`docs/recipes/nvfp4_fp8_mtp_replication.md`](docs/recipes/nvfp4_fp8_mtp_replication.md) — covers the 14 gotchas (sm_103a vs sm_100a, calibration recipe, postprocess pipeline, vLLM build flags).
|
| 149 |
|
| 150 |
## Citation
|
| 151 |
|
|
|
|
| 51 |
| Workload | Operating point | This artifact | RedHat |
|
| 52 |
|---|---|---|---|
|
| 53 |
| AIME 2024 reasoning (thinking=high, c=8) | wall-clock for 30 problems | 476s | 1405s |
|
| 54 |
+
| AIME 2024 reasoning | per-request median tok/s | 182.9 | 99.6 |
|
| 55 |
| Coding (HumanEval chat, c=1) | output tok/s | 278.68 | 131.06 |
|
| 56 |
| Coding (HumanEval chat, c=4) | output tok/s | 649.35 | 417.87 |
|
| 57 |
| Coding (HumanEval chat, c=8) | output tok/s | 1104.89 | 673.12 |
|
| 58 |
| Coding (HumanEval chat, c=16) | output tok/s | 1577.20 | 1007.78 |
|
| 59 |
|
| 60 |
+
The table supports two different speedup ratios, which should not be conflated:
|
| 61 |
+
|
| 62 |
+
- **Pure decode throughput**: at c=1 on chat coding, this artifact is 2.13× faster (278.68 vs 131.06 output tok/s). On AIME reasoning at c=8, the per-request median decode rate is 182.9 vs 99.6 tok/s, a **1.84×** decode speedup. The decode ratio is workload-dependent (the MTP draft acceptance rate varies), and these two figures put it in the 1.8–2.1× range.
|
| 63 |
+
- **AIME batch wall-clock**: 1405s / 476s = **2.95×**. This figure also reflects the truncation-rate differential at the 65K max_tokens cap: 5/30 of our responses truncated vs 2/30 of RedHat's, and every truncated response runs all the way to the cap, which adds to that run's total wall-clock. The 2.95× ratio therefore measures "time to run AIME 2024 end-to-end" rather than pure decode speed. Cite it for "how long does the bench take", not for "how fast does the model decode".
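The two ratios above can be recomputed directly from the table. The following is a minimal sketch, not part of the benchmark harness: the dictionary layout and the function names `throughput_speedup` and `wall_clock_speedup` are illustrative assumptions, and only the numbers come from the table.

```python
# Figures copied from the benchmark table: (this artifact, RedHat).
# The dictionary layout and helper names are illustrative, not from the repo.
THROUGHPUT_TOK_S = {
    "AIME per-request median": (182.9, 99.6),
    "HumanEval chat c=1": (278.68, 131.06),
    "HumanEval chat c=4": (649.35, 417.87),
    "HumanEval chat c=8": (1104.89, 673.12),
    "HumanEval chat c=16": (1577.20, 1007.78),
}
AIME_WALL_CLOCK_S = (476, 1405)  # seconds for 30 problems


def throughput_speedup(ours_tok_s: float, baseline_tok_s: float) -> float:
    """Higher tok/s is better, so the speedup is ours / baseline."""
    return ours_tok_s / baseline_tok_s


def wall_clock_speedup(ours_s: float, baseline_s: float) -> float:
    """Lower seconds is better, so the speedup is baseline / ours."""
    return baseline_s / ours_s


if __name__ == "__main__":
    for workload, (ours, baseline) in THROUGHPUT_TOK_S.items():
        print(f"{workload}: {throughput_speedup(ours, baseline):.2f}x")
    print(f"AIME wall-clock: {wall_clock_speedup(*AIME_WALL_CLOCK_S):.2f}x")
```

Running it also prints the aggregate throughput ratios at c=4 through c=16, which come out lower than the c=1 figure; those rows measure batch throughput under load rather than single-stream decode.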
|
| 64 |
|
| 65 |
## MTP draft acceptance per workload
|
| 66 |
|
|
|
|
| 81 |
|
| 82 |
## Quick start
|
| 83 |
|
| 84 |
+
See [`docs/QUICKSTART.md`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/docs/QUICKSTART.md) in the source repo for the full build recipe, or use the one-line installer:
|
| 85 |
|
| 86 |
```bash
|
| 87 |
curl -sL https://raw.githubusercontent.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/main/scripts/install_vllm_with_patches.sh | bash
|
|
|
|
| 126 |
| ignored | `lm_head`, `embed_tokens`, norms, `ffn.gate`, `ffn.shared_experts`, attn `compressor`, attn `indexer`, `attn_sink`, `hc_*` | unquantized (BF16) | n/a |
|
| 127 |
| MTP block (`mtp.0.*`) | all 799 keys | unquantized (BF16, preserved verbatim) | n/a |
|
| 128 |
|
| 129 |
+
Calibration corpus: HuggingFaceH4/ultrachat_200k train_sft, **64 samples** × max_seq_len 512 × batch_size 1, seed 42. RedHat's reference recipe uses 768 samples × 512 from the same corpus.
|
| 130 |
+
|
| 131 |
+
The 64-sample recipe was chosen because of time and compute constraints during initial bring-up; it covers 12× fewer calibration samples than RedHat's. On the benchmarks measured here (GSM8K, HumanEval, IFEval, MMLU-Pro, and the non-truncated AIME problems), results land within noise of the reference. The visible cost of the reduced coverage is the AIME truncation rate: 5/30 of our responses hit the 65K max_tokens cap on long reasoning traces vs 2/30 of RedHat's, which is consistent with looser calibration scales producing reasoning trajectories that converge less often. A v0.2 recipe with 768 samples is planned.
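To make the coverage gap concrete, here is a small sketch of the arithmetic behind the "12×" figure and the two truncation rates. The helper name `calibration_token_budget` is hypothetical, and the token count is an upper bound under the assumption that every sample is truncated to `max_seq_len`; shorter samples would contribute fewer tokens.

```python
def calibration_token_budget(num_samples: int, max_seq_len: int) -> int:
    """Upper bound on calibration tokens, assuming each sample fills max_seq_len."""
    return num_samples * max_seq_len


# Recipe parameters as stated in the README.
ours_tokens = calibration_token_budget(num_samples=64, max_seq_len=512)
reference_tokens = calibration_token_budget(num_samples=768, max_seq_len=512)
coverage_ratio = reference_tokens / ours_tokens

# AIME 2024 truncation counts at the 65K max_tokens cap, out of 30 problems.
ours_truncation_rate = 5 / 30
reference_truncation_rate = 2 / 30

if __name__ == "__main__":
    print(f"ours: at most {ours_tokens} calibration tokens")
    print(f"reference: at most {reference_tokens} calibration tokens")
    print(f"coverage ratio: {coverage_ratio:.0f}x")
    print(f"truncation: {ours_truncation_rate:.1%} vs {reference_truncation_rate:.1%}")
```

With only 30 problems per run, a gap of three truncated responses is a small sample, so the rates are best read as indicative.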
|
| 132 |
|
| 133 |
## vLLM patches required
|
| 134 |
|
| 135 |
+
The artifact loads on vLLM mainline + these 5 patches. They're filed upstream and waiting on review. See [`docs/VLLM_SETUP_ISSUES.md`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/docs/VLLM_SETUP_ISSUES.md) for the exact diffs.
|
| 136 |
|
| 137 |
1. PR [#43248](https://github.com/vllm-project/vllm/pull/43248) — `bool()` wrap on `is_static_input_scheme`
|
| 138 |
2. PR [#43288](https://github.com/vllm-project/vllm/pull/43288) — `.get("scale_fmt", "ue8m0")` on missing key + BF16 `getattr` follow-up
|
|
|
|
| 151 |
|
| 152 |
## Reproduction
|
| 153 |
|
| 154 |
+
The full replication recipe is in [`docs/recipes/nvfp4_fp8_mtp_replication.md`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/docs/recipes/nvfp4_fp8_mtp_replication.md). It covers the 14 gotchas (sm_103a vs sm_100a, calibration recipe, postprocess pipeline, vLLM build flags).
|
| 155 |
|
| 156 |
## Citation
|
| 157 |
|