pastapaul committed
Commit 4c83e4b · verified · 1 Parent(s): e03e070

README: license=mit, honest calibration disclosure, AIME wall-clock vs decode disambiguated, absolute doc links

Files changed (1)
  1. README.md +11 -5
README.md CHANGED
@@ -51,12 +51,16 @@ Same 4× B300 hardware, same TP=4, same prompts:
51
  | Workload | Operating point | This artifact | RedHat |
52
  |---|---|---|---|
53
  | AIME 2024 reasoning (thinking=high, c=8) | wall-clock for 30 problems | 476s | 1405s |
 
54
  | Coding (HumanEval chat, c=1) | output tok/s | 278.68 | 131.06 |
55
  | Coding (HumanEval chat, c=4) | output tok/s | 649.35 | 417.87 |
56
  | Coding (HumanEval chat, c=8) | output tok/s | 1104.89 | 673.12 |
57
  | Coding (HumanEval chat, c=16) | output tok/s | 1577.20 | 1007.78 |
58
 
59
- The speedup ratio at c=1 chat coding is 2.13×; at AIME reasoning it's 2.95×. Reasoning has longer outputs (~17K tokens vs ~500 for coding), which amplifies MTP's per-step advantage. The speedup is bigger on workloads with long outputs and high token-level predictability.
 
 
 
60
 
61
  ## MTP draft acceptance per workload
62
 
@@ -77,7 +81,7 @@ TP=4 on 4× B300 (or equivalent Blackwell SXM6 with ≥250 GB HBM each). On this
77
 
78
  ## Quick start
79
 
80
- See [`docs/QUICKSTART.md`](docs/QUICKSTART.md) in the source repo for the full build recipe, or use the one-line installer:
81
 
82
  ```bash
83
  curl -sL https://raw.githubusercontent.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/main/scripts/install_vllm_with_patches.sh | bash
@@ -122,11 +126,13 @@ The math of the quantization is the same. The architectural difference is MTP re
122
  | ignored | `lm_head`, `embed_tokens`, norms, `ffn.gate`, `ffn.shared_experts`, attn `compressor`, attn `indexer`, `attn_sink`, `hc_*` | unquantized (BF16) | n/a |
123
  | MTP block (`mtp.0.*`) | all 799 keys | unquantized (BF16, preserved verbatim) | n/a |
124
 
125
- Calibration corpus: HuggingFaceH4/ultrachat_200k train_sft, 64 samples × max_seq_len 512 × batch_size 1, seed 42. RedHat used 768 samples × 512 from the same corpus; the 64-sample recipe is faster and produces calibration scales close enough that quality benchmarks land within noise.
 
 
126
 
127
  ## vLLM patches required
128
 
129
- The artifact loads on vLLM mainline + these 5 patches. They're filed upstream and waiting on review. See [`docs/VLLM_SETUP_ISSUES.md`](docs/VLLM_SETUP_ISSUES.md) for the exact diffs.
130
 
131
  1. PR [#43248](https://github.com/vllm-project/vllm/pull/43248) — `bool()` wrap on `is_static_input_scheme`
132
  2. PR [#43288](https://github.com/vllm-project/vllm/pull/43288) — `.get("scale_fmt", "ue8m0")` on missing key + BF16 `getattr` follow-up
@@ -145,7 +151,7 @@ The one-line installer applies all four automatically.
145
 
146
  ## Reproduction
147
 
148
- Full replication recipe in [`docs/recipes/nvfp4_fp8_mtp_replication.md`](docs/recipes/nvfp4_fp8_mtp_replication.md) — covers the 14 gotchas (sm_103a vs sm_100a, calibration recipe, postprocess pipeline, vLLM build flags).
149
 
150
  ## Citation
151
 
 
51
  | Workload | Operating point | This artifact | RedHat |
52
  |---|---|---|---|
53
  | AIME 2024 reasoning (thinking=high, c=8) | wall-clock for 30 problems | 476s | 1405s |
54
+ | AIME 2024 reasoning | per-request median tok/s | 182.9 | 99.6 |
55
  | Coding (HumanEval chat, c=1) | output tok/s | 278.68 | 131.06 |
56
  | Coding (HumanEval chat, c=4) | output tok/s | 649.35 | 417.87 |
57
  | Coding (HumanEval chat, c=8) | output tok/s | 1104.89 | 673.12 |
58
  | Coding (HumanEval chat, c=16) | output tok/s | 1577.20 | 1007.78 |
59
 
60
+ Two different ratios to disambiguate:
61
+
62
+ - **Pure decode throughput**: at c=1 chat coding, ours is 2.13× faster. On AIME reasoning at c=8, the per-request median decode rate is 182.9 vs 99.6 tok/s — a **1.84×** decode speedup. The decode ratio is workload-dependent (acceptance % varies) but lands in the 1.8–2.1× range across the workloads measured.
63
+ - **AIME batch wall-clock**: 1405s / 476s = **2.95×**. This includes the truncation-rate differential at the 65K max_tokens cap: 5/30 of our responses truncated vs 2/30 of RedHat's, and truncated responses run to the cap, inflating RedHat's total wall-clock. The 2.95× ratio measures "time to run AIME 2024 end-to-end" rather than pure decode speed, and is the right number to cite for "how long does the bench take" but not for "how fast does the model decode."
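
The quoted ratios can be re-derived from the table figures. The snippet below is a minimal sketch of that arithmetic, not part of the repo: the variable and function names are made up for illustration, and the only inputs are the numbers stated in the README text.

```python
# Figures copied from the README table and text: (this artifact, RedHat).
# All names below are illustrative and do not come from the repository.
HUMANEVAL_C1_TOK_S = (278.68, 131.06)  # output tok/s, higher is better
AIME_MEDIAN_TOK_S = (182.9, 99.6)      # per-request median tok/s, higher is better
AIME_WALL_CLOCK_S = (476.0, 1405.0)    # seconds for 30 problems, lower is better
AIME_TRUNCATED = (5, 2)                # responses that hit the 65K max_tokens cap
AIME_PROBLEMS = 30


def throughput_speedup(ours: float, baseline: float) -> float:
    """Speedup for a higher-is-better metric (tokens per second)."""
    return ours / baseline


def wall_clock_speedup(ours_s: float, baseline_s: float) -> float:
    """Speedup for a lower-is-better metric (elapsed seconds)."""
    return baseline_s / ours_s


ratios = {
    "HumanEval c=1 decode": throughput_speedup(*HUMANEVAL_C1_TOK_S),
    "AIME per-request decode": throughput_speedup(*AIME_MEDIAN_TOK_S),
    "AIME batch wall-clock": wall_clock_speedup(*AIME_WALL_CLOCK_S),
}

truncation_rate = {
    "this artifact": AIME_TRUNCATED[0] / AIME_PROBLEMS,
    "RedHat": AIME_TRUNCATED[1] / AIME_PROBLEMS,
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
for name, rate in truncation_rate.items():
    print(f"{name} truncation rate: {rate:.1%}")
```

Keeping the two speedup functions separate makes the direction of each metric explicit: throughput divides ours by the baseline, while wall-clock divides the baseline by ours.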
64
 
65
  ## MTP draft acceptance per workload
66
 
 
81
 
82
  ## Quick start
83
 
84
+ See [`docs/QUICKSTART.md`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/docs/QUICKSTART.md) in the source repo for the full build recipe, or use the one-line installer:
85
 
86
  ```bash
87
  curl -sL https://raw.githubusercontent.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/main/scripts/install_vllm_with_patches.sh | bash
 
126
  | ignored | `lm_head`, `embed_tokens`, norms, `ffn.gate`, `ffn.shared_experts`, attn `compressor`, attn `indexer`, `attn_sink`, `hc_*` | unquantized (BF16) | n/a |
127
  | MTP block (`mtp.0.*`) | all 799 keys | unquantized (BF16, preserved verbatim) | n/a |
128
 
129
+ Calibration corpus: HuggingFaceH4/ultrachat_200k train_sft, **64 samples** × max_seq_len 512 × batch_size 1, seed 42. RedHat's reference recipe uses 768 samples × 512 from the same corpus.
130
+
131
+ The 64-sample recipe was used due to time/compute constraints during initial bring-up (12× less coverage than RedHat). On the benchmarks measured here, GSM8K / HumanEval / IFEval / MMLU-Pro / AIME-non-truncated all land within noise of the reference. The visible cost of the reduced coverage is AIME truncation rate: 5/30 of our responses hit the 65K max_tokens cap on long reasoning traces vs 2/30 of RedHat's, which is consistent with looser calibration scales producing less-converging reasoning trajectories. A v0.2 recipe with 768 samples is planned.
132
 
133
  ## vLLM patches required
134
 
135
+ The artifact loads on vLLM mainline + these 5 patches. They're filed upstream and waiting on review. See [`docs/VLLM_SETUP_ISSUES.md`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/docs/VLLM_SETUP_ISSUES.md) for the exact diffs.
136
 
137
  1. PR [#43248](https://github.com/vllm-project/vllm/pull/43248) — `bool()` wrap on `is_static_input_scheme`
138
  2. PR [#43288](https://github.com/vllm-project/vllm/pull/43288) — `.get("scale_fmt", "ue8m0")` on missing key + BF16 `getattr` follow-up
 
151
 
152
  ## Reproduction
153
 
154
+ Full replication recipe in [`docs/recipes/nvfp4_fp8_mtp_replication.md`](https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/docs/recipes/nvfp4_fp8_mtp_replication.md) — covers the 14 gotchas (sm_103a vs sm_100a, calibration recipe, postprocess pipeline, vLLM build flags).
155
 
156
  ## Citation
157