DFlash
A DFlash draft model trained from Qwen3-32B on an EagleChat subset (200K English + 200K Chinese samples) to accelerate speculative decoding.
This repository provides a DFlash draft model for Qwen3-32B. The draft model is intended to be used together with the target model in SpecForge, improving throughput (output tokens/sec) under standard speculative verification.
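A minimal serving sketch using SGLang's speculative-decoding flags. The algorithm name (`EAGLE`) and the draft-model path are assumptions for illustration; check your SGLang version's `--help` for the options that apply to DFlash draft models.

```shell
# Hedged sketch: serve the target model together with the DFlash draft model.
# --speculative-algorithm EAGLE is an assumption; substitute the algorithm
# your SGLang build exposes for DFlash drafts.
python -m sglang.launch_server \
  --model-path /models/Qwen3-32B \
  --speculative-draft-model-path /path/to/qwen3-32b-dflash-en-zh \
  --speculative-algorithm EAGLE \
  --tp-size 4 \
  --attention-backend fa3
```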
- Target model: Qwen/Qwen3-32B (local path: /models/Qwen3-32B)
- Draft model: sx-aicp/qwen3-32b-dflash-en-zh (or local path)
- Attention backend: fa3
- Environment: H100 (SM90), tp=4, attention=fa3, max_new_tokens=2048, drop_first_batch=true.
| Benchmark | Conc=1 | Conc=4 | Conc=32 |
|---|---|---|---|
| Math500 | 109.20 → 392.63 (3.595× / L=5.564) | 409.44 → 1351.51 (3.301× / L=5.582) | 2554.68 → 4554.81 (1.783× / L=5.588) |
| HumanEval | 108.93 → 331.66 (3.045× / L=4.769) | 407.34 → 1129.16 (2.772× / L=4.756) | 2482.40 → 3632.36 (1.463× / L=4.757) |
| MT-Bench | 109.19 → 233.75 (2.141× / L=3.791) | 409.97 → 804.64 (1.963× / L=3.852) | 2470.75 → 2767.16 (1.120× / L=3.917) |
Format: each cell shows baseline tok/s → DFlash tok/s, followed by the speedup (×) and the mean acceptance length (L).
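The speedup column can be recomputed directly from the throughput pair in each cell. A minimal check over the Math500 row of the table above:

```python
# Sanity-check the reported speedups: each cell is
# baseline tok/s -> DFlash tok/s, with speedup = dflash / baseline.
# The numbers below are copied from the Math500 row of the table.

def speedup(baseline_tps: float, dflash_tps: float) -> float:
    """Throughput speedup of DFlash speculative decoding over the baseline."""
    return dflash_tps / baseline_tps

math500 = {1: (109.20, 392.63), 4: (409.44, 1351.51), 32: (2554.68, 4554.81)}
for conc, (base, dflash) in math500.items():
    print(f"Conc={conc}: {speedup(base, dflash):.3f}x")
```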
```bash
python benchmark_sglang.py \
  --tp-size 4 \
  --target-model /models/Qwen3-32B \
  --draft-model /path/to/draft_model \
  --concurrencies 1,4,32 \
  --dataset-name math500 \
  --attention-backends fa3 \
  --output-md sglang_results.md
```
Base model
Qwen/Qwen3-32B