Ampere (sm_86) compatibility findings — two-path blocker on 2× RTX 3090
Flagging what we hit trying this draft on Ampere, since the card notes "other hardware validation pending." Both obvious paths are blocked on 2× RTX 3090 (sm_86, CUDA 12.9). Sharing in case it helps your Ampere validation pass.
### Test rig
- 2× NVIDIA RTX 3090 (Ampere sm_86), PCIe-only, no NVLink
- CUDA 12.9 (container) / 13.2 (host), Driver 595.58.03
- Target: `Intel/gemma-4-31B-it-int4-AutoRound` at TP=2 (working at 61 t/s without DFlash)
- Draft: `RedHatAI/gemma-4-31B-it-speculator.dflash` (this model)
### Path 1: vLLM nightly (v0.19.2rc1.dev21+g893611813)
Config:

```
--speculative-config '{"method":"dflash","model":".../gemma-4-31B-it-speculator-dflash","num_speculative_tokens":8}'
```
Note: `num_speculative_tokens` is NOT auto-populated from the draft's `speculators_config.proposal_methods[0].speculative_tokens=8`; `SpeculativeConfig` raises a pydantic `ValidationError` without it. Minor auto-detect gap; setting it explicitly works.
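In case it helps the auto-detect fix, here is a minimal sketch of the gap and the workaround via vLLM's offline `LLM` API, assuming the nightly accepts the same dict that `--speculative-config` takes as JSON (the `config.json` path is a local placeholder):

```python
# Sketch of the workaround (assumption: the offline LLM API mirrors the
# serve flag on this nightly).
import json

from vllm import LLM

# The draft already ships the value in its config.json...
with open("config.json") as f:  # placeholder path to the draft repo's config
    draft_cfg = json.load(f)
k = draft_cfg["speculators_config"]["proposal_methods"][0]["speculative_tokens"]  # == 8

# ...but SpeculativeConfig doesn't read it, so it has to be passed explicitly:
llm = LLM(
    model="Intel/gemma-4-31B-it-int4-AutoRound",
    tensor_parallel_size=2,
    speculative_config={
        "method": "dflash",
        "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
        "num_speculative_tokens": k,  # omit this and the ValidationError fires
    },
)
```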
The deeper issue: no Ampere-compatible attention backend supports both DFlash's non-causal block-parallel drafting AND Gemma-4's head_dim=256. Attempted backends:
| Backend | Non-causal? | head_dim=256 on sm_86? |
|---|---|---|
| FLASH_ATTN (FA2) | OK | head_size not supported |
| FLASH_ATTN_DIFFKV | (didn't reach) | head_size not supported |
| FLASHINFER | non-causal attention not supported | OK |
| TRITON_ATTN | same | OK |
| FLEX_ATTENTION | same | OK |
| TREE_ATTN | same | OK |
FA2's head_dim=256 support is Hopper-only in the standard wheel, which matches the card's "validated on H100" statement.
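For anyone re-running the matrix: the standard way to pin each backend is vLLM's `VLLM_ATTENTION_BACKEND` override, one process per backend since the choice is fixed at engine init. A sketch, with backend names spelled as in the table:

```python
# Force a single attention backend before vLLM spins up the engine.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # or FLASH_ATTN, TRITON_ATTN, FLEX_ATTENTION, TREE_ATTN

# ...then construct LLM(...) as in the Path 1 sketch above; with the DFlash
# drafter attached, engine init aborts with the per-backend error recorded
# in the table.
```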
### Path 2: llama.cpp PR #22105 (ruixiang63/llama.cpp dflash branch)
We built the PR locally (CUDA sm_86, clean build) and tried to convert this draft to GGUF via the PR's `convert_hf_to_gguf.py --target-model-dir`.
Hit two issues:
- `d2t`/`t2d` not handled by `DFlashModel`. The PR's EAGLE3 path stashes `d2t` as int64 and drops `t2d`; `DFlashModel` (a subclass of `Qwen3Model`) doesn't replicate that logic. A local patch that catches `d2t`/`t2d` in `DFlashModel.modify_tensors` lets conversion proceed (sketch after this list).
- Blocker: the `gguf.MODEL_ARCH.DFLASH` tensor list is missing `TOKEN_EMBD` and `OUTPUT`. It only registers `DFLASH_FC` + `DFLASH_HIDDEN_NORM` plus the transformer layers. That works for z-lab-format drafts (e.g. `z-lab/Qwen3.6-35B-A3B-DFlash`, which we converted fine) because they share the target's vocab and don't ship their own embedding/lm_head. Your speculators-format draft has its own reduced-vocab embeddings (`draft_vocab_size=32000` vs Gemma-4's full vocab) plus `embed_tokens.weight`, `lm_head.weight`, and the d2t/t2d remap. Conversion fails at `model.embed_tokens.weight` because the DFLASH arch has nowhere to register it. Fixing this upstream needs arch changes plus inference-side support for reduced-vocab drafts with d2t remap, not just a converter tweak.
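For reference, the local patch from the first bullet looks roughly like this (a sketch against the PR's `convert_hf_to_gguf.py`; the returned tensor name is an assumption, substitute whatever the PR's EAGLE3 path registers for `d2t`):

```python
import torch

class DFlashModel(Qwen3Model):  # subclass as defined in the PR
    def modify_tensors(self, data_torch, name, bid):
        if name == "t2d":
            return []  # drop t2d, mirroring the PR's EAGLE3 handling
        if name == "d2t":
            # keep d2t, stored as int64 like the EAGLE3 path does
            return [(name, data_torch.to(torch.int64))]
        return super().modify_tensors(data_torch, name, bid)
```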
### What would unblock Ampere
- vLLM path: non-causal support added to FLASHINFER or FLEX_ATTENTION on Ampere, OR head_dim=256 enabled in FA2/FA3 for sm_86
- llama.cpp path: the DFLASH arch in `gguf.MODEL_ARCH` gains `TOKEN_EMBD` / `OUTPUT` registrations, plus inference code to use the draft's own embeddings with d2t remap (upstream in PR #22105 or a follow-up); a sketch of the gguf-py half follows below
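Concretely, the gguf-py half of the llama.cpp fix could look like this (a sketch against `gguf/constants.py` in the PR's tree; `DFLASH_FC` / `DFLASH_HIDDEN_NORM` are the PR's existing entries, the first two lines are the missing registrations):

```python
MODEL_TENSORS[MODEL_ARCH.DFLASH] = [
    MODEL_TENSOR.TOKEN_EMBD,          # draft's own reduced-vocab embed_tokens.weight
    MODEL_TENSOR.OUTPUT,              # draft's own lm_head.weight
    MODEL_TENSOR.DFLASH_FC,           # already registered by the PR
    MODEL_TENSOR.DFLASH_HIDDEN_NORM,  # already registered by the PR
    # ...plus the per-layer transformer tensors the PR already lists...
]
```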
### References
- llama.cpp DFlash PR: https://github.com/ggml-org/llama.cpp/pull/22105
- vLLM DFlash test (speculators auto-detect pattern): https://github.com/vllm-project/vllm/blob/main/tests/v1/spec_decode/test_speculators_dflash.py
- Our vLLM Marlin pad-on-load PR (unrelated but from the same test session): https://github.com/vllm-project/vllm/pull/40361
Happy to share logs / repro scripts if useful. Great work on the draft — looking forward to Ampere support so we can actually bench it.