Ampere (sm_86) compatibility findings — two-path blocker on 2× RTX 3090
Flagging what we hit trying this draft on Ampere, since the card notes "other hardware validation pending." Both obvious paths are blocked on 2× RTX 3090 (sm_86, CUDA 12.9). Sharing in case it helps your Ampere validation pass.
### Test rig
- 2× NVIDIA RTX 3090 (Ampere sm_86), PCIe-only, no NVLink
- CUDA 12.9 (container) / 13.2 (host), Driver 595.58.03
- Target: `Intel/gemma-4-31B-it-int4-AutoRound` at TP=2 (working at 61 t/s without DFlash)
- Draft: `RedHatAI/gemma-4-31B-it-speculator.dflash` (this model)
### Path 1: vLLM nightly (v0.19.2rc1.dev21+g893611813)
Config:

```
--speculative-config '{"method":"dflash","model":".../gemma-4-31B-it-speculator-dflash","num_speculative_tokens":8}'
```
Note: `num_speculative_tokens` is NOT auto-populated from the draft's `speculators_config.proposal_methods[0].speculative_tokens=8`; `SpeculativeConfig` raises a pydantic `ValidationError` without it. Minor auto-detect gap; setting it explicitly works.
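In case it helps the auto-detect fix, here is a minimal sketch of the gap and the workaround via vLLM's offline `LLM` API, assuming the nightly accepts the same dict that `--speculative-config` takes as JSON (the `config.json` path is a local placeholder):

```python
# Sketch of the workaround (assumption: the offline LLM API mirrors the
# serve flag on this nightly).
import json

from vllm import LLM

# The draft already ships the value in its config.json...
with open("config.json") as f:  # placeholder path to the draft repo's config
    draft_cfg = json.load(f)
k = draft_cfg["speculators_config"]["proposal_methods"][0]["speculative_tokens"]  # == 8

# ...but SpeculativeConfig doesn't read it, so it has to be passed explicitly:
llm = LLM(
    model="Intel/gemma-4-31B-it-int4-AutoRound",
    tensor_parallel_size=2,
    speculative_config={
        "method": "dflash",
        "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
        "num_speculative_tokens": k,  # omit this and the ValidationError fires
    },
)
```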
The deeper issue: no Ampere-compatible attention backend supports both DFlash's non-causal block-parallel drafting AND Gemma-4's head_dim=256. Attempted backends:
| Backend | Non-causal? | head_dim=256 on sm_86? |
|---|---|---|
| FLASH_ATTN (FA2) | OK | head_size not supported |
| FLASH_ATTN_DIFFKV | (didn't reach) | head_size not supported |
| FLASHINFER | non-causal attention not supported | OK |
| TRITON_ATTN | same | OK |
| FLEX_ATTENTION | same | OK |
| TREE_ATTN | same | OK |
FA2's head_dim=256 support is Hopper-only in the standard wheel, which matches the card's "validated on H100" statement.
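For anyone re-running the matrix: the standard way to pin each backend is vLLM's `VLLM_ATTENTION_BACKEND` override, one process per backend since the choice is fixed at engine init. A sketch, with backend names spelled as in the table:

```python
# Force a single attention backend before vLLM spins up the engine.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # or FLASH_ATTN, TRITON_ATTN, FLEX_ATTENTION, TREE_ATTN

# ...then construct LLM(...) as in the Path 1 sketch above; with the DFlash
# drafter attached, engine init aborts with the per-backend error recorded
# in the table.
```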
### Path 2: llama.cpp PR #22105 (ruixiang63/llama.cpp dflash branch)
We built the PR locally (CUDA sm_86, clean build) and tried to convert this draft to GGUF via the PR's `convert_hf_to_gguf.py --target-model-dir`.
Hit two issues:
- `d2t`/`t2d` not handled by `DFlashModel`. The PR's EAGLE3 path stashes `d2t` as int64 and drops `t2d`; `DFlashModel` (a subclass of `Qwen3Model`) doesn't replicate that logic. A local patch that catches `d2t`/`t2d` in `DFlashModel.modify_tensors` lets conversion proceed (sketch after this list).
- Blocker: the `gguf.MODEL_ARCH.DFLASH` tensor list is missing `TOKEN_EMBD` and `OUTPUT`. It only registers `DFLASH_FC` + `DFLASH_HIDDEN_NORM` plus the transformer layers. That works for z-lab-format drafts (e.g. `z-lab/Qwen3.6-35B-A3B-DFlash`, which we converted fine) because they share the target's vocab and don't ship their own embedding/lm_head. Your speculators-format draft has its own reduced-vocab embeddings (`draft_vocab_size=32000` vs Gemma-4's full vocab) plus `embed_tokens.weight`, `lm_head.weight`, and the d2t/t2d remap. Conversion fails at `model.embed_tokens.weight` because the DFLASH arch has nowhere to register it. Fixing this upstream needs arch changes plus inference-side support for reduced-vocab drafts with d2t remap, not just a converter tweak.
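For reference, the local patch from the first bullet looks roughly like this (a sketch against the PR's `convert_hf_to_gguf.py`; the returned tensor name is an assumption, substitute whatever the PR's EAGLE3 path registers for `d2t`):

```python
import torch

class DFlashModel(Qwen3Model):  # subclass as defined in the PR
    def modify_tensors(self, data_torch, name, bid):
        if name == "t2d":
            return []  # drop t2d, mirroring the PR's EAGLE3 handling
        if name == "d2t":
            # keep d2t, stored as int64 like the EAGLE3 path does
            return [(name, data_torch.to(torch.int64))]
        return super().modify_tensors(data_torch, name, bid)
```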
### What would unblock Ampere
- vLLM path: non-causal support added to FLASHINFER or FLEX_ATTENTION on Ampere, OR head_dim=256 enabled in FA2/FA3 for sm_86
- llama.cpp path: the DFLASH arch in `gguf.MODEL_ARCH` gains `TOKEN_EMBD` / `OUTPUT` registrations, plus inference code to use the draft's own embeddings with d2t remap (upstream in PR #22105 or a follow-up); a sketch of the gguf-py half follows below
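Concretely, the gguf-py half of the llama.cpp fix could look like this (a sketch against `gguf/constants.py` in the PR's tree; `DFLASH_FC` / `DFLASH_HIDDEN_NORM` are the PR's existing entries, the first two lines are the missing registrations):

```python
MODEL_TENSORS[MODEL_ARCH.DFLASH] = [
    MODEL_TENSOR.TOKEN_EMBD,          # draft's own reduced-vocab embed_tokens.weight
    MODEL_TENSOR.OUTPUT,              # draft's own lm_head.weight
    MODEL_TENSOR.DFLASH_FC,           # already registered by the PR
    MODEL_TENSOR.DFLASH_HIDDEN_NORM,  # already registered by the PR
    # ...plus the per-layer transformer tensors the PR already lists...
]
```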
### References
- llama.cpp DFlash PR: https://github.com/ggml-org/llama.cpp/pull/22105
- vLLM DFlash test (speculators auto-detect pattern): https://github.com/vllm-project/vllm/blob/main/tests/v1/spec_decode/test_speculators_dflash.py
- Our vLLM Marlin pad-on-load PR (unrelated but from the same test session): https://github.com/vllm-project/vllm/pull/40361
Happy to share logs / repro scripts if useful. Great work on the draft — looking forward to Ampere support so we can actually bench it.