Ampere (sm_86) compatibility findings — two-path blocker on 2× RTX 3090

by wasifb

Flagging what we hit trying this draft on Ampere, since the card notes "other hardware validation pending." Both obvious paths are blocked on 2× RTX 3090 (sm_86, CUDA 12.9). Sharing in case it helps your Ampere validation pass.

Test rig

  • 2× NVIDIA RTX 3090 (Ampere sm_86), PCIe-only, no NVLink
  • CUDA 12.9 (container) / 13.2 (host), Driver 595.58.03
  • Target: Intel/gemma-4-31B-it-int4-AutoRound at TP=2 (working at 61 t/s without DFlash)
  • Draft: RedHatAI/gemma-4-31B-it-speculator.dflash (this model)

Path 1: vLLM nightly (v0.19.2rc1.dev21+g893611813)

Config:

--speculative-config '{"method":"dflash","model":".../gemma-4-31B-it-speculator-dflash","num_speculative_tokens":8}'

Note: num_speculative_tokens is NOT auto-populated from the draft's speculators_config.proposal_methods[0].speculative_tokens=8; without it, SpeculativeConfig raises a pydantic ValidationError. Minor auto-detect gap; setting it explicitly works.
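
For anyone reproducing this offline rather than through vllm serve, here is a minimal sketch using the Python entry point. It assumes vLLM's LLM() accepts the same speculative_config dict as the CLI flag; the model ids are simply the ones from the rig above.

```python
# Minimal offline sketch (assumption: LLM() takes the same speculative_config dict
# as the --speculative-config CLI flag; model ids are the ones used in this post).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Intel/gemma-4-31B-it-int4-AutoRound",
    tensor_parallel_size=2,
    speculative_config={
        "method": "dflash",
        "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
        # must be set explicitly -- not auto-read from the draft's speculators_config
        "num_speculative_tokens": 8,
    },
)
out = llm.generate(["Explain speculative decoding in one sentence."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```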

The deeper issue: no Ampere-compatible attention backend supports both DFlash's non-causal block-parallel drafting AND Gemma-4's head_dim=256. Attempted backends:

| Backend | Non-causal? | head_dim=256 on sm_86? |
|---|---|---|
| FLASH_ATTN (FA2) | OK | head_size not supported |
| FLASH_ATTN_DIFFKV | (didn't reach) | head_size not supported |
| FLASHINFER | non-causal attention not supported | OK |
| TRITON_ATTN | non-causal attention not supported | OK |
| FLEX_ATTENTION | non-causal attention not supported | OK |
| TREE_ATTN | non-causal attention not supported | OK |

FA2's head_dim=256 support is Hopper-only in the standard wheel, which matches the card's "validated on H100" statement.
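
If it helps the validation pass, the head_dim restriction can be probed outside vLLM with the standalone flash-attn wheel. This is only a rough check: vLLM links its own vllm-flash-attn build, so the standalone package may behave differently.

```python
# Quick probe of FA2's forward kernel at head_dim=256 (standalone flash-attn wheel;
# vLLM ships its own vllm-flash-attn build, so behaviour there may differ).
import torch
from flash_attn import flash_attn_func

B, S, H, D = 1, 128, 8, 256  # head_dim=256, as in Gemma-4
q, k, v = (torch.randn(B, S, H, D, dtype=torch.float16, device="cuda") for _ in range(3))
try:
    out = flash_attn_func(q, k, v, causal=False)  # non-causal, as DFlash drafting requires
    print("head_dim=256 forward OK:", tuple(out.shape))
except Exception as err:
    print("head_dim=256 rejected on this GPU:", err)
```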

Path 2: llama.cpp PR #22105 (ruixiang63/llama.cpp dflash branch)

We built the PR locally (CUDA sm_86, clean build) and tried to convert this draft to GGUF using the PR's convert_hf_to_gguf.py with its --target-model-dir option.

Hit two issues:

  1. d2t / t2d not handled by DFlashModel. The PR's EAGLE3 path stashes d2t as int64 and drops t2d; DFlashModel (a subclass of Qwen3Model) doesn't replicate that logic. A local patch that catches d2t / t2d in DFlashModel.modify_tensors lets conversion proceed (sketched just after this list).

  2. Blocker: the gguf.MODEL_ARCH.DFLASH tensor list is missing TOKEN_EMBD and OUTPUT. It only registers DFLASH_FC + DFLASH_HIDDEN_NORM plus the transformer layers. That works for z-lab-format drafts (e.g. z-lab/Qwen3.6-35B-A3B-DFlash, which we converted fine) because those share the target's vocab and don't ship their own embedding/lm_head.

    Your speculators-format draft ships its own reduced-vocab embed_tokens.weight and lm_head.weight (draft_vocab_size=32000 vs Gemma-4's full vocab) plus the d2t/t2d remap tensors. Conversion fails at model.embed_tokens.weight because the DFLASH arch has nowhere to register it. Fixing this upstream needs arch changes plus inference-side support for a reduced-vocab draft with d2t remapping, not just a converter tweak (a rough sketch of the missing registrations is below).
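
For reference, the local patch from (1) is roughly the following. Treat it as a sketch of our build: DFlashModel, gguf.MODEL_ARCH.DFLASH, and the EAGLE3-style d2t handling all come from the PR's converter, so exact names may drift as the PR evolves.

```python
# Sketch of our local patch inside the PR's convert_hf_to_gguf.py (which already
# imports torch and gguf). Mirrors the PR's EAGLE3 handling: keep d2t as an int64
# vocab-remap tensor, drop t2d.
class DFlashModel(Qwen3Model):              # Qwen3Model is defined in convert_hf_to_gguf.py
    model_arch = gguf.MODEL_ARCH.DFLASH     # arch added by the PR

    def modify_tensors(self, data_torch, name, bid):
        if name == "t2d":
            return []                       # target->draft map isn't needed at inference time
        if name == "d2t":
            # draft->target token-id remap, stored as int64 like the EAGLE3 path does
            return [("d2t", data_torch.to(torch.int64))]
        return super().modify_tensors(data_torch, name, bid)
```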

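For (2), the arch-level gap sits in gguf-py's MODEL_TENSORS table. Purely illustrative, since MODEL_ARCH.DFLASH and the DFLASH_* tensor enums exist only in the PR, but the missing registrations would look roughly like this:

```python
# gguf-py/gguf/constants.py (illustrative; the DFLASH entries exist only in the PR)
MODEL_TENSORS[MODEL_ARCH.DFLASH] = [
    MODEL_TENSOR.TOKEN_EMBD,          # the draft's own reduced-vocab embed_tokens.weight
    MODEL_TENSOR.OUTPUT,              # the draft's own lm_head.weight
    MODEL_TENSOR.DFLASH_FC,           # what the PR registers today
    MODEL_TENSOR.DFLASH_HIDDEN_NORM,
    # ... plus the per-layer attention/FFN tensors the PR already lists
]
```

Registering these only unblocks the converter; the inference side still has to read the draft's own embeddings/lm_head and apply the d2t remap when proposing tokens.
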
What would unblock Ampere

  • vLLM path: non-causal support added to FLASHINFER or FLEX_ATTENTION on Ampere, OR head_dim=256 enabled in FA2/FA3 for sm_86
  • llama.cpp path: TOKEN_EMBD / OUTPUT registrations added to the DFLASH arch in gguf.MODEL_ARCH, plus inference code that uses the draft's own embeddings with the d2t remap (upstream in PR #22105 or a follow-up)

Happy to share logs / repro scripts if useful. Great work on the draft — looking forward to Ampere support so we can actually bench it.
