Finally! Thank you! DGX SPARK results are in...
I've been looking for this every day. I've been working on a workflow for the same thing on 4x DGX Sparks: answers generated with gemma-4-31B-it-NVFP4-turbo from a filtered subset of Nemotron and UltraChat prompts, then regen-training DFlash against those with DFlash (SpecForge version) + logit distillation (ModelOpt version), plus a mix of both methods combined with some things inferred from the original arXiv paper. Z-lab has been slow to release either set of assets.
It was painstaking to get DFlash decoding working for Gemma 4 on the GB10, so much so that I think I might be the first (with your help in providing an actual `speculator.dflash`), but I finally have some numbers.
Sharing a DGX Spark / GB10 path that does work for this draft, since it took a bit (a lot) of runtime patching to get there, and the successful setup was not the obvious one from the stock launch path.
Test rig
- Single NVIDIA DGX Spark / GB10 (compute capability 12.1), single-node, TP=1
- Image: `vllm/vllm-openai:nightly`, vLLM `0.19.2rc1.dev21+g893611813`
- Verifier: `google/gemma-4-31B-it` with runtime `--quantization fp8`
- Draft: `RedHatAI/gemma-4-31B-it-speculator.dflash`
- Text-only mode
- `max_model_len=16384`, `max_num_batched_tokens=16384`, `gpu_memory_utilization=0.80`
Path 1: vLLM nightly on Spark GB10
Working serve config was effectively:
`VLLM_DISABLE_COMPILE_CACHE=1` `--quantization fp8` `--tensor-parallel-size 1` `--max-model-len 16384` `--max-num-batched-tokens 16384` `--gpu-memory-utilization 0.80` `--enforce-eager` `--limit-mm-per-prompt '{"image":0,"video":0}'` `--speculative-config '{"method":"dflash","model":"RedHatAI/gemma-4-31B-it-speculator.dflash","num_speculative_tokens":8}'`
We set `num_speculative_tokens=8` explicitly.
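Once the server is up, a quick way to confirm the verifier registered is to list models through the OpenAI-compatible API. A minimal sketch, assuming the default local endpoint on `localhost:8000` (adjust `base_url` for your launcher):

```python
# List served models; the verifier id should come back as google/gemma-4-31B-it.
# base_url / api_key are assumptions about a default local vLLM deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print([m.id for m in client.models.list()])
```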
What actually took patching
The main issue was that the working path was not “just force FlashAttention globally.”
Gemma 4 on this stack forces Triton on the verifier side because of its heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing global FLASH_ATTN on the verifier path failed for us. Leaving both verifier and draft on Triton also failed, because the DFlash draft path is non-causal and Triton’s path hit the causal-only restriction.
The working setup was a split backend:
- verifier on Triton
- draft attention in `qwen3_dflash` forced onto FlashAttention, for the draft path only
We also had to patch the nightly wheel in three places (a sketch of the shape of these patches follows the list):
- disable the prebuilt CUTLASS FP8 capability checks for GB10 so vLLM falls back to supported kernels
- advertise non-causal support in the Triton backend selector so DFlash can initialize
- force `FlashAttentionBackend` only inside `qwen3_dflash`
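For the shape of those three patches, here is a minimal monkeypatch sketch. Every vLLM-internal module path and attribute name below is an assumption (nightlies move these around); our real patch script greps the installed wheel and edits the matching sites, but this shows the idea:

```python
# Hedged sketch of the three runtime patches, applied before the engine is
# constructed (e.g. from sitecustomize.py inside the container). Every
# vLLM-internal module path and attribute name here is an ASSUMPTION about
# the nightly's layout; the real patch greps the wheel for the actual sites.
import importlib


def _patch(module_name: str, attr: str, value) -> None:
    """Best-effort monkeypatch; skips quietly if the nightly moved the symbol."""
    try:
        setattr(importlib.import_module(module_name), attr, value)
    except ImportError as exc:
        print(f"[dflash-patch] skipped {module_name}.{attr}: {exc}")


# 1) GB10 (SM 12.1) is not on the prebuilt CUTLASS FP8 allow-list, so stub the
#    capability check and let vLLM fall back to supported FP8 kernels.
#    (Assumed symbol location.)
_patch("vllm.model_executor.layers.quantization.utils.w8a8_utils",
       "cutlass_fp8_supported", lambda *args, **kwargs: False)

# 2) The Triton backend selector rejects non-causal attention, which the
#    DFlash draft needs; advertise support so DFlash can initialize.
#    (Hypothetical flag name.)
_patch("vllm.v1.attention.backends.triton_attn", "SUPPORTS_NON_CAUSAL", True)

# 3) Pin the draft model's attention to FlashAttention while the Gemma 4
#    verifier stays on Triton. (Hypothetical override attribute.)
_patch("vllm.model_executor.models.qwen3_dflash",
       "ATTN_BACKEND_OVERRIDE", "FLASH_ATTN")
```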
Results
The serve path is stable once patched: `/v1/models` came up cleanly and normal chat-completions requests worked.
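For reference, a plain chat-completions request against the patched server looks like this (endpoint and prompt are illustrative, not the exact smoke test):

```python
# Plain chat-completions request through the OpenAI-compatible API.
# base_url / api_key / prompt are illustrative for a local deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": "Give me three facts about GPUs."}],
    temperature=0,
    max_tokens=128,
)
print(resp.choices[0].message.content)
```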
For benchmarking we used:
- `vllm bench serve`
- dataset: `philschmid/mt-bench`
- `num-prompts=80`, `max-concurrency=1`, `hf-output-len=2048`, `temperature=0`
Against a plain non-DFlash baseline on the same verifier and the same harness, live server metrics over the benchmark window were:
- baseline generation throughput: `5.53 tok/s` average
- DFlash generation throughput: `15.44 tok/s` average
- uplift: about `2.8x`
- DFlash acceptance rate: `28.8%` average over the same window
- observed DFlash generation range in that run: `9.9` to `28.1 tok/s`
- observed acceptance range: `15.1%` to `62.2%`
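The uplift figure is just the ratio of the two averages reported above:

```python
# Sanity check on the reported uplift, using the averages from this run.
baseline_tps = 5.53   # baseline generation throughput (tok/s)
dflash_tps = 15.44    # DFlash generation throughput (tok/s)
print(f"uplift: {dflash_tps / baseline_tps:.2f}x")  # 2.79x, i.e. about 2.8x
```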
TL;DR: this draft does run on Spark GB10 with the stock Google verifier, but today it is not a zero-patch path in vLLM nightly. The key was splitting verifier and draft attention backends instead of trying to run the whole stack on one backend.
This report is specifically for google/gemma-4-31B-it as verifier. We have not validated the NVFP4 turbo verifier path with this draft yet.
Happy to share the exact patch script / launchers / logs if useful.