Finally! Thank you! DGX SPARK results are in...
I've been looking for this every day. I've been working on a workflow for the same thing on 4x DGX Sparks: answers generated with gemma-4-31B-it-NVFP4-turbo from a filtered subset of Nemotron and UltraChat prompts, then regen-training DFlash against those with DFlash (SpecForge version) + logit distillation (ModelOpt version), plus a mix of both methods combined with some things inferred from the original arXiv paper. Z-lab has been slow to release either set of assets.
It was painstaking to get DFlash decoding working for Gemma 4 on the GB10, so much so that I think I might be the first (with your help in providing an actual `speculator.dflash`), but I finally have some numbers.
Sharing a DGX Spark / GB10 path that does work for this draft, since it took a bit (a lot) of runtime patching to get there, and the successful setup was not the obvious one from the stock launch path.
Test rig
- Single NVIDIA DGX Spark / GB10 (compute capability 12.1), single-node, TP=1
- Image: `vllm/vllm-openai:nightly`, vLLM `0.19.2rc1.dev21+g893611813`
- Verifier: `google/gemma-4-31B-it` with runtime `--quantization fp8`
- Draft: `RedHatAI/gemma-4-31B-it-speculator.dflash`
- Text-only mode
- `max_model_len=16384`, `max_num_batched_tokens=16384`, `gpu_memory_utilization=0.80`
Path 1: vLLM nightly on Spark GB10
Working serve config was effectively:
`VLLM_DISABLE_COMPILE_CACHE=1` `--quantization fp8` `--tensor-parallel-size 1` `--max-model-len 16384` `--max-num-batched-tokens 16384` `--gpu-memory-utilization 0.80` `--enforce-eager` `--limit-mm-per-prompt '{"image":0,"video":0}'` `--speculative-config '{"method":"dflash","model":"RedHatAI/gemma-4-31B-it-speculator.dflash","num_speculative_tokens":8}'`
We set `num_speculative_tokens=8` explicitly.
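Once the server is up, a quick way to confirm the verifier registered is to list models through the OpenAI-compatible API. A minimal sketch, assuming the default local endpoint on `localhost:8000` (adjust `base_url` for your launcher):

```python
# List served models; the verifier id should come back as google/gemma-4-31B-it.
# base_url / api_key are assumptions about a default local vLLM deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print([m.id for m in client.models.list()])
```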
What actually took patching
The main issue was that the working path was not “just force FlashAttention globally.”
Gemma 4 on this stack forces Triton on the verifier side because of its heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing global FLASH_ATTN on the verifier path failed for us. Leaving both verifier and draft on Triton also failed, because the DFlash draft path is non-causal and Triton’s path hit the causal-only restriction.
The working setup was a split backend:
- verifier on Triton
- draft attention in `qwen3_dflash` forced onto FlashAttention, for the draft path only
We also had to patch the nightly wheel in three places (a sketch of the shape of these patches follows the list):
- disable the prebuilt CUTLASS FP8 capability checks for GB10 so vLLM falls back to supported kernels
- advertise non-causal support in the Triton backend selector so DFlash can initialize
- force `FlashAttentionBackend` only inside `qwen3_dflash`
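For the shape of those three patches, here is a minimal monkeypatch sketch. Every vLLM-internal module path and attribute name below is an assumption (nightlies move these around); our real patch script greps the installed wheel and edits the matching sites, but this shows the idea:

```python
# Hedged sketch of the three runtime patches, applied before the engine is
# constructed (e.g. from sitecustomize.py inside the container). Every
# vLLM-internal module path and attribute name here is an ASSUMPTION about
# the nightly's layout; the real patch greps the wheel for the actual sites.
import importlib


def _patch(module_name: str, attr: str, value) -> None:
    """Best-effort monkeypatch; skips quietly if the nightly moved the symbol."""
    try:
        setattr(importlib.import_module(module_name), attr, value)
    except ImportError as exc:
        print(f"[dflash-patch] skipped {module_name}.{attr}: {exc}")


# 1) GB10 (SM 12.1) is not on the prebuilt CUTLASS FP8 allow-list, so stub the
#    capability check and let vLLM fall back to supported FP8 kernels.
#    (Assumed symbol location.)
_patch("vllm.model_executor.layers.quantization.utils.w8a8_utils",
       "cutlass_fp8_supported", lambda *args, **kwargs: False)

# 2) The Triton backend selector rejects non-causal attention, which the
#    DFlash draft needs; advertise support so DFlash can initialize.
#    (Hypothetical flag name.)
_patch("vllm.v1.attention.backends.triton_attn", "SUPPORTS_NON_CAUSAL", True)

# 3) Pin the draft model's attention to FlashAttention while the Gemma 4
#    verifier stays on Triton. (Hypothetical override attribute.)
_patch("vllm.model_executor.models.qwen3_dflash",
       "ATTN_BACKEND_OVERRIDE", "FLASH_ATTN")
```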
Results
The serve path is stable once patched: `/v1/models` came up cleanly and normal chat-completions requests worked.
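For reference, a plain chat-completions request against the patched server looks like this (endpoint and prompt are illustrative, not the exact smoke test):

```python
# Plain chat-completions request through the OpenAI-compatible API.
# base_url / api_key / prompt are illustrative for a local deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": "Give me three facts about GPUs."}],
    temperature=0,
    max_tokens=128,
)
print(resp.choices[0].message.content)
```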
For benchmarking we used:
- `vllm bench serve`
- dataset: `philschmid/mt-bench`
- `num-prompts=80`, `max-concurrency=1`, `hf-output-len=2048`, `temperature=0`
Against a plain non-DFlash baseline on the same verifier and the same harness, live server metrics over the benchmark window were:
- baseline generation throughput: `5.53 tok/s` average
- DFlash generation throughput: `15.44 tok/s` average
- uplift: about `2.8x`
- DFlash acceptance rate: `28.8%` average over the same window
- observed DFlash generation range in that run: `9.9` to `28.1 tok/s`
- observed acceptance range: `15.1%` to `62.2%`
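The uplift figure is just the ratio of the two averages reported above:

```python
# Sanity check on the reported uplift, using the averages from this run.
baseline_tps = 5.53   # baseline generation throughput (tok/s)
dflash_tps = 15.44    # DFlash generation throughput (tok/s)
print(f"uplift: {dflash_tps / baseline_tps:.2f}x")  # 2.79x, i.e. about 2.8x
```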
TL;DR: this draft does run on Spark GB10 with the stock Google verifier, but today it is not a zero-patch path in vLLM nightly. The key was splitting verifier and draft attention backends instead of trying to run the whole stack on one backend.
This report is specifically for google/gemma-4-31B-it as verifier. We have not validated the NVFP4 turbo verifier path with this draft yet.
Happy to share the exact patch script / launchers / logs if useful.