gpt-oss-120b with DFlash speculative decoding doesn't work on RTX PRO 6000 (Blackwell)
Environment:
Driver Version: 595.45.04
CUDA Version: 13.2
flash_attn 2.8.4
triton 3.6.0
flashinfer-cubin 0.6.7
flashinfer-python 0.6.7
vllm 0.19.1rc1.dev110+gb55d830ec
Error:
(EngineCore pid=1963401) ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=64, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=False, has_sink=True, use_sparse=False, use_mm_prefix=False, use_per_head_quant_scales=False, attn_type=AttentionType.DECODER, use_non_causal=True). Reasons: {FLASH_ATTN: [attention sinks not supported], FLASHINFER: [attention sinks not supported, non-causal attention not supported], TRITON_ATTN: [non-causal attention not supported], FLEX_ATTENTION: [attention sinks not supported, non-causal attention not supported]}.

[rank0]:[W408 18:40:02.418010393 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
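For anyone parsing the error: the selector rejects every candidate backend because this config needs both attention sinks (has_sink=True) and non-causal attention (use_non_causal=True), and no listed backend supports both. A minimal sketch of that selection logic, with illustrative names only (not vLLM's actual classes or capability flags):

```python
# Hypothetical reconstruction of the backend-selection behavior in the error.
# Each backend advertises a capability set; the selector picks the first one
# covering all required features, otherwise reports per-backend reasons.

REQUIRED = {"attention_sinks", "non_causal"}  # from has_sink=True, use_non_causal=True

# Capability sets inferred from the "Reasons" dict in the traceback.
BACKENDS = {
    "FLASH_ATTN":     {"non_causal"},        # rejected: attention sinks not supported
    "FLASHINFER":     set(),                 # rejected: sinks and non-causal unsupported
    "TRITON_ATTN":    {"attention_sinks"},   # rejected: non-causal not supported
    "FLEX_ATTENTION": set(),                 # rejected: sinks and non-causal unsupported
}

def select_backend(required, backends):
    """Return (chosen_backend, rejection_reasons)."""
    reasons = {}
    for name, caps in backends.items():
        missing = required - caps
        if not missing:
            return name, reasons
        reasons[name] = sorted(missing)
    return None, reasons  # None -> "No valid attention backend found"

chosen, reasons = select_backend(REQUIRED, BACKENDS)
print(chosen)
print(reasons)
```

So the failure is a feature-support gap, not a Blackwell/driver problem per se: every backend is missing at least one of the two required features for this model + DFlash config.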
start command:
vllm serve openai/gpt-oss-120b \
  --speculative-config '{"method": "dflash", "model": "z-lab/gpt-oss-120b-DFlash", "num_speculative_tokens": 9}' \
  --max-num-batched-tokens 32768