Crash on first request on RTX Pro 6000 x8

#3
by koushd - opened

@lukealonso
Can you share your environment details? Which sglang version are you using - a Docker image or a git checkout? I'm getting this crash immediately with either. CUDA 12.9.

[2026-02-21 06:36:32] INFO: 192.168.2.124:50021 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-02-21 06:36:32 TP0] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 4.86, cuda graph: False
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
[2026-02-21 06:36:32 TP4] Scheduler hit an exception: Traceback (most recent call last):
File "/mnt/storage2/venv/sglang/python/sglang/srt/managers/scheduler.py", line 3169, in run_scheduler_process
scheduler.event_loop_overlap()
File "/mnt/storage2/venv/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/storage2/venv/sglang/python/sglang/srt/managers/scheduler.py", line 1173, in event_loop_overlap
pop_and_process()
File "/mnt/storage2/venv/sglang/python/sglang/srt/managers/scheduler.py", line 1144, in pop_and_process
self.process_batch_result(tmp_batch, tmp_result)
File "/mnt/storage2/venv/sglang/python/sglang/srt/managers/scheduler.py", line 2453, in process_batch_result
self.process_batch_result_decode(batch, result)
File "/mnt/storage2/venv/sglang/python/sglang/srt/managers/scheduler_output_processor_mixin.py", line 423, in process_batch_result_decode
result.copy_done.synchronize()
File "/mnt/storage2/venv/.venv/lib/python3.12/site-packages/torch/cuda/streams.py", line 231, in synchronize
super().synchronize()
torch.AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The same thing is happening here as well.

I think this is an sglang bug. I have a workaround patch I use locally. You can also set temperature=0 as another (suboptimal) workaround.
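The patch itself isn't described above, but the assert comes from torch.multinomial rejecting a probability tensor that contains NaN, inf, or negative entries. A minimal sketch of the kind of guard such a patch might apply (sanitizing the sampling probabilities before multinomial) could look like this; `sanitize_probs` is a hypothetical name for illustration, not the actual sglang fix:

```python
# Hypothetical guard against the "probability tensor contains either inf,
# nan or element < 0" device-side assert raised by torch.multinomial.
# This is a sketch, not the patch referenced in the thread.
import torch

def sanitize_probs(probs: torch.Tensor) -> torch.Tensor:
    # Replace NaN/inf entries with 0 and clamp negatives, so the
    # multinomial assert cannot fire.
    probs = torch.nan_to_num(probs, nan=0.0, posinf=0.0, neginf=0.0)
    probs = probs.clamp_(min=0.0)
    # If a row became all-zero, fall back to a uniform distribution
    # so sampling still produces a valid token.
    row_sums = probs.sum(dim=-1, keepdim=True)
    uniform = torch.full_like(probs, 1.0 / probs.shape[-1])
    return torch.where(row_sums > 0, probs / row_sums, uniform)

bad = torch.tensor([[0.5, float("nan"), -0.1, float("inf")]])
good = sanitize_probs(bad)
next_token = torch.multinomial(good, num_samples=1)
```

The temperature=0 workaround sidesteps the same code path because greedy decoding uses argmax instead of multinomial sampling, so the probability check never runs.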

Hey, thanks for looking into this @lukealonso! Would love to know what your local patch actually touches, if you're able to share. Even a rough description would help.

If it is the same problem I had with 6x RTX Pro 6000, disabling KV cache quantization helps (--kv-cache-dtype bf16).

I opened an issue recently about this same problem: https://github.com/sgl-project/sglang/issues/18954

The --kv-cache-dtype bf16 argument allowed me to run it in sglang; however, the model seemed particularly brain-damaged by this quantization compared to Kimi K2.5.

It may be worth having NVIDIA Model Optimizer do a quant at around 4.6-6.0 so it preserves the sensitive layers as fp8, which would hopefully prevent this.

koushd changed discussion status to closed
