DGX Spark error: torch.AcceleratorError: CUDA error: an illegal memory access was encountered...

by wangweiweihw - opened 9 days ago

(Worker_TP0 pid=188) INFO 05-15 02:01:16 [deepseek_v4_mhc_warmup.py:224] Warming up DeepSeek V4 mHC TileLang kernels for token sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
(Worker_TP0 pid=188) 2026-05-15 02:01:22 [TileLang:tilelang.jit.kernel:INFO] (kernel.py:133): TileLang begins to compile kernel mhc_pre_big_fuse_tilelang with out_idx=None
(Worker_TP0 pid=188) 2026-05-15 02:01:27 [TileLang:tilelang.jit.kernel:INFO] (kernel.py:141): TileLang completes to compile kernel mhc_pre_big_fuse_tilelang
(Worker_TP0 pid=188) 2026-05-15 02:01:27 [TileLang:tilelang.jit.kernel:INFO] (kernel.py:133): TileLang begins to compile kernel mhc_pre_big_fuse_tilelang with out_idx=None
(Worker_TP0 pid=188) 2026-05-15 02:01:30 [TileLang:tilelang.jit.kernel:INFO] (kernel.py:141): TileLang completes to compile kernel mhc_pre_big_fuse_tilelang
(Worker_TP0 pid=188) 2026-05-15 02:01:31 [TileLang:tilelang.jit.kernel:INFO] (kernel.py:133): TileLang begins to compile kernel mhc_pre_big_fuse_tilelang with out_idx=None
(Worker_TP0 pid=188) 2026-05-15 02:01:34 [TileLang:tilelang.jit.kernel:INFO] (kernel.py:141): TileLang completes to compile kernel mhc_pre_big_fuse_tilelang
(Worker_TP0 pid=188) 2026-05-15 02:01:34 [TileLang:tilelang.jit.kernel:INFO] (kernel.py:133): TileLang begins to compile kernel mhc_pre_big_fuse_tilelang with out_idx=None
(Worker_TP0 pid=188) 2026-05-15 02:01:37 [TileLang:tilelang.jit.kernel:INFO] (kernel.py:141): TileLang completes to compile kernel mhc_pre_big_fuse_tilelang
(Worker_TP0 pid=188) INFO 05-15 02:01:41 [deepseek_v4_mhc_warmup.py:233] DeepSeek V4 mHC TileLang warmup finished in 24.61 seconds.
(Worker_TP0 pid=188) INFO 05-15 02:01:41 [kernel_warmup.py:179] Warming up DeepSeek V4 sparse MLA attention for mixed tokens=16 and prefill tokens=1024.
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] WorkerProc hit an exception.
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] Traceback (most recent call last):
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] output = func(*args, **kwargs)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] return func(*args, **kwargs)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 606, in compile_or_warm_up_model
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] kernel_warmup(self)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/warmup/kernel_warmup.py", line 223, in kernel_warmup
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] _deepseek_v4_sparse_mla_attention_warmup(worker)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/warmup/kernel_warmup.py", line 194, in _deepseek_v4_sparse_mla_attention_warmup
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] runner._dummy_run(
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] return func(*args, **kwargs)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5698, in _dummy_run
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] outputs = self.model(
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] return self.runnable(*args, **kwargs)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] return forward_call(*args, **kwargs)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py", line 1641, in forward
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] hidden_states = self.model(
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 520, in __call__
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] return self.aot_compiled_fn(self, *args, **kwargs)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] return self.fn(*args, **kwargs)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py", line 1344, in forward
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] def forward(
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 215, in __call__
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] return self.optimized_call(*args, **kwargs)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "", line 358, in execution_fn
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "", line 5, in __vllm_inlined_submods__1
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1269, in __call__
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] return self._op(*args, **kwargs)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/deepseek_v4_attention.py", line 678, in deepseek_v4_attention
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] self.attention_impl(hidden_states, positions, out)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/deepseek_v4_attention.py", line 631, in attention_impl
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] self.mla_attn(q, kv, positions, output=out)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] return forward_call(*args, **kwargs)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/deepseek_v4_attention.py", line 1415, in forward
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] self._forward_prefill(
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/deepseek_v4_attention.py", line 1700, in _forward_prefill
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] dequantize_and_gather_k_cache(
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/deepseek_v4_ops/cache_utils.py", line 329, in dequantize_and_gather_k_cache
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] _dequantize_and_gather_k_kernel[(num_reqs, NUM_WORKERS)](
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 370, in <lambda>
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 744, in run
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/driver.py", line 713, in __call__
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, self.launch_pdl,
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962] RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
(Worker_TP0 pid=188) ERROR 05-15 02:01:45 [multiproc_executor.py:962]
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] EngineCore failed to start.
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] Traceback (most recent call last):
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] return func(*args, **kwargs)
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] super().__init__(
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] return func(*args, **kwargs)
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 283, in _initialize_kv_caches
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 124, in initialize_from_config
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] compilation_times: list[CompilationTimes] = self.collective_rpc(
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 403, in collective_rpc
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] return future if non_block else future.result()
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] ^^^^^^^^^^^^^^^
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 90, in result
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] return super().result()
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] ^^^^^^^^^^^^^^^^
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] return self.__get_result()
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] raise self._exception
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 94, in _wait_for_response
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] response = self.aggregate(self.get_response())
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 390, in get_response
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] raise RuntimeError(
(EngineCore pid=141) ERROR 05-15 02:01:45 [core.py:1136] RuntimeError: Worker failed with error 'Triton Error [CUDA]: an illegal memory access was encountered', please check the stack trace above for the root cause
(Worker_TP0 pid=188) WARNING 05-15 02:01:45 [multiproc_executor.py:884] WorkerProc was terminated
(Worker_TP0 pid=188) Process VllmWorker-0:
(Worker_TP0 pid=188) Traceback (most recent call last):
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 886, in worker_main
(Worker_TP0 pid=188) raise e
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 861, in worker_main
(Worker_TP0 pid=188) worker.worker_busy_loop()
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 948, in worker_busy_loop
(Worker_TP0 pid=188) method, args, kwargs, output_rank = self.rpc_broadcast_mq.dequeue(
(Worker_TP0 pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 755, in dequeue
(Worker_TP0 pid=188) with self.acquire_read(timeout, indefinite) as buf:
(Worker_TP0 pid=188) File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(Worker_TP0 pid=188) return next(self.gen)
(Worker_TP0 pid=188) ^^^^^^^^^^^^^^
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 674, in acquire_read
(Worker_TP0 pid=188) self._spin_condition.wait(timeout_ms=read_timeout.timeout_ms())
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 186, in wait
(Worker_TP0 pid=188) events = dict(self.poller.poll(timeout=timeout_ms))
(Worker_TP0 pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/zmq/sugar/poll.py", line 106, in poll
(Worker_TP0 pid=188) return zmq_poll(self.sockets, timeout=timeout)
(Worker_TP0 pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=188) File "zmq/backend/cython/_zmq.py", line 1680, in zmq.backend.cython._zmq.zmq_poll
(Worker_TP0 pid=188) _check_rc(rc)
(Worker_TP0 pid=188) ^^^^^^^^^^^
(Worker_TP0 pid=188) File "zmq/backend/cython/_zmq.py", line 179, in zmq.backend.cython._zmq._check_rc
(Worker_TP0 pid=188) PyErr_CheckSignals()
(Worker_TP0 pid=188) ^^^^^^^^^^^
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 808, in signal_handler
(Worker_TP0 pid=188) raise SystemExit()
(Worker_TP0 pid=188) SystemExit
(Worker_TP0 pid=188)
(Worker_TP0 pid=188) During handling of the above exception, another exception occurred:
(Worker_TP0 pid=188)
(Worker_TP0 pid=188) Traceback (most recent call last):
(Worker_TP0 pid=188) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(Worker_TP0 pid=188) self.run()
(Worker_TP0 pid=188) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(Worker_TP0 pid=188) self._target(*self._args, **self._kwargs)
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 895, in worker_main
(Worker_TP0 pid=188) worker.shutdown()
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 760, in shutdown
(Worker_TP0 pid=188) self.worker.shutdown()
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 212, in shutdown
(Worker_TP0 pid=188) self.worker.shutdown()
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 1049, in shutdown
(Worker_TP0 pid=188) model_runner.shutdown()
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6088, in shutdown
(Worker_TP0 pid=188) self._cleanup_profiling_kv_cache()
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6096, in _cleanup_profiling_kv_cache
(Worker_TP0 pid=188) torch.accelerator.synchronize()
(Worker_TP0 pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/accelerator/__init__.py", line 263, in synchronize
(Worker_TP0 pid=188) torch._C._accelerator_synchronizeDevice(device_index)
(Worker_TP0 pid=188) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(Worker_TP0 pid=188) Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker_TP0 pid=188) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker_TP0 pid=188)
[rank0]:[W515 02:01:55.582360096 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore pid=141) ERROR 05-15 02:01:59 [multiproc_executor.py:283] Worker proc VllmWorker-0 died unexpectedly, shutting down executor.
(EngineCore pid=141) Process EngineCore:
(EngineCore pid=141) Traceback (most recent call last):
(EngineCore pid=141) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=141) self.run()
(EngineCore pid=141) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=141) self._target(*self._args, **self._kwargs)
(EngineCore pid=141) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1140, in run_engine_core
(EngineCore pid=141) raise e
(EngineCore pid=141) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=141) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=141) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=141) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=141) return func(*args, **kwargs)
(EngineCore pid=141) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=141) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=141) super().__init__(
(EngineCore pid=141) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=141) kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=141) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=141) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=141) return func(*args, **kwargs)
(EngineCore pid=141) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=141) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 283, in _initialize_kv_caches
(EngineCore pid=141) self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=141) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 124, in initialize_from_config
(EngineCore pid=141) compilation_times: list[CompilationTimes] = self.collective_rpc(
(EngineCore pid=141) ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=141) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 403, in collective_rpc
(EngineCore pid=141) return future if non_block else future.result()
(EngineCore pid=141) ^^^^^^^^^^^^^^^
(EngineCore pid=141) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 90, in result
(EngineCore pid=141) return super().result()
(EngineCore pid=141) ^^^^^^^^^^^^^^^^
(EngineCore pid=141) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=141) return self.__get_result()
(EngineCore pid=141) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=141) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=141) raise self._exception
(EngineCore pid=141) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 94, in _wait_for_response
(EngineCore pid=141) response = self.aggregate(self.get_response())
(EngineCore pid=141) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=141) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 390, in get_response
(EngineCore pid=141) raise RuntimeError(
(EngineCore pid=141) RuntimeError: Worker failed with error 'Triton Error [CUDA]: an illegal memory access was encountered', please check the stack trace above for the root cause

pastapaul

Canada Quant Labs org 7 days ago

Thanks for the log. I just confirmed that jasl/vllm@ds4-sm120-experimental hasn't moved since our last validated build: the tip is still afb7041 from 2026-05-06, all six -experimental-only commits we run against are intact, and there has been no force-push. So the kernel code at cache_utils.py:329 (_dequantize_and_gather_k_kernel) is identical to what we've been serving 1M-context requests with, without issue. That makes a regression unlikely; it's almost certainly a config mismatch.
The most common cause of an illegal memory access in that kernel during sparse-MLA warmup is a missing VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4 env var. Without it, the kernel falls back to a default block size that doesn't match V4-Flash's head dim, and the gather can read out of bounds. Worth checking that one first. Our full env block is in QUICKSTART_DUAL_SPARK.md §4. If you're launching the container manually, please make sure all of those env vars are set, not just a subset.
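To make that failure mode concrete, here is a toy Python model of a tiled gather loop. It is a sketch under stated assumptions, not the actual Triton kernel in cache_utils.py: the loop walks a dimension in tiles of `block` elements, and the function names and sizes are hypothetical (they are not V4-Flash's real dimensions). It shows that when the dimension is not a multiple of the block size and loads are unmasked, the last tile touches indices past the end.

```python
import math


def blocked_gather_indices(dim: int, block: int, masked: bool) -> list[int]:
    """Return every element index a hypothetical tiled gather loop touches.

    The loop walks `dim` elements in tiles of `block`. Without masking, the
    final tile always reads a full `block` elements, even past the end.
    """
    n_tiles = math.ceil(dim / block)
    touched: list[int] = []
    for tile in range(n_tiles):
        for lane in range(block):
            idx = tile * block + lane
            if masked and idx >= dim:
                continue  # a masked load skips out-of-range lanes
            touched.append(idx)
    return touched


def out_of_bounds(dim: int, block: int, masked: bool = False) -> list[int]:
    """Indices the loop would read that lie outside [0, dim)."""
    return [i for i in blocked_gather_indices(dim, block, masked) if i >= dim]


if __name__ == "__main__":
    # Hypothetical sizes: a block that divides the dimension vs. one that doesn't.
    print(out_of_bounds(10, 5))               # no overrun
    print(out_of_bounds(10, 4))               # last tile overruns
    print(out_of_bounds(10, 4, masked=True))  # masking hides the overrun
```

On a GPU, an unmasked overrun like this reads memory the kernel does not own, which is the kind of access CUDA reports as an illegal address.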
To narrow it down further, could you share:

- The exact vllm serve ... command (or docker run ... if you're using our image)
- Your relevant env vars: docker exec vllm_node sh -c 'env | grep -E "^(VLLM|NCCL|TILELANG)_" | sort'
- Single Spark (TP=1) or dual Spark (TP=2)?
- Your build metadata: docker exec vllm_node cat /workspace/build-metadata.yaml (if you built with our bootstrap script; it gives the exact vllm SHA and build date of your image)
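If you'd rather check the environment from Python inside the container, the sketch below mirrors the `env | grep -E "^(VLLM|NCCL|TILELANG)_" | sort` command and flags the one variable discussed in this thread. The `REQUIRED` mapping is an assumption containing only VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4; the authoritative list is the env block in QUICKSTART_DUAL_SPARK.md §4, and the helper names are hypothetical.

```python
import os
import re
from typing import Mapping, Optional

# Same prefixes as the grep in the shell command.
PREFIX_RE = re.compile(r"^(VLLM|NCCL|TILELANG)_")

# Hypothetical minimum set: only the variable named in this thread.
# The full list lives in QUICKSTART_DUAL_SPARK.md section 4.
REQUIRED = {"VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE": "4"}


def relevant_env(env: Mapping[str, str]) -> list[str]:
    """Return sorted KEY=VALUE lines whose key matches PREFIX_RE."""
    return sorted(f"{k}={v}" for k, v in env.items() if PREFIX_RE.match(k))


def missing_or_wrong(
    env: Mapping[str, str], required: Mapping[str, str] = REQUIRED
) -> dict[str, Optional[str]]:
    """Map each required key to its actual value (None if unset) when it differs."""
    return {k: env.get(k) for k, want in required.items() if env.get(k) != want}


if __name__ == "__main__":
    for line in relevant_env(os.environ):
        print(line)
    for key, actual in missing_or_wrong(os.environ).items():
        print(f"check {key}: expected {REQUIRED[key]!r}, found {actual!r}")
```

Running it via `docker exec vllm_node python3 <script>` would show the same variables the shell command prints, plus any mismatch against the expected value.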

The bootstrap script on main now writes that metadata to /tmp/dsv4-spark-build-metadata-*.yaml on every successful boot (4c64828), so next time you rebuild it'll be there.

wangweiweihw

6 days ago

Thanks!
I am using the bootstrap_dsv4_spark.sh script to run it, and the environment variable VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4 is set. I haven't found any other way to solve this issue yet, so I have temporarily disabled the warm-up with -e "VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP=0". It now runs, but the maximum TPS only reaches around 13, which is much lower than expected. Moreover, when the number of input tokens reaches 128k, vLLM exits abnormally.
Some error logs:
(EngineCore pid=141) ERROR 05-17 04:52:40 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 388, in get_response
(EngineCore pid=141) ERROR 05-17 04:52:40 [core.py:1138] raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore pid=141) ERROR 05-17 04:52:40 [core.py:1138] TimeoutError: RPC call to sample_tokens timed out.
(Worker_TP0 pid=188) 2026-05-17 04:45:03 [TileLang:tilelang.jit.kernel:INFO] (kernel.py:133): TileLang begins to compile kernel mhc_pre_big_fuse_tilelang with out_idx=None
(Worker_TP0 pid=188) 2026-05-17 04:45:08 [TileLang:tilelang.jit.kernel:INFO] (kernel.py:141): TileLang completes to compile kernel mhc_pre_big_fuse_tilelang
(Worker_TP0 pid=188) INFO 05-17 04:52:40 [multiproc_executor.py:775] Parent process exited, terminating worker queues
(APIServer pid=1) ERROR 05-17 04:52:40 [async_llm.py:704] AsyncLLM output_handler failed.
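For anyone launching the container by hand rather than through bootstrap_dsv4_spark.sh, a minimal sketch of assembling the `-e KEY=VALUE` flags for `docker run` from a mapping follows. The two values come from this thread; note that VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP=0 is the reporter's workaround for skipping the warmup, not a fix for the illegal memory access. How the bootstrap script passes these variables internally is not shown here, and `docker_env_flags` is a hypothetical helper.

```python
import shlex
from typing import Mapping


def docker_env_flags(env: Mapping[str, str]) -> list[str]:
    """Turn a mapping into repeated `-e KEY=VALUE` arguments for `docker run`."""
    flags: list[str] = []
    for key, value in env.items():
        flags += ["-e", f"{key}={value}"]
    return flags


# Values taken from this thread; the warmup switch is a workaround only.
env = {
    "VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE": "4",
    "VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP": "0",
}

# shlex.join quotes each argument only if needed, so the line is safe to paste.
print(shlex.join(docker_env_flags(env)))
```

Building the flags from one mapping makes it harder to set only a subset of the documented env block by accident.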

pastapaul

Canada Quant Labs org 6 days ago

13 tps sounds about right on dual Sparks... I'm working on a full MTP quant version, which should speed things up substantially.
