Failures on DGX Spark under vLLM
docker run \
  --privileged --gpus all -it --rm --network host --ipc=host --oom-score-adj 500 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/pip:/root/.cache/pip \
  vllm-node \
  bash -lc 'python -m pip install -U --pre vllm &&
    vllm --version &&
    vllm serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
      --trust-remote-code \
      --tensor-parallel-size 1 \
      --port 8000 --host 0.0.0.0 \
      --gpu-memory-utilization 0.80 \
      --max-model-len 131072'
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
(APIServer pid=1) INFO: 127.0.0.1:59056 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 127.0.0.1:59062 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 127.0.0.1:38106 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=287) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (64). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=287) return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=287) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (64). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=287) return fn(*contiguous_args, **contiguous_kwargs)
(APIServer pid=1) INFO 03-02 22:25:47 [loggers.py:259] Engine 000: Avg prompt throughput: 1.6 tokens/s, Avg generation throughput: 7.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-02 22:25:57 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 14.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-02 22:26:07 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-02 22:26:17 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-02 22:26:27 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-02 22:26:37 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.16.1rc1.dev23+gb6d5a1729.d20260226) with config: model='Sehyo/Qwen3.5-122B-A10B-NVFP4', speculative_config=None, tokenizer='Sehyo/Qwen3.5-122B-A10B-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Sehyo/Qwen3.5-122B-A10B-NVFP4, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/root/.cache/vllm/torch_compile_cache/d52acd74ce', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': '/root/.cache/vllm/torch_compile_cache/d52acd74ce/rank_0_0/backbone', 'fast_moe_cold_start': True, 'static_all_moe_layers': ['language_model.model.layers.0.mlp.experts', 'language_model.model.layers.1.mlp.experts', 'language_model.model.layers.2.mlp.experts', 'language_model.model.layers.3.mlp.experts', 'language_model.model.layers.4.mlp.experts', 'language_model.model.layers.5.mlp.experts', 'language_model.model.layers.6.mlp.experts', 'language_model.model.layers.7.mlp.experts', 'language_model.model.layers.8.mlp.experts', 'language_model.model.layers.9.mlp.experts', 'language_model.model.layers.10.mlp.experts', 'language_model.model.layers.11.mlp.experts', 'language_model.model.layers.12.mlp.experts', 'language_model.model.layers.13.mlp.experts', 'language_model.model.layers.14.mlp.experts', 
'language_model.model.layers.15.mlp.experts', 'language_model.model.layers.16.mlp.experts', 'language_model.model.layers.17.mlp.experts', 'language_model.model.layers.18.mlp.experts', 'language_model.model.layers.19.mlp.experts', 'language_model.model.layers.20.mlp.experts', 'language_model.model.layers.21.mlp.experts', 'language_model.model.layers.22.mlp.experts', 'language_model.model.layers.23.mlp.experts', 'language_model.model.layers.24.mlp.experts', 'language_model.model.layers.25.mlp.experts', 'language_model.model.layers.26.mlp.experts', 'language_model.model.layers.27.mlp.experts', 'language_model.model.layers.28.mlp.experts', 'language_model.model.layers.29.mlp.experts', 'language_model.model.layers.30.mlp.experts', 'language_model.model.layers.31.mlp.experts', 'language_model.model.layers.32.mlp.experts', 'language_model.model.layers.33.mlp.experts', 'language_model.model.layers.34.mlp.experts', 'language_model.model.layers.35.mlp.experts', 'language_model.model.layers.36.mlp.experts', 'language_model.model.layers.37.mlp.experts', 'language_model.model.layers.38.mlp.experts', 'language_model.model.layers.39.mlp.experts', 'language_model.model.layers.40.mlp.experts', 'language_model.model.layers.41.mlp.experts', 'language_model.model.layers.42.mlp.experts', 'language_model.model.layers.43.mlp.experts', 'language_model.model.layers.44.mlp.experts', 'language_model.model.layers.45.mlp.experts', 'language_model.model.layers.46.mlp.experts', 'language_model.model.layers.47.mlp.experts']},
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-874c384bb190e7e3-bf5cf1de'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[None],num_computed_tokens=[903],num_output_tokens=[888]), num_scheduled_tokens={chatcmpl-874c384bb190e7e3-bf5cf1de: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.007648183556405397, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] Traceback (most recent call last):
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1071, in run_engine_core
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] engine_core.run_busy_loop()
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1106, in run_busy_loop
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] self._process_engine_step()
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1137, in _process_engine_step
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 493, in step_with_batch_queue
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] model_output = future.result()
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] return self.__get_result()
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] raise self._exception
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] result = self.fn(*self.args, **self.kwargs)
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 250, in get_output
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] self.async_copy_ready_event.synchronize()
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] torch.AcceleratorError: CUDA error: an illegal instruction was encountered
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] Search for `cudaErrorIllegalInstruction` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore_DP0 pid=287) ERROR 03-02 22:26:46 [core.py:1080]
(EngineCore_DP0 pid=287) Process EngineCore_DP0:
(EngineCore_DP0 pid=287) Traceback (most recent call last):
(EngineCore_DP0 pid=287)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=287)     self.run()
(EngineCore_DP0 pid=287)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=287)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=287)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore_DP0 pid=287)     raise e
(EngineCore_DP0 pid=287)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1071, in run_engine_core
(EngineCore_DP0 pid=287)     engine_core.run_busy_loop()
(EngineCore_DP0 pid=287)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1106, in run_busy_loop
(EngineCore_DP0 pid=287)     self._process_engine_step()
(EngineCore_DP0 pid=287)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1137, in _process_engine_step
(EngineCore_DP0 pid=287)     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=287)                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=287)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 493, in step_with_batch_queue
(EngineCore_DP0 pid=287)     model_output = future.result()
(EngineCore_DP0 pid=287)                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=287)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
(EngineCore_DP0 pid=287)     return self.__get_result()
(EngineCore_DP0 pid=287)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=287)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=287)     raise self._exception
(EngineCore_DP0 pid=287)   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
(EngineCore_DP0 pid=287)     result = self.fn(*self.args, **self.kwargs)
(EngineCore_DP0 pid=287)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=287)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 250, in get_output
(EngineCore_DP0 pid=287)     self.async_copy_ready_event.synchronize()
(EngineCore_DP0 pid=287) torch.AcceleratorError: CUDA error: an illegal instruction was encountered
(EngineCore_DP0 pid=287) Search for `cudaErrorIllegalInstruction` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=287) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=287) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=287) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore_DP0 pid=287)
(APIServer pid=1) ERROR 03-02 22:26:46 [async_llm.py:702] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 03-02 22:26:46 [async_llm.py:702] Traceback (most recent call last):
(APIServer pid=1) ERROR 03-02 22:26:46 [async_llm.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 658, in output_handler
(APIServer pid=1) ERROR 03-02 22:26:46 [async_llm.py:702] outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 03-02 22:26:46 [async_llm.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 03-02 22:26:46 [async_llm.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 917, in get_output_async
(APIServer pid=1) ERROR 03-02 22:26:46 [async_llm.py:702] raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 03-02 22:26:46 [async_llm.py:702] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] Error in chat completion stream generator.
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] Traceback (most recent call last):
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 714, in chat_completion_stream_generator
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] async for res in result_generator:
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 581, in generate
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] out = q.get_nowait() or await q.get()
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] ^^^^^^^^^^^^^
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 85, in get
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] raise output
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 658, in output_handler
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 917, in get_output_async
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 03-02 22:26:46 [serving.py:1380] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1) INFO: 127.0.0.1:33120 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: 127.0.0.1:33132 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: 127.0.0.1:33144 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO 03-02 22:26:47 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 12.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
[rank0]:[W302 22:26:48.485996121 CUDAGuardImpl.h:122] Warning: CUDA warning: an illegal instruction was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::AcceleratorError'
what(): CUDA error: an illegal instruction was encountered
Search for `cudaErrorIllegalInstruction` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from currentStreamCaptureStatusMayInitCtx at /opt/pytorch/pytorch/c10/cuda/CUDAGraphsC10Utils.h:71 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xd4 (0xe5bfed603a04 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: + 0x43e698 (0xe5bfed6fe698 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1bc (0xe5bfed6fe90c in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x108a498 (0xe5bfee3ea498 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x474654 (0xe5bfed5e4654 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #5: c10::TensorImpl::~TensorImpl() + 0x14 (0xe5bfed5a2244 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #6: + 0x5f42ac (0xe5c0161342ac in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #7: + 0xb8f14c (0xe5c0166cf14c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #8: VLLM::EngineCore() [0x523cb4]
frame #9: VLLM::EngineCore() [0x4f9db4]
frame #10: VLLM::EngineCore() [0x523b30]
frame #11: VLLM::EngineCore() [0x4d593c]
frame #12: VLLM::EngineCore() [0x5a645c]
frame #13: VLLM::EngineCore() [0x5a6464]
frame #14: VLLM::EngineCore() [0x5a6464]
frame #15: VLLM::EngineCore() [0x5a6464]
frame #16: VLLM::EngineCore() [0x5a6464]
frame #17: VLLM::EngineCore() [0x5a6464]
frame #18: VLLM::EngineCore() [0x5a6464]
frame #19: VLLM::EngineCore() [0x5a6464]
frame #20: VLLM::EngineCore() [0x5a6464]
frame #21: VLLM::EngineCore() [0x5a6464]
frame #22: VLLM::EngineCore() [0x5a6464]
frame #23: VLLM::EngineCore() [0x4cea98]
frame #24: VLLM::EngineCore() [0x523cb4]
frame #25: _PyObject_ClearManagedDict + 0x1a0 (0x4fbf84 in VLLM::EngineCore)
frame #26: VLLM::EngineCore() [0x526e2c]
frame #27: VLLM::EngineCore() [0x5b408c]
frame #28: VLLM::EngineCore() [0x5b36dc]
frame #29: PyGC_Collect + 0x60 (0x68c440 in VLLM::EngineCore)
frame #30: Py_FinalizeEx + 0xa0 (0x67b2c0 in VLLM::EngineCore)
frame #31: Py_Exit + 0x18 (0x67c708 in VLLM::EngineCore)
frame #32: VLLM::EngineCore() [0x6813c0]
frame #33: VLLM::EngineCore() [0x6810f4]
frame #34: PyRun_SimpleStringFlags + 0x7c (0x67f10c in VLLM::EngineCore)
frame #35: Py_RunMain + 0x390 (0x68b890 in VLLM::EngineCore)
frame #36: Py_BytesMain + 0x28 (0x68b398 in VLLM::EngineCore)
frame #37: + 0x284c4 (0xe5c0181c84c4 in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #38: __libc_start_main + 0x98 (0xe5c0181c8598 in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #39: _start + 0x30 (0x5f6bb0 in VLLM::EngineCore)
(APIServer pid=1) INFO: Shutting down
(APIServer pid=1) INFO: Waiting for application shutdown.
(APIServer pid=1) INFO: Application shutdown complete.
(APIServer pid=1) INFO: Finished server process [1]
torch.AcceleratorError: CUDA error: an illegal instruction was encountered
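The traceback's own advice applies here: rerunning with CUDA_LAUNCH_BLOCKING=1 makes the failing kernel surface at its real call site instead of at a later sync point, and --enforce-eager disables CUDA graph capture/replay so it can be ruled in or out. A hedged sketch of such a rerun (the env var and flag are standard PyTorch/vLLM options, but this is a debugging aid, not a confirmed fix):

```shell
# Debugging rerun (sketch, not a fix): CUDA_LAUNCH_BLOCKING=1 reports kernel
# errors synchronously; --enforce-eager skips CUDA graphs entirely.
docker run --privileged --gpus all -it --rm --network host --ipc=host \
  -e CUDA_LAUNCH_BLOCKING=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-node \
  bash -lc 'vllm serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
    --trust-remote-code \
    --enforce-eager \
    --port 8000 --host 0.0.0.0 \
    --gpu-memory-utilization 0.80 \
    --max-model-len 131072'
```

If the crash disappears under --enforce-eager, that points at CUDA graph capture/replay rather than the kernels themselves.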
It runs on my Spark with the container vllm/vllm-openai:qwen3_5-cu130.
Shell script:
sudo docker run -d \
--name qwen3.5-122-a3b \
--restart unless-stopped \
--gpus all \
--ipc host \
--shm-size 64gb \
-p 8000:8000 \
-v /home/ubuntu/hf_cache:/root/.cache/huggingface \
vllm/vllm-openai:qwen3_5-cu130 \
--served-model-name Sehyo/Qwen3.5-122B-A10B-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 32000 \
--gpu-memory-utilization 0.8 \
--language-model-only
Is this a custom image? vllm/vllm-openai:qwen3_5-cu130
no, this is from VLLM Dockerhub: https://hub.docker.com/r/vllm/vllm-openai/tags
Today I also tested the full range of Qwen3.5 models with the nightly image, and it also works; just pull the correct architecture (arm64). You can try: docker pull vllm/vllm-openai:nightly-aarch64
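Since the DGX Spark is an arm64 machine, one quick sanity check is that the pulled image actually matches that architecture. A sketch using the standard docker inspect Go-template syntax:

```shell
# Verify (sketch) that the pulled image is built for arm64, not amd64:
docker pull vllm/vllm-openai:nightly-aarch64
docker image inspect -f '{{.Os}}/{{.Architecture}}' vllm/vllm-openai:nightly-aarch64
# should print linux/arm64
```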
Interesting: still not playing ball for me; more `torch.AcceleratorError: CUDA error: an illegal instruction was encountered`:
Sometimes it fully loads, but performing an inference fails partway through with an illegal instruction.
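To reproduce the mid-generation failure, a deliberately long streaming request against the OpenAI-compatible endpoint works; the prompt and token budget below are arbitrary choices (the crash in the first log hit roughly 900 tokens into decoding), so treat this as a sketch:

```shell
# Long-generation reproduction (sketch): with "stream": true you can watch
# how many tokens arrive before the request dies with a 500.
curl -sS http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Sehyo/Qwen3.5-122B-A10B-NVFP4",
        "messages": [{"role": "user", "content": "Write a very long story."}],
        "max_tokens": 2048,
        "stream": true
      }'
```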
docker run \
  --privileged --gpus all -it --rm \
  --name qwen3.5-122-a3b \
  --ipc=host \
  --shm-size 64gb \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/pip:/root/.cache/pip \
  --entrypoint bash \
  vllm/vllm-openai:qwen3_5-cu130 \
-lc '
set -euo pipefail
MODEL_ROOT="/root/.cache/huggingface/hub/models--Sehyo--Qwen3.5-122B-A10B-NVFP4"
if [[ -f "${MODEL_ROOT}/refs/main" ]]; then
SNAP="$(tr -d "\r\n" < "${MODEL_ROOT}/refs/main")"
else
SNAP="$(ls -1t "${MODEL_ROOT}/snapshots" | head -n 1)"
fi
MODEL_PATH="${MODEL_ROOT}/snapshots/${SNAP}"
echo "Serving from: ${MODEL_PATH}"
exec vllm serve "${MODEL_PATH}" \
--served-model-name "Sehyo/Qwen3.5-122B-A10B-NVFP4" \
--trust-remote-code \
--host 0.0.0.0 --port 8000 \
--max-model-len 32000 \
--gpu-memory-utilization 0.8 \
--language-model-only
'
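The snapshot-resolution trick in that script (prefer the commit hash in refs/main, else fall back to the newest snapshots/ directory) can be sanity-checked on its own against a throwaway Hugging Face cache layout, without the real 122B download. A self-contained sketch:

```shell
# Build a fake HF hub cache entry and resolve its snapshot path exactly
# the way the serve script above does. The abc123 hash is made up.
MODEL_ROOT="$(mktemp -d)/models--Sehyo--Qwen3.5-122B-A10B-NVFP4"
mkdir -p "${MODEL_ROOT}/refs" "${MODEL_ROOT}/snapshots/abc123"
printf 'abc123\n' > "${MODEL_ROOT}/refs/main"
if [ -f "${MODEL_ROOT}/refs/main" ]; then
  SNAP="$(tr -d '\r\n' < "${MODEL_ROOT}/refs/main")"
else
  SNAP="$(ls -1t "${MODEL_ROOT}/snapshots" | head -n 1)"
fi
MODEL_PATH="${MODEL_ROOT}/snapshots/${SNAP}"
echo "Serving from: ${MODEL_PATH}"
```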
(EngineCore_DP0 pid=91) 127 0x4a91bc _PyObject_FastCallDictTstate + 648
(EngineCore_DP0 pid=91) [truncated]
(EngineCore_DP0 pid=91) 2026-03-03 20:13:16,717 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████| 51/51 [00:07<00:00,  7.13it/s]
Capturing CUDA graphs (decode, FULL): 0%| | 0/35 [00:56<?, ?it/s]
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] EngineCore failed to start.
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] Traceback (most recent call last):
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3023, in synchronize_input_prep
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] yield
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4797, in _dummy_run
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] attn_metadata, _ = self._build_attention_metadata(
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1887, in _build_attention_metadata
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] _build_attn_group_metadata(kv_cache_gid, attn_gid, cm)
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1825, in _build_attn_group_metadata
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] attn_metadata_i = builder.build_for_cudagraph_capture(
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/gdn_attn.py", line 428, in build_for_cudagraph_capture
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] num_decode_draft_tokens_cpu = (num_accepted_tokens - 1).cpu()
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] torch.AcceleratorError: CUDA error: an illegal instruction was encountered
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] Search for `cudaErrorIllegalInstruction` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] During handling of the above exception, another exception occurred:
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] Traceback (most recent call last):
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1019, in run_engine_core
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     return func(*args, **kwargs)
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 763, in __init__
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     super().__init__(
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     return func(*args, **kwargs)
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 275, in _initialize_kv_caches
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 118, in initialize_from_config
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     return func(*args, **kwargs)
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     return func(*args, **kwargs)
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 503, in compile_or_warm_up_model
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     return func(*args, **kwargs)
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5233, in capture_model
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     self._capture_cudagraphs(
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5333, in _capture_cudagraphs
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     dummy_run(
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     return func(*args, **kwargs)
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4777, in _dummy_run
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     with self.synchronize_input_prep():
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     self.gen.throw(value)
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3025, in synchronize_input_prep
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]     self.prepare_inputs_event.record()
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] torch.AcceleratorError: CUDA error: an illegal instruction was encountered
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] Search for `cudaErrorIllegalInstruction` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore_DP0 pid=91) ERROR 03-03 20:14:20 [core.py:1029]
(EngineCore_DP0 pid=91) Process EngineCore_DP0:
(EngineCore_DP0 pid=91) Traceback (most recent call last):
(EngineCore_DP0 pid=91) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3023, in synchronize_input_prep
(EngineCore_DP0 pid=91) yield
(EngineCore_DP0 pid=91) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4797, in _dummy_run
(EngineCore_DP0 pid=91) attn_metadata, _ = self._build_attention_metadata(
(EngineCore_DP0 pid=91) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1887, in _build_attention_metadata
(EngineCore_DP0 pid=91) _build_attn_group_metadata(kv_cache_gid, attn_gid, cm)
(EngineCore_DP0 pid=91) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1825, in _build_attn_group_metadata
(EngineCore_DP0 pid=91) attn_metadata_i = builder.build_for_cudagraph_capture(
(EngineCore_DP0 pid=91) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/gdn_attn.py", line 428, in build_for_cudagraph_capture
(EngineCore_DP0 pid=91) num_decode_draft_tokens_cpu = (num_accepted_tokens - 1).cpu()
(EngineCore_DP0 pid=91) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91) torch.AcceleratorError: CUDA error: an illegal instruction was encountered
(EngineCore_DP0 pid=91) Search for `cudaErrorIllegalInstruction' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=91) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=91) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=91) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=91)
(EngineCore_DP0 pid=91)
(EngineCore_DP0 pid=91) During handling of the above exception, another exception occurred:
(EngineCore_DP0 pid=91)
(EngineCore_DP0 pid=91) Traceback (most recent call last):
(EngineCore_DP0 pid=91)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=91)     self.run()
(EngineCore_DP0 pid=91)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=91)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1033, in run_engine_core
(EngineCore_DP0 pid=91)     raise e
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1019, in run_engine_core
(EngineCore_DP0 pid=91)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=91)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=91)     return func(*args, **kwargs)
(EngineCore_DP0 pid=91)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 763, in __init__
(EngineCore_DP0 pid=91)     super().__init__(
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore_DP0 pid=91)     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=91)                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=91)     return func(*args, **kwargs)
(EngineCore_DP0 pid=91)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 275, in _initialize_kv_caches
(EngineCore_DP0 pid=91)     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 118, in initialize_from_config
(EngineCore_DP0 pid=91)     self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=91)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=91)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=91)     return func(*args, **kwargs)
(EngineCore_DP0 pid=91)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=91)     return func(*args, **kwargs)
(EngineCore_DP0 pid=91)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 503, in compile_or_warm_up_model
(EngineCore_DP0 pid=91)     cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore_DP0 pid=91)                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=91)     return func(*args, **kwargs)
(EngineCore_DP0 pid=91)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5233, in capture_model
(EngineCore_DP0 pid=91)     self._capture_cudagraphs(
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5333, in _capture_cudagraphs
(EngineCore_DP0 pid=91)     dummy_run(
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=91)     return func(*args, **kwargs)
(EngineCore_DP0 pid=91)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4777, in _dummy_run
(EngineCore_DP0 pid=91)     with self.synchronize_input_prep():
(EngineCore_DP0 pid=91)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=91)   File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
(EngineCore_DP0 pid=91)     self.gen.throw(value)
(EngineCore_DP0 pid=91)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3025, in synchronize_input_prep
(EngineCore_DP0 pid=91)     self.prepare_inputs_event.record()
(EngineCore_DP0 pid=91) torch.AcceleratorError: CUDA error: an illegal instruction was encountered
(EngineCore_DP0 pid=91) Search for `cudaErrorIllegalInstruction' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=91) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=91) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=91) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore_DP0 pid=91)
[rank0]:[W303 20:14:21.207640519 CUDAGuardImpl.h:122] Warning: CUDA warning: an illegal instruction was encountered (function destroyEvent)
[rank0]:[W303 20:14:21.612647513 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 223, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 152, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 125, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 839, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 493, in __init__
(APIServer pid=1) with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 925, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 984, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
For me, the prompt that tends to kill it is 'Prove e=mc2'!
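The traceback dies inside CUDA graph capture (capture_model -> _capture_cudagraphs -> gdn_attn.py build_for_cudagraph_capture), so a sketch of a possible workaround is to skip graph capture entirely with vLLM's --enforce-eager flag, and to set CUDA_LAUNCH_BLOCKING=1 as the log itself suggests so the async illegal-instruction error surfaces at the offending kernel. Untested on this setup; whether eager mode avoids the bad kernel path is an assumption, and it costs per-step launch overhead:

```shell
# Rerun the same container, but disable CUDA graphs and make kernel
# launches synchronous so the failing kernel appears in the stack trace.
docker run --privileged --gpus all -it --rm --network host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-node \
  bash -lc 'CUDA_LAUNCH_BLOCKING=1 vllm serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --port 8000 --host 0.0.0.0 \
    --gpu-memory-utilization 0.80 \
    --max-model-len 131072 \
    --enforce-eager'
```

If this serves without the crash, it narrows the fault to the cudagraph-capture path of the GDN attention backend rather than the model weights themselves.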