It is fast, but after running for a while it easily hits an error that causes vLLM to stop.

#9
by james0010 - opened

Any idea about the following issue?

/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [12,1,0], thread: [185,0,0] Assertion ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [12,1,0], thread: [186,0,0] Assertion ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [12,1,0], thread: [187,0,0] Assertion ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [12,1,0], thread: [188,0,0] Assertion ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [12,1,0], thread: [189,0,0] Assertion ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [12,1,0], thread: [190,0,0] Assertion ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [12,1,0], thread: [191,0,0] Assertion ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds" failed.
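For context, these assert lines are PyTorch's generic bounds check in its CUDA gather kernels: some index tensor handed to a gather/index op contains a value outside the indexed dimension. The same class of failure can be reproduced in isolation (a minimal sketch, unrelated to this model; on CPU the bounds check surfaces as a catchable RuntimeError instead of a context-poisoning device-side assert):

```python
import torch

# An index value >= the indexed dimension's size violates the
# `ind >= 0 && ind < ind_dim_size` check seen in the log above.
# On CUDA this fires a device-side assert and kills the context
# (hence the engine dying); on CPU the same check raises RuntimeError.
src = torch.arange(10)
idx = torch.tensor([3, 99])  # 99 is out of bounds for size 10

try:
    torch.gather(src, 0, idx)
except RuntimeError as e:
    print("out-of-bounds gather:", e)
```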
(EngineCore pid=152) ERROR 05-07 04:51:50 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.20.2rc1.dev93+g26c6e0ace.d20260507) with config: model='Qwen/Qwen3.6-27B-FP8', speculative_config=SpeculativeConfig(method='dflash', model='z-lab/Qwen3.6-27B-DFlash', num_spec_tokens=6), tokenizer='Qwen/Qwen3.6-27B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=fp8, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3.6-27B-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/root/.cache/vllm/torch_compile_cache/23d22236d4', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8', '+quant_fp8', '+quant_fp8', '+quant_fp8', '+quant_fp8', '+quant_fp8', '+quant_fp8'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 
'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [7, 14, 21, 28, 35, 42, 49, 56, 70, 77, 84, 91, 98, 105, 112, 126, 133, 140, 147, 154, 161, 168, 182, 189, 196, 203, 210, 217, 224, 238, 245, 252, 259, 273, 294, 308, 322, 336, 357, 371, 385, 406, 420, 434, 448, 469, 483, 497], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 497, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': '/root/.cache/vllm/torch_compile_cache/23d22236d4/rank_0_0/eagle_head', 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), 
enable_flashinfer_autotune=False, moe_backend='auto'),
(EngineCore pid=152) ERROR 05-07 04:51:50 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-5e392e53-1efe-487d-bafb-b718601e27f4-9b8d30bc', 'chatcmpl-6d535732-1ea4-48a6-a0b7-b572119c6bef-96cf96bc', 'chatcmpl-802e1226-3c08-4f95-b960-fd9dc9dc5055-831efb8d'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={'chatcmpl-802e1226-3c08-4f95-b960-fd9dc9dc5055-831efb8d': 44178},new_block_ids=[None, None, None],num_computed_tokens=[79460, 76601, 44177],num_output_tokens=[5461, 0, 117]), num_scheduled_tokens={chatcmpl-802e1226-3c08-4f95-b960-fd9dc9dc5055-831efb8d: 7, chatcmpl-5e392e53-1efe-487d-bafb-b718601e27f4-9b8d30bc: 7, chatcmpl-6d535732-1ea4-48a6-a0b7-b572119c6bef-96cf96bc: 17}, total_num_scheduled_tokens=31, scheduled_spec_decode_tokens={chatcmpl-802e1226-3c08-4f95-b960-fd9dc9dc5055-831efb8d: [-1, -1, -1, -1, -1, -1], chatcmpl-5e392e53-1efe-487d-bafb-b718601e27f4-9b8d30bc: [-1, -1, -1, -1, -1, -1]}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=null)
(EngineCore pid=152) ERROR 05-07 04:51:50 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=3, num_waiting_reqs=0, num_skipped_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.34732824427480913, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] EngineCore encountered a fatal error.
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] Traceback (most recent call last):
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1129, in run_engine_core
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] engine_core.run_busy_loop()
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1170, in run_busy_loop
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] self._process_engine_step()
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1209, in _process_engine_step
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] outputs, model_executed = self.step_fn()
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 473, in step_with_batch_queue
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] exec_future = self.model_executor.execute_model(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 114, in execute_model
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] output.result()
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self.__get_result()
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] raise self._exception
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 84, in collective_rpc
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return func(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 337, in execute_model
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self.worker.execute_model(scheduler_output)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return func(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 841, in execute_model
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] output = self.model_runner.execute_model(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return func(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4114, in execute_model
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] model_output = self._model_forward(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3587, in _model_forward
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self.model(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self.runnable(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self._call_impl(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return forward_call(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 695, in forward
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] hidden_states = self.language_model.model(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 520, in __call__
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self.fn(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 495, in forward
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] def forward(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 217, in __call__
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self.optimized_call(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "<string>", line 838, in execution_fn
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "<string>", line 188, in __vllm_inlined_submods__87
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self._op(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 40, in wrapper
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return func(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/attention.py", line 723, in unified_attention_with_output
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] self.impl.forward(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 809, in forward
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] flash_attn_varlen_func(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 300, in flash_attn_varlen_func
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] out, softmax_lse = torch.ops._vllm_fa2_C.varlen_fwd(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self._op(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] torch.AcceleratorError: CUDA error: device-side assert triggered
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138]
(EngineCore pid=152) Process EngineCore:
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] AsyncLLM output_handler failed.
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] Traceback (most recent call last):
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 660, in output_handler
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] outputs = await engine_core.get_output_async()
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 998, in get_output_async
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] raise self._format_exception(outputs) from None
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] Error in chat completion stream generator.
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] Traceback (most recent call last):
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 519, in chat_completion_stream_generator
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] async for res in result_generator:
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 579, in generate
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] out = q.get_nowait() or await q.get()
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] ^^^^^^^^^^^^^
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 85, in get
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] raise output
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 660, in output_handler
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] outputs = await engine_core.get_output_async()
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 998, in get_output_async
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] raise self._format_exception(outputs) from None
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore pid=152) Traceback (most recent call last):
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1140, in run_engine_core
(EngineCore pid=152) raise e
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1129, in run_engine_core
(EngineCore pid=152) engine_core.run_busy_loop()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1170, in run_busy_loop
(EngineCore pid=152) self._process_engine_step()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1209, in _process_engine_step
(EngineCore pid=152) outputs, model_executed = self.step_fn()
(EngineCore pid=152) ^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 473, in step_with_batch_queue
(EngineCore pid=152) exec_future = self.model_executor.execute_model(
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 114, in execute_model
(EngineCore pid=152) output.result()
(EngineCore pid=152) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=152) return self.__get_result()
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=152) raise self._exception
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 84, in collective_rpc
(EngineCore pid=152) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=152) return func(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 337, in execute_model
(EngineCore pid=152) return self.worker.execute_model(scheduler_output)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=152) return func(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 841, in execute_model
(EngineCore pid=152) output = self.model_runner.execute_model(
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=152) return func(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4114, in execute_model
(EngineCore pid=152) model_output = self._model_forward(
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3587, in _model_forward
(EngineCore pid=152) return self.model(
(EngineCore pid=152) ^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(EngineCore pid=152) return self.runnable(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=152) return self._call_impl(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=152) return forward_call(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 695, in forward
(EngineCore pid=152) hidden_states = self.language_model.model(
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 520, in __call__
(EngineCore pid=152) return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
(EngineCore pid=152) return self.fn(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 495, in forward
(EngineCore pid=152) def forward(
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 217, in __call__
(EngineCore pid=152) return self.optimized_call(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "<string>", line 838, in execution_fn
(EngineCore pid=152) File "<string>", line 188, in __vllm_inlined_submods__87
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=152) return self._op(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 40, in wrapper
(EngineCore pid=152) return func(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/attention.py", line 723, in unified_attention_with_output
(EngineCore pid=152) self.impl.forward(
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 809, in forward
(EngineCore pid=152) flash_attn_varlen_func(
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 300, in flash_attn_varlen_func
(EngineCore pid=152) out, softmax_lse = torch.ops._vllm_fa2_C.varlen_fwd(
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=152) return self._op(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) torch.AcceleratorError: CUDA error: device-side assert triggered
(EngineCore pid=152) Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore pid=152) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=152) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=152) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore pid=152)
(EngineCore pid=152)
(EngineCore pid=152) During handling of the above exception, another exception occurred:
(EngineCore pid=152)
(EngineCore pid=152) Traceback (most recent call last):
(EngineCore pid=152) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=152) self.run()
(EngineCore pid=152) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=152) self._target(*self._args, **self._kwargs)
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1147, in run_engine_core
(EngineCore pid=152) engine_core.shutdown()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 574, in shutdown
(EngineCore pid=152) self.model_executor.shutdown()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 137, in shutdown
(EngineCore pid=152) worker.shutdown()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 212, in shutdown
(EngineCore pid=152) self.worker.shutdown()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 1049, in shutdown
(EngineCore pid=152) model_runner.shutdown()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5992, in shutdown
(EngineCore pid=152) self._cleanup_profiling_kv_cache()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6000, in _cleanup_profiling_kv_cache
(EngineCore pid=152) torch.accelerator.synchronize()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/accelerator/__init__.py", line 263, in synchronize
(EngineCore pid=152) torch._C._accelerator_synchronizeDevice(device_index)
(EngineCore pid=152) torch.AcceleratorError: CUDA error: device-side assert triggered
(EngineCore pid=152) Search for `cudaErrorAssert` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore pid=152) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=152) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=152) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=152)
(APIServer pid=77) INFO: 172.20.0.3:51526 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=77) INFO: 172.20.0.3:51540 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=77) INFO: 172.20.0.3:51544 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=77) INFO: Shutting down
(APIServer pid=77) INFO: Waiting for application shutdown.
(APIServer pid=77) INFO: Application shutdown complete.
(APIServer pid=77) INFO: Finished server process [77]

I suspect that sending multiple queries simultaneously easily triggers this error.
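To test that hypothesis, here is a minimal load sketch that fires several identical chat-completion requests at the server at once and collects the HTTP status codes (the 500s in the log above would show up here). The URL, model name, and payload are assumptions; adjust them to match your deployment.

```python
# Concurrency repro sketch (hypothetical endpoint/payload; adjust to your setup).
import json
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def post_chat(url: str, payload: dict) -> int:
    """Send one chat-completion request; return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=300) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # a 500 here matches the errors in the log above
    except urllib.error.URLError:
        return -1  # server unreachable (e.g. engine already crashed)


def fire_concurrent(url: str, payload: dict, n: int = 8) -> list[int]:
    """Issue n identical requests concurrently and collect status codes."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda _: post_chat(url, payload), range(n)))


# Example against an assumed local vLLM server:
#   body = {
#       "model": "Qwen/Qwen3.6-27B-FP8",
#       "messages": [{"role": "user", "content": "Hello"}],
#       "max_tokens": 64,
#   }
#   fire_concurrent("http://localhost:8000/v1/chat/completions", body, n=8)
```

If the assert only fires when `n > 1`, that would support the concurrency theory; a single-request loop completing cleanly while the concurrent run crashes is the signal to report upstream.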
