vLLM 0.18.1rc1.dev32+g1f0d21064 fails to start: RuntimeError: size_n = 864 is not divisible by tile_n_size = 64 (Marlin FP8 repack)

Serving a compressed-tensors FP8 checkpoint of Qwen3.5-27B with tensor_parallel_size=4 crashes during model loading. The GPUs (compute capability 8.6) have no native FP8 support, so vLLM falls back to the Marlin weight-only FP8 kernel, and every worker then aborts inside gptq_marlin_repack with the divisibility error above. Full docker log:
(base) cheng@cheng:/model$ docker logs -f vllm
(APIServer pid=1) INFO 03-28 01:15:12 [utils.py:297]
(APIServer pid=1) INFO 03-28 01:15:12 [utils.py:297] █ █ █▄ ▄█
(APIServer pid=1) INFO 03-28 01:15:12 [utils.py:297] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.18.1rc1.dev32+g1f0d21064
(APIServer pid=1) INFO 03-28 01:15:12 [utils.py:297] █▄█▀ █ █ █ █ model /model
(APIServer pid=1) INFO 03-28 01:15:12 [utils.py:297] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 03-28 01:15:12 [utils.py:297]
(APIServer pid=1) INFO 03-28 01:15:12 [utils.py:233] non-default args: {'model_tag': '/model', 'default_chat_template_kwargs': {'enable_thinking': False}, 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'api_key': ['abc123'], 'model': '/model', 'max_model_len': 131072, 'served_model_name': ['Qwen3.5-27B'], 'generation_config': 'vllm', 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 4, 'disable_custom_all_reduce': True, 'gpu_memory_utilization': 0.888, 'limit_mm_per_prompt': {'video': 0}, 'mm_encoder_attn_backend': 'TORCH_SDPA', 'max_num_seqs': 10, 'async_scheduling': True, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 4}, 'attention_config': AttentionConfig(backend=<AttentionBackendEnum.FLASHINFER: 'vllm.v1.attention.backends.flashinfer.FlashInferBackend'>, flash_attn_version=None, use_prefill_decode_attention=False, flash_attn_max_num_splits_for_cuda_graph=32, use_cudnn_prefill=False, use_trtllm_ragged_deepseek_prefill=False, use_trtllm_attention=None, disable_flashinfer_prefill=True, disable_flashinfer_q_quantization=False, use_prefill_query_quantization=False), 'enable_log_requests': True}
(APIServer pid=1) INFO 03-28 01:15:20 [model.py:540] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) INFO 03-28 01:15:20 [model.py:1606] Using max model len 131072
(APIServer pid=1) INFO 03-28 01:15:27 [model.py:540] Resolved architecture: Qwen3_5MTP
(APIServer pid=1) INFO 03-28 01:15:27 [model.py:1606] Using max model len 262144
(APIServer pid=1) WARNING 03-28 01:15:27 [speculative.py:499] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 03-28 01:15:28 [config.py:228] Setting attention block size to 816 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 03-28 01:15:28 [config.py:259] Padding mamba page size by 1.62% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 03-28 01:15:28 [vllm.py:750] Asynchronous scheduling is enabled.
(EngineCore pid=41) INFO 03-28 01:15:41 [core.py:105] Initializing a V1 LLM engine (v0.18.1rc1.dev32+g1f0d21064) with config: model='/model', speculative_config=SpeculativeConfig(method='mtp', model='/model', num_spec_tokens=4), tokenizer='/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-27B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 96, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore pid=41) INFO 03-28 01:15:41 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.2 (local), world_size=4, local_world_size=4
(Worker pid=52) INFO 03-28 01:15:49 [parallel_state.py:1400] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:59741 backend=nccl
(Worker pid=53) INFO 03-28 01:15:49 [parallel_state.py:1400] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:59741 backend=nccl
(Worker pid=55) INFO 03-28 01:15:49 [parallel_state.py:1400] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:59741 backend=nccl
(Worker pid=54) INFO 03-28 01:15:49 [parallel_state.py:1400] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:59741 backend=nccl
(Worker pid=52) :1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=55) :1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=53) :1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=52) :1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=55) :1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=53) :1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=52) INFO 03-28 01:15:49 [pynccl.py:111] vLLM is using nccl==2.27.5
(Worker pid=54) :1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=54) :1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=52) WARNING 03-28 01:15:50 [symm_mem.py:68] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
(Worker pid=53) WARNING 03-28 01:15:50 [symm_mem.py:68] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
(Worker pid=54) WARNING 03-28 01:15:50 [symm_mem.py:68] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
(Worker pid=55) WARNING 03-28 01:15:50 [symm_mem.py:68] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
(Worker pid=52) INFO 03-28 01:15:50 [parallel_state.py:1716] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=52) INFO 03-28 01:15:50 [topk_topp_sampler.py:51] Using FlashInfer for top-p & top-k sampling.
(Worker pid=54) WARNING 03-28 01:15:50 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=52) WARNING 03-28 01:15:50 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=53) WARNING 03-28 01:15:50 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=55) WARNING 03-28 01:15:50 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(Worker_TP0 pid=52) INFO 03-28 01:15:56 [gpu_model_runner.py:4493] Starting to load model /model...
(Worker_TP0 pid=52) INFO 03-28 01:15:56 [cuda.py:372] Using backend AttentionBackendEnum.TORCH_SDPA for vit attention
(Worker_TP0 pid=52) INFO 03-28 01:15:56 [mm_encoder_attention.py:230] Using AttentionBackendEnum.TORCH_SDPA for MMEncoderAttention.
(Worker_TP0 pid=52) INFO 03-28 01:15:56 [qwen3_next.py:202] Using Triton/FLA GDN prefill kernel
(Worker_TP0 pid=52) INFO 03-28 01:15:56 [cuda.py:274] Using AttentionBackendEnum.FLASHINFER backend.
(Worker_TP1 pid=53) INFO 03-28 01:15:56 [cuda.py:274] Using AttentionBackendEnum.FLASHINFER backend.
(Worker_TP2 pid=54) INFO 03-28 01:15:56 [cuda.py:274] Using AttentionBackendEnum.FLASHINFER backend.
(Worker_TP3 pid=55) INFO 03-28 01:15:56 [cuda.py:274] Using AttentionBackendEnum.FLASHINFER backend.
Loading safetensors checkpoint shards: 0% Completed | 0/11 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 9% Completed | 1/11 [00:00<00:02, 5.00it/s]
Loading safetensors checkpoint shards: 18% Completed | 2/11 [00:00<00:01, 5.42it/s]
Loading safetensors checkpoint shards: 27% Completed | 3/11 [00:00<00:01, 5.44it/s]
Loading safetensors checkpoint shards: 36% Completed | 4/11 [00:00<00:01, 5.46it/s]
Loading safetensors checkpoint shards: 45% Completed | 5/11 [00:00<00:01, 5.43it/s]
Loading safetensors checkpoint shards: 55% Completed | 6/11 [00:01<00:00, 5.40it/s]
Loading safetensors checkpoint shards: 64% Completed | 7/11 [00:01<00:00, 5.45it/s]
Loading safetensors checkpoint shards: 73% Completed | 8/11 [00:01<00:00, 5.70it/s]
Loading safetensors checkpoint shards: 82% Completed | 9/11 [00:01<00:00, 5.82it/s]
Loading safetensors checkpoint shards: 91% Completed | 10/11 [00:01<00:00, 5.08it/s]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:02<00:00, 4.91it/s]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:02<00:00, 5.27it/s]
(Worker_TP0 pid=52)
(Worker_TP0 pid=52) INFO 03-28 01:15:59 [default_loader.py:384] Loading weights took 2.10 seconds
(Worker_TP0 pid=52) WARNING 03-28 01:15:59 [marlin_utils_fp8.py:97] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] WorkerProc failed to start.
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] Traceback (most recent call last):
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 826, in worker_main
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] worker = WorkerProc(*args, **kwargs)
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] return func(*args, **kwargs)
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 613, in __init__
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] self.worker.load_model()
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] return func(*args, **kwargs)
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4509, in load_model
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] self.model = model_loader.load_model(
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] return func(*args, **kwargs)
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 74, in load_model
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] process_weights_after_loading(model, model_config, target_device)
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 106, in process_weights_after_loading
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] quant_method.process_weights_after_loading(module)
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 878, in process_weights_after_loading
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] layer.scheme.process_weights_after_loading(layer)
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py", line 133, in process_weights_after_loading
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] prepare_fp8_layer_for_marlin(layer, size_k_first=size_k_first)
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 127, in prepare_fp8_layer_for_marlin
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] marlin_qweight = ops.gptq_marlin_repack(
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 1263, in gptq_marlin_repack
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] return torch.ops._C.gptq_marlin_repack(
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1209, in __call__
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] return self._op(*args, **kwargs)
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=54) ERROR 03-28 01:15:59 [multiproc_executor.py:857] RuntimeError: size_n = 864 is not divisible by tile_n_size = 64
[Worker_TP0 (pid=52), Worker_TP1 (pid=53), and Worker_TP3 (pid=55) fail with the identical traceback, each ending in the same RuntimeError: size_n = 864 is not divisible by tile_n_size = 64; the three duplicate tracebacks are omitted for brevity.]
[rank0]:[W328 01:16:00.892524535 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] EngineCore failed to start.
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=41) Process EngineCore:
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] super().__init__(
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] self.model_executor = executor_class(vllm_config)
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 101, in __init__
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] super().__init__(vllm_config)
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] self._init_executor()
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 190, in _init_executor
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 736, in wait_for_ready
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] raise e from None
(EngineCore pid=41) ERROR 03-28 01:16:01 [core.py:1108] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore pid=41) Traceback (most recent call last):
(EngineCore pid=41) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=41) self.run()
(EngineCore pid=41) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=41) self._target(*self._args, **self._kwargs)
(EngineCore pid=41) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core
(EngineCore pid=41) raise e
(EngineCore pid=41) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=41) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=41) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=41) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=41) return func(*args, **kwargs)
(EngineCore pid=41) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=41) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=41) super().__init__(
(EngineCore pid=41) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=41) self.model_executor = executor_class(vllm_config)
(EngineCore pid=41) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=41) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 101, in __init__
(EngineCore pid=41) super().__init__(vllm_config)
(EngineCore pid=41) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=41) return func(*args, **kwargs)
(EngineCore pid=41) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=41) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=41) self._init_executor()
(EngineCore pid=41) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 190, in _init_executor
(EngineCore pid=41) self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore pid=41) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=41) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 736, in wait_for_ready
(EngineCore pid=41) raise e from None
(EngineCore pid=41) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 127, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
(base) cheng@cheng:/model$
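
What the error means (a sketch, not verified against the checkpoint): Marlin's repack kernel requires each tensor-parallel rank's output width, size_n, to be a multiple of tile_n_size = 64. Under tensor_parallel_size=4, some FP8-quantized layer in this model shards to size_n = 864, and 864 = 13 * 64 + 32, so the repack aborts on every rank. If one assumes the layer's unsharded width is 864 * 4 = 3456 (an assumption; the log does not name the layer), the small check below shows which TP degrees keep the shard tile-aligned:

# Hypothetical divisibility check. Assumes the failing layer's full
# (unsharded) output width is 864 * tp_size = 3456; the log only
# reports the per-rank value size_n = 864 at tp=4.
TILE_N_SIZE = 64    # Marlin's tile width, from the error message
FULL_N = 864 * 4    # assumed unsharded width of the failing layer

for tp in (1, 2, 4, 8):
    size_n = FULL_N // tp  # per-rank output width under tensor parallelism
    print(f"tp={tp}: size_n={size_n}, size_n % {TILE_N_SIZE} = {size_n % TILE_N_SIZE}")

# expected output:
# tp=1: size_n=3456, size_n % 64 = 0
# tp=2: size_n=1728, size_n % 64 = 0
# tp=4: size_n=864, size_n % 64 = 32    (this run's crash)
# tp=8: size_n=432, size_n % 64 = 48

If that assumption holds, tensor_parallel_size=2 (or 1) would keep this layer tile-aligned and sidestep the repack failure; the proper fix presumably belongs in vLLM or the checkpoint itself, e.g. padding size_n to a multiple of 64 before repacking, or avoiding the Marlin FP8 fallback on pre-Hopper GPUs.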