How can I load this model with vLLM or SGLang?
With SGLang, I got this error:
[2026-04-12 10:23:36 TP0] Load weight begin. avail mem=77.37 GB
[2026-04-12 10:23:36 TP0] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
[2026-04-12 10:23:36 TP2] Load weight begin. avail mem=77.32 GB
[2026-04-12 10:23:36 TP4] Load weight begin. avail mem=77.32 GB
[2026-04-12 10:23:36 TP2] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
[2026-04-12 10:23:36 TP6] Load weight begin. avail mem=77.32 GB
[2026-04-12 10:23:36 TP4] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
[2026-04-12 10:23:36 TP6] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
[2026-04-12 10:23:36 TP5] Load weight begin. avail mem=77.32 GB
[2026-04-12 10:23:36 TP5] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
[2026-04-12 10:23:36 TP7] Load weight begin. avail mem=77.56 GB
[2026-04-12 10:23:36 TP3] Load weight begin. avail mem=77.32 GB
[2026-04-12 10:23:36 TP3] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
[2026-04-12 10:23:36 TP7] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
[2026-04-12 10:23:36 TP1] Load weight begin. avail mem=77.32 GB
[2026-04-12 10:23:36 TP1] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
[2026-04-12 10:23:36 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3600, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 385, in __init__
    self.init_model_worker()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 632, in init_model_worker
    self.init_tp_model_worker()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 600, in init_tp_model_worker
    self.tp_worker = TpModelWorker(**worker_kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 261, in __init__
    self._init_model_runner()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 344, in _init_model_runner
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 427, in __init__
    self.initialize(pre_model_load_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 507, in initialize
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1159, in load_model
    self.model = self.loader.load_model(
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 683, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 277, in _initialize_model
    return model_class(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2142, in __init__
    self.model = DeepseekV2Model(
                 ^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1876, in __init__
    self.layers, self.start_layer, self.end_layer = make_layers(
                                                    ^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/utils/common.py", line 644, in make_layers
    + get_offloader().wrap_modules(
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/utils/offloader.py", line 36, in wrap_modules
    return list(all_modules_generator)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/utils/common.py", line 646, in <genexpr>
    layer_fn(idx=idx, prefix=add_prefix(idx, prefix))
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1878, in <lambda>
    lambda idx, prefix: DeepseekV2DecoderLayer(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1576, in __init__
    self.self_attn = DeepseekV2AttentionMLA(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1308, in __init__
    and self.fused_qkv_a_proj_with_mqa.weight.dtype == torch.bfloat16
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1964, in __getattr__
    raise AttributeError(
AttributeError: 'ReplicatedLinear' object has no attribute 'weight'
[2026-04-12 10:23:36] Received sigquit from a child process. It usually means the child failed.
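For debugging, a purely diagnostic first step is to look at how the checkpoint declares its quantization, since both the "Falling back to UnquantizedLinearMethod" warning and the missing `.weight` on ReplicatedLinear depend on which layers the compressed-tensors scheme covers. A minimal sketch (the model path is illustrative, use your local one):

```python
# Dump the checkpoint's quantization_config for inspection.
# Path is illustrative -- point it at your local model directory.
import json

with open("/models/GLM5.1-AWQ/config.json") as f:
    cfg = json.load(f)

qcfg = cfg.get("quantization_config", {})
print("quant_method :", qcfg.get("quant_method"))
print("format       :", qcfg.get("format"))
print("ignore list  :", qcfg.get("ignore"))
print("config_groups:", list(qcfg.get("config_groups", {}).keys()))
```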
Using the vLLM Docker image:
docker run -d \
  --restart=unless-stopped \
  --shm-size=320g \
  --gpus all \
  --network host \
  --ipc host \
  --name vllm-glm5.1 \
  vllm/vllm-openai:glm51-cu130 \
  /models/GLM5.1-AWQ \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --enforce-eager \
  --max-num-seqs 4 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.94 \
  --port 12345 \
  --host 0.0.0.0 \
  --served-model-name glm-5.1-awq
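(For reference, once a server started this way is healthy, a minimal smoke test against its OpenAI-compatible endpoint would look roughly like this, using the port and served model name from the command above:)

```python
# Minimal smoke test for the OpenAI-compatible endpoint.
# Port (12345) and model name (glm-5.1-awq) are taken from the docker
# command above; adjust if you changed them.
import requests

resp = requests.post(
    "http://localhost:12345/v1/chat/completions",
    json={
        "model": "glm-5.1-awq",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 16,
    },
    timeout=60,
)
print(resp.status_code)
print(resp.json()["choices"][0]["message"]["content"])
```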
Hardware: 8x H100
It failed to load:
(APIServer pid=1) INFO 04-12 12:16:28 [utils.py:299] [vLLM startup banner: version 0.19.1.dev1+g43a9b1afb, model /models/GLM5.1-AWQ]
(APIServer pid=1) INFO 04-12 12:16:28 [utils.py:233] non-default args: {'model_tag': '/models/GLM5.1-AWQ', 'enable_auto_tool_choice': True, 'tool_call_parser': 'glm47', 'host': '0.0.0.0', 'port': 12345, 'model': '/models/GLM5.1-AWQ', 'enforce_eager': True, 'served_model_name': ['glm-4.7-fp8'], 'reasoning_parser': 'glm45', 'tensor_parallel_size': 8, 'enable_expert_parallel': True, 'gpu_memory_utilization': 0.94, 'max_num_seqs': 4, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 1}}
(APIServer pid=1) INFO 04-12 12:16:28 [model.py:549] Resolved architecture: GlmMoeDsaForCausalLM
(APIServer pid=1) INFO 04-12 12:16:28 [model.py:1678] Using max model len 202752
(APIServer pid=1) INFO 04-12 12:16:28 [model.py:549] Resolved architecture: DeepSeekMTPModel
(APIServer pid=1) INFO 04-12 12:16:28 [model.py:1678] Using max model len 202752
(APIServer pid=1) INFO 04-12 12:16:28 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) INFO 04-12 12:16:28 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 04-12 12:16:28 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 04-12 12:16:28 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 04-12 12:16:28 [vllm.py:1025] Cudagraph is disabled under eager mode
(APIServer pid=1) INFO 04-12 12:16:31 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant, allreduce_rms
(APIServer pid=1) The following generation flags are not valid and may be ignored: ['top_p']. Set TRANSFORMERS_VERBOSITY=info for more details.
(EngineCore pid=208) INFO 04-12 12:16:38 [core.py:105] Initializing a V1 LLM engine (v0.19.1.dev1+g43a9b1afb) with config: model='/models/GLM5.1-AWQ', speculative_config=SpeculativeConfig(method='mtp', model='/models/GLM5.1-AWQ', num_spec_tokens=1), tokenizer='/models/GLM5.1-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=202752, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='glm45', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=glm-4.7-fp8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [42, 8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': True}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore pid=208) WARNING 04-12 12:16:38 [multiproc_executor.py:1014] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=208) INFO 04-12 12:16:38 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=192.168.254.101 (local), world_size=8, local_world_size=8
(Worker pid=407) INFO 04-12 12:16:46 [parallel_state.py:1400] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:36365 backend=nccl
(Worker pid=409) INFO 04-12 12:16:49 [parallel_state.py:1400] world_size=8 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:36365 backend=nccl
(Worker pid=414) INFO 04-12 12:16:49 [parallel_state.py:1400] world_size=8 rank=7 local_rank=7 distributed_init_method=tcp://127.0.0.1:36365 backend=nccl
(Worker pid=412) INFO 04-12 12:16:49 [parallel_state.py:1400] world_size=8 rank=5 local_rank=5 distributed_init_method=tcp://127.0.0.1:36365 backend=nccl
(Worker pid=408) INFO 04-12 12:16:50 [parallel_state.py:1400] world_size=8 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:36365 backend=nccl
(Worker pid=413) INFO 04-12 12:16:50 [parallel_state.py:1400] world_size=8 rank=6 local_rank=6 distributed_init_method=tcp://127.0.0.1:36365 backend=nccl
(Worker pid=411) INFO 04-12 12:16:50 [parallel_state.py:1400] world_size=8 rank=4 local_rank=4 distributed_init_method=tcp://127.0.0.1:36365 backend=nccl
(Worker pid=410) INFO 04-12 12:16:50 [parallel_state.py:1400] world_size=8 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:36365 backend=nccl
(Worker pid=407) INFO 04-12 12:16:50 [pynccl.py:111] vLLM is using nccl==2.28.9
(Worker pid=407) INFO 04-12 12:16:56 [parallel_state.py:1716] rank 0 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=414) WARNING 04-12 12:16:57 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=410) WARNING 04-12 12:16:57 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=412) WARNING 04-12 12:16:57 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=409) WARNING 04-12 12:16:57 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=411) WARNING 04-12 12:16:57 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=408) WARNING 04-12 12:16:57 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=407) WARNING 04-12 12:16:57 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker_TP0_EP0 pid=407) INFO 04-12 12:16:57 [gpu_model_runner.py:4735] Starting to load model /models/GLM5.1-AWQ...
(Worker pid=413) WARNING 04-12 12:16:57 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker_TP7_EP7 pid=414) INFO 04-12 12:16:57 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP5_EP5 pid=412) INFO 04-12 12:16:57 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP4_EP4 pid=411) INFO 04-12 12:16:57 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP3_EP3 pid=410) INFO 04-12 12:16:57 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP2_EP2 pid=409) INFO 04-12 12:16:57 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP1_EP1 pid=408) INFO 04-12 12:16:57 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP0_EP0 pid=407) INFO 04-12 12:16:57 [cuda.py:334] Using FLASHMLA_SPARSE attention backend out of potential backends: ['FLASHMLA_SPARSE'].
(Worker_TP0_EP0 pid=407) INFO 04-12 12:16:57 [deep_gemm.py:115] DeepGEMM E8M0 enabled on current platform.
(Worker_TP0_EP0 pid=407) INFO 04-12 12:16:57 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP0_EP0 pid=407) INFO 04-12 12:16:57 [layer.py:396] [EP Rank 0/8] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 32/256. Experts local to global index map: 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7, 8->8, 9->9, 10->10, 11->11, 12->12, 13->13, 14->14, 15->15, 16->16, 17->17, 18->18, 19->19, 20->20, 21->21, 22->22, 23->23, 24->24, 25->25, 26->26, 27->27, 28->28, 29->29, 30->30, 31->31.
(Worker_TP0_EP0 pid=407) INFO 04-12 12:16:57 [compressed_tensors_moe.py:194] Using CompressedTensorsWNA16MarlinMoEMethod
(Worker_TP0_EP0 pid=407) INFO 04-12 12:16:57 [compressed_tensors_moe.py:1180] Using Marlin backend for WNA16 MoE (group_size=32, num_bits=4)
(Worker_TP6_EP6 pid=413) INFO 04-12 12:16:57 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
Loading safetensors checkpoint shards: 0% Completed | 0/81 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 1% Completed | 1/81 [00:00<00:14, 5.63it/s]
Loading safetensors checkpoint shards: 2% Completed | 2/81 [00:00<00:37, 2.08it/s]
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] WorkerProc failed to start.
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] Traceback (most recent call last):
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 826, in worker_main
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] worker = WorkerProc(*args, **kwargs)
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] return func(*args, **kwargs)
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 613, in __init__
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] self.worker.load_model()
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] return func(*args, **kwargs)
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] self.model = model_loader.load_model(
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] return func(*args, **kwargs)
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 64, in load_model
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] self.load_weights(model, model_config)
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] return func(*args, **kwargs)
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 381, in load_weights
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1619, in load_weights
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] param = params_dict[name]
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] ~~~~~~~~~~~^^^^^^
(Worker_TP7_EP7 pid=414) ERROR 04-12 12:16:59 [multiproc_executor.py:857] KeyError: 'model.layers.3.self_attn.kv_a_proj_with_mqa.weight'
[... Worker_TP5_EP5, TP1, TP6, TP3, TP2, TP4, and TP0 each fail with the identical traceback, ending in the same KeyError: 'model.layers.3.self_attn.kv_a_proj_with_mqa.weight'; repeated tracebacks trimmed ...]
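The KeyError itself suggests the checkpoint ships a plain kv_a_proj_with_mqa tensor while the model graph vLLM builds for this architecture expects a different (e.g. fused) parameter name. A small sketch to list what the shards actually contain, assuming a standard sharded checkpoint with a model.safetensors.index.json next to the weights (path is illustrative):

```python
# List the attention tensors of layer 3 recorded in the safetensors index,
# to compare checkpoint tensor names against what the loader expects.
import json

with open("/models/GLM5.1-AWQ/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

for name in sorted(weight_map):
    if ".layers.3.self_attn." in name:
        print(name)
```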
[rank0]:[W412 12:17:01.879872988 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] EngineCore failed to start.
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] super().__init__(
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] self.model_executor = executor_class(vllm_config)
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 101, in __init__
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] super().__init__(vllm_config)
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=208) Process EngineCore:
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] self._init_executor()
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 190, in _init_executor
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 736, in wait_for_ready
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] raise e from None
(EngineCore pid=208) ERROR 04-12 12:17:03 [core.py:1108] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause
Thanks for letting me know :) Please re-download config.json; it should work now.
Let me know if the problem still occurs.
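If the weights were pulled with huggingface_hub, something like this refreshes only config.json in place (the repo id below is a placeholder, not the actual repo name):

```python
# Re-download only config.json on top of the existing local copy.
# repo_id is a placeholder -- substitute the repo this checkpoint came from;
# local_dir is the directory you pass to vLLM/SGLang.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="your-org/GLM-5.1-AWQ",  # placeholder
    filename="config.json",
    local_dir="/models/GLM5.1-AWQ",
    force_download=True,
)
```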
Command:
docker run -d \
  --shm-size=512g \
  --gpus all \
  -p 18889:18889 \
  -v /data/GLM-5.1-AWQ-4bit:/data/GLM-5.1-AWQ-4bit \
  vllm/vllm-openai:glm51-cu130 \
  /data/GLM-5.1-AWQ-4bit \
  --served-model-name minum-security-llm \
  --max-num-seqs 64 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.8 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 3 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 18889
Runtime log:
(APIServer pid=1) INFO 04-12 09:19:51 [model.py:549] Resolved architecture: GlmMoeDsaForCausalLM
(APIServer pid=1) INFO 04-12 09:19:51 [model.py:1678] Using max model len 200000
(APIServer pid=1) INFO 04-12 09:19:56 [model.py:549] Resolved architecture: DeepSeekMTPModel
(APIServer pid=1) INFO 04-12 09:19:56 [model.py:1678] Using max model len 202752
(APIServer pid=1) WARNING 04-12 09:19:56 [speculative.py:512] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 04-12 09:19:56 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) INFO 04-12 09:19:56 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 04-12 09:19:59 [compilation.py:290] Enabled custom fusions: allreduce_rms
(APIServer pid=1) The following generation flags are not valid and may be ignored: ['top_p']. Set TRANSFORMERS_VERBOSITY=info for more details.
(EngineCore pid=594) INFO 04-12 09:20:05 [core.py:105] Initializing a V1 LLM engine (v0.19.1.dev1+g43a9b1afb) with config: model='/data/GLM-5.1-AWQ-4bit', speculative_config=SpeculativeConfig(method='mtp', model='/data/GLM-5.1-AWQ-4bit', num_spec_tokens=3), tokenizer='/data/GLM-5.1-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=200000, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='glm45', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=minum-security-llm, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [42, 8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': True}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore pid=594) WARNING 04-12 09:20:05 [multiproc_executor.py:1014] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=594) INFO 04-12 09:20:05 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.2 (local), world_size=8, local_world_size=8
(Worker pid=793) INFO 04-12 09:20:10 [parallel_state.py:1400] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:43025 backend=nccl
(Worker pid=798) INFO 04-12 09:20:14 [parallel_state.py:1400] world_size=8 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:43025 backend=nccl
(Worker pid=804) INFO 04-12 09:20:19 [parallel_state.py:1400] world_size=8 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:43025 backend=nccl
(Worker pid=816) INFO 04-12 09:20:23 [parallel_state.py:1400] world_size=8 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:43025 backend=nccl
(Worker pid=828) INFO 04-12 09:20:28 [parallel_state.py:1400] world_size=8 rank=4 local_rank=4 distributed_init_method=tcp://127.0.0.1:43025 backend=nccl
(Worker pid=840) INFO 04-12 09:20:32 [parallel_state.py:1400] world_size=8 rank=5 local_rank=5 distributed_init_method=tcp://127.0.0.1:43025 backend=nccl
(Worker pid=852) INFO 04-12 09:20:37 [parallel_state.py:1400] world_size=8 rank=6 local_rank=6 distributed_init_method=tcp://127.0.0.1:43025 backend=nccl
(Worker pid=864) INFO 04-12 09:20:41 [parallel_state.py:1400] world_size=8 rank=7 local_rank=7 distributed_init_method=tcp://127.0.0.1:43025 backend=nccl
(Worker pid=793) INFO 04-12 09:20:42 [pynccl.py:111] vLLM is using nccl==2.28.9
(Worker pid=793) INFO 04-12 09:20:47 [parallel_state.py:1716] rank 0 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=816) WARNING 04-12 09:20:48 [init.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=864) WARNING 04-12 09:20:48 [init.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=852) WARNING 04-12 09:20:48 [init.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=793) WARNING 04-12 09:20:48 [init.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=828) WARNING 04-12 09:20:48 [init.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker_TP0_EP0 pid=793) INFO 04-12 09:20:48 [gpu_model_runner.py:4735] Starting to load model /data/GLM-5.1-AWQ-4bit...
(Worker pid=840) WARNING 04-12 09:20:48 [init.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=804) WARNING 04-12 09:20:48 [init.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker pid=798) WARNING 04-12 09:20:48 [init.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(Worker_TP3_EP3 pid=816) INFO 04-12 09:20:48 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP0_EP0 pid=793) INFO 04-12 09:20:48 [cuda.py:334] Using FLASHMLA_SPARSE attention backend out of potential backends: ['FLASHMLA_SPARSE'].
(Worker_TP7_EP7 pid=864) INFO 04-12 09:20:48 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP0_EP0 pid=793) INFO 04-12 09:20:48 [deep_gemm.py:115] DeepGEMM E8M0 enabled on current platform.
(Worker_TP6_EP6 pid=852) INFO 04-12 09:20:48 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP0_EP0 pid=793) INFO 04-12 09:20:48 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP4_EP4 pid=828) INFO 04-12 09:20:48 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP0_EP0 pid=793) INFO 04-12 09:20:48 [layer.py:396] [EP Rank 0/8] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 32/256. Experts local to global index map: 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7, 8->8, 9->9, 10->10, 11->11, 12->12, 13->13, 14->14, 15->15, 16->16, 17->17, 18->18, 19->19, 20->20, 21->21, 22->22, 23->23, 24->24, 25->25, 26->26, 27->27, 28->28, 29->29, 30->30, 31->31.
(Worker_TP0_EP0 pid=793) INFO 04-12 09:20:48 [compressed_tensors_moe.py:194] Using CompressedTensorsWNA16MarlinMoEMethod
(Worker_TP0_EP0 pid=793) INFO 04-12 09:20:48 [compressed_tensors_moe.py:1180] Using Marlin backend for WNA16 MoE (group_size=32, num_bits=4)
(Worker_TP2_EP2 pid=804) INFO 04-12 09:20:48 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP5_EP5 pid=840) INFO 04-12 09:20:48 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP1_EP1 pid=798) INFO 04-12 09:20:48 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
Loading safetensors checkpoint shards: 0% Completed | 0/81 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 88% Completed | 71/81 [00:36<00:03, 2.52it/s]
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] WorkerProc failed to start.
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] Traceback (most recent call last):
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 826, in worker_main
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] worker = WorkerProc(*args, **kwargs)
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] return func(*args, **kwargs)
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 613, in init
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] self.worker.load_model()
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] return func(*args, **kwargs)
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4760, in load_model
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] self.drafter.load_model(self.model)
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/eagle.py", line 1254, in load_model
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] self.model = self._get_model()
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/eagle.py", line 1239, in _get_model
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] model = get_model(
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] ^^^^^^^^^^
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/init.py", line 138, in get_model
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] return loader.load_model(
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] return func(*args, **kwargs)
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 64, in load_model
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] self.load_weights(model, model_config)
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] return func(*args, **kwargs)
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 381, in load_weights
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_mtp.py", line 430, in load_weights
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] raise ValueError(
(Worker_TP4_EP4 pid=828) ERROR 04-12 09:21:26 [multiproc_executor.py:857] ValueError: MTP speculative decoding layer 78 weights missing from checkpoint. The checkpoint may have been quantized without including the MTP layers. Use a checkpoint that includes MTP layer weights, or disable speculative decoding.
Loading safetensors checkpoint shards: 89% Completed | 72/81 [00:37<00:03, 2.52it/s]
(Worker_TP6_EP6 pid=852) ERROR 04-12 09:21:27 [multiproc_executor.py:857] WorkerProc failed to start.
(Worker_TP6_EP6 pid=852) ERROR 04-12 09:21:27 [multiproc_executor.py:857] ValueError: MTP speculative decoding layer 78 weights missing from checkpoint. The checkpoint may have been quantized without including the MTP layers. Use a checkpoint that includes MTP layer weights, or disable speculative decoding.
Loading safetensors checkpoint shards: 93% Completed | 75/81 [00:38<00:02, 2.50it/s]
I'm already using the newest config.json, but it still seems like something is wrong in config.json.
It looks like this AWQ-quantized model was not generated with the latest versions of llm-compressor and transformers; support for the GLM5.1 model architecture was only just added in the latest version of vLLM.
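If you want to confirm that the MTP weights really are absent from the checkpoint, here is a rough check. It assumes the standard Hugging Face sharded layout (a weight_map in model.safetensors.index.json) and that the MTP layer from the ValueError above appears under a "layers.78." key prefix; neither assumption is confirmed by the logs:

```bash
# Hedged diagnostic sketch: count tensors belonging to layer 78 (the MTP
# layer named in the ValueError) in the shard index. Assumes the standard
# HF sharded layout and "layers.78." key naming.
grep -o '"[^"]*layers\.78\.[^"]*"' /data/GLM-5.1-AWQ-4bit/model.safetensors.index.json | wc -l
# prints 0 if the MTP weights were dropped during quantization
```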
At the moment, the MTP layers are not implemented, so please remove any MTP flags :) such as
--speculative-config.method mtp
--speculative-config.num_speculative_tokens 1
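For example, a minimal relaunch sketch without the speculative-decoding flags might look like this. The values are copied from the config dump above (model path, parallelism, context length, served name); treat it as an illustration of "same command, minus the MTP flags" rather than a verified full command:

```bash
# Relaunch sketch without speculative decoding; adjust to your setup.
vllm serve /data/GLM-5.1-AWQ-4bit \
  --served-model-name minum-security-llm \
  --tensor-parallel-size 8 \
  --max-model-len 200000 \
  --trust-remote-code
# i.e. the same launch as before, just without any --speculative-config.* flags
```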
OK, I will try it.
Works for me, thx~~
But it still can't run at the full 200k context size.