About your vllm-mi100

#1
by hnhyzz - opened

Hi btbtyler09,

When I use your vllm-mi100 Docker image to run Qwen3.5-397B-A17B-GPTQ-Int4, the decoded responses are all exclamation marks. For smaller models like 122B-A10B, it works fine.

Could you point me to where the bug might be? Do you think float16 is the reason?

Hi @hnhyzz,

I ran into the same issue when I enabled AITER Unified Attention on the smaller models I tested. Unfortunately, with only 4 GPUs I can't debug the larger model, but I can recommend some things to try. By default, I disabled AITER UA and only enabled the AITER Triton RoPE to avoid this issue. Your best bet is to try disabling AITER entirely with VLLM_ROCM_USE_AITER=0.

There really wasn't much from AITER we could run anyway, so I may have been foolish to enable any of it. The other thing to consider is the TP and PP size. I think MI100s have some known issues when running TP=8. You could run TP=4 with PP=2, but I'm not sure that properly groups the two hives of GPUs. That may just come down to GPU ordering, which you can control by listing the GPU IDs in order with ROCR_VISIBLE_DEVICES.
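For example, a minimal sketch of what I mean (the device IDs and model path here are placeholders, and I'm assuming the rest of your launch flags stay the same):

# disable AITER entirely and pin the GPU order so each PP stage maps onto one hive
export VLLM_ROCM_USE_AITER=0
export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

vllm serve /models/<model-path> \
  --dtype float16 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2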

What is your full launch command? I may be able to help debug if I see what you're using.

I disabled AITER entirely with VLLM_ROCM_USE_AITER=0, but I still got all exclamation marks.
I use the following commands to start the Docker container and vLLM.
I think TP+PP is OK, since it worked on the Qwen3.5 122B model.
I also tried pure TP=8, but it still outputs exclamation marks.

docker run -it \
  --network=host \
  --group-add=video \
  --shm-size=16gb \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device=/dev/kfd \
  --device=/dev/dri/ \
  --env VLLM_ROCM_USE_AITER=0 \
  --env CUDA_VISIBLE_DEVICES=1,4,5,6,0,2,3,7 \
  -v $HOME/models:/models \
  docker.1ms.run/btbtyler09/vllm-rocm-gfx908:v0.16.1.dev \
  bash

vllm serve /models/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/hash \
  --enforce-eager \
  --dtype float16 \
  --api-key xxxxxxxx \
  --served-model-name qwen3.5 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 256000 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --skip-mm-profiling \
  --limit-mm-per-prompt '{"image": 2}'

Hmm... I probably can't help much here, to be honest. If you disabled AITER you should be falling back to the standard Triton kernel pipeline, which has been pretty reliable as of late. Can you see that in the startup output? Can you share the full startup log? The way I have debugged these issues has been pretty brute force: I just try kernels until something works.
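For example, one way to brute-force it is to pin the attention backend explicitly and retry (just a sketch; I'm assuming VLLM_ATTENTION_BACKEND still works in your build, and the exact backend names vary by vLLM version, so check what the startup log reports):

# with AITER off, try one attention backend at a time and see which gives sane output
export VLLM_ROCM_USE_AITER=0
export VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1   # backend name is version-dependent
vllm serve /models/<model-path> --dtype float16 --enforce-eager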

Thank you for your kind help. The startup log is below.
May I ask whether any changes were made to the official vLLM repo to support the MI100?
I may compile the latest vLLM and retry.

./vllm.sh
root@G482-Z52:/app# vllm serve /models/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/hash
--enforce-eager
--dtype float16
--api-key xxxxxx
--served-model-name qwen3.5
--gpu-memory-utilization 0.9
--max-model-len 256000
--tensor-parallel-size 4
--pipeline-parallel-size 2
--enable-auto-tool-choice --tool-call-parser qwen3_coder
--skip-mm-profiling
--limit-mm-per-prompt '{"image": 2}'
WARNING 03-24 00:41:20 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
(APIServer pid=13) INFO 03-24 00:41:24 [utils.py:293]
(APIServer pid=13) INFO 03-24 00:41:24 [utils.py:293] █ █ █▄ ▄█
(APIServer pid=13) INFO 03-24 00:41:24 [utils.py:293] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.16.1rc1.dev151+gd3bab5eb0
(APIServer pid=13) INFO 03-24 00:41:24 [utils.py:293] █▄█▀ █ █ █ █ model /models/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/hash
(APIServer pid=13) INFO 03-24 00:41:24 [utils.py:293] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=13) INFO 03-24 00:41:24 [utils.py:293]
(APIServer pid=13) INFO 03-24 00:41:24 [utils.py:229] non-default args: {'model_tag': '/models/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/hash', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'api_key': ['xxxxxx'], 'model': '/models/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/hash', 'dtype': 'float16', 'max_model_len': 256000, 'enforce_eager': True, 'served_model_name': ['qwen3.5'], 'pipeline_parallel_size': 2, 'tensor_parallel_size': 4, 'limit_mm_per_prompt': {'image': 2}, 'skip_mm_profiling': True}
(APIServer pid=13) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'}
(APIServer pid=13) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'}
(APIServer pid=13) INFO 03-24 00:41:33 [model.py:530] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=13) WARNING 03-24 00:41:33 [model.py:1891] Casting torch.bfloat16 to torch.float16.
(APIServer pid=13) INFO 03-24 00:41:33 [model.py:1553] Using max model len 256000
(APIServer pid=13) [aiter] WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing
(APIServer pid=13) [2026-03-24 00:41:34] WARNING core.py:479: WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing
(APIServer pid=13) [aiter] start build [module_aiter_enum] under /workspace/aiter/aiter/jit/build/module_aiter_enum
(APIServer pid=13) [2026-03-24 00:41:34] INFO core.py:550: start build [module_aiter_enum] under /workspace/aiter/aiter/jit/build/module_aiter_enum
(APIServer pid=13) [aiter] finish build [module_aiter_enum], cost 17.5s
(APIServer pid=13) [2026-03-24 00:41:52] INFO core.py:700: finish build [module_aiter_enum], cost 17.5s
(APIServer pid=13) [aiter] import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so
(APIServer pid=13) [2026-03-24 00:41:52] INFO core.py:502: import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so
(APIServer pid=13) INFO 03-24 00:41:53 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=13) INFO 03-24 00:41:53 [config.py:536] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=13) INFO 03-24 00:41:53 [config.py:567] Padding mamba page size by 1.34% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=13) WARNING 03-24 00:41:53 [gptq.py:99] Currently, the 4-bit gptq_gemm kernel for GPTQ is buggy. Please switch to gptq_marlin.
(APIServer pid=13) INFO 03-24 00:41:55 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=13) WARNING 03-24 00:41:55 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=13) WARNING 03-24 00:41:55 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=13) INFO 03-24 00:41:55 [vllm.py:930] Cudagraph is disabled under eager mode
WARNING 03-24 00:42:16 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
(EngineCore_DP0 pid=502) INFO 03-24 00:42:17 [core.py:101] Initializing a V1 LLM engine (v0.16.1rc1.dev151+gd3bab5eb0) with config: model='/models/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/hash', speculative_config=None, tokenizer='/models/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/hash', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=256000, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=2, data_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen3.5, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+sparse_attn_indexer', 'all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False, 'fuse_rope_kvcache': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=502) WARNING 03-24 00:42:17 [multiproc_executor.py:945] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=502) INFO 03-24 00:42:17 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.176.26.34 (local), world_size=8, local_world_size=8
WARNING 03-24 00:42:24 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
(Worker pid=604) INFO 03-24 00:42:28 [parallel_state.py:1392] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:49609 backend=nccl
WARNING 03-24 00:42:32 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
(Worker pid=610) INFO 03-24 00:42:35 [parallel_state.py:1392] world_size=8 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:49609 backend=nccl
WARNING 03-24 00:42:40 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
(Worker pid=626) INFO 03-24 00:42:43 [parallel_state.py:1392] world_size=8 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:49609 backend=nccl
WARNING 03-24 00:42:48 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
(Worker pid=646) INFO 03-24 00:42:51 [parallel_state.py:1392] world_size=8 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:49609 backend=nccl
WARNING 03-24 00:42:56 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
(Worker pid=666) INFO 03-24 00:42:59 [parallel_state.py:1392] world_size=8 rank=4 local_rank=4 distributed_init_method=tcp://127.0.0.1:49609 backend=nccl
WARNING 03-24 00:43:04 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
(Worker pid=686) INFO 03-24 00:43:07 [parallel_state.py:1392] world_size=8 rank=5 local_rank=5 distributed_init_method=tcp://127.0.0.1:49609 backend=nccl
WARNING 03-24 00:43:11 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
(Worker pid=706) INFO 03-24 00:43:15 [parallel_state.py:1392] world_size=8 rank=6 local_rank=6 distributed_init_method=tcp://127.0.0.1:49609 backend=nccl
WARNING 03-24 00:43:19 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
(Worker pid=726) INFO 03-24 00:43:23 [parallel_state.py:1392] world_size=8 rank=7 local_rank=7 distributed_init_method=tcp://127.0.0.1:49609 backend=nccl
(Worker pid=604) INFO 03-24 00:43:23 [pynccl.py:111] vLLM is using nccl==2.26.6
(Worker pid=626) INFO 03-24 00:43:32 [parallel_state.py:1714] rank 2 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 2, EP rank 2, EPLB rank N/A
(Worker pid=604) INFO 03-24 00:43:32 [parallel_state.py:1714] rank 0 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=646) INFO 03-24 00:43:32 [parallel_state.py:1714] rank 3 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 3, EP rank 3, EPLB rank N/A
(Worker pid=610) INFO 03-24 00:43:32 [parallel_state.py:1714] rank 1 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1, EPLB rank N/A
(Worker pid=706) INFO 03-24 00:43:32 [parallel_state.py:1714] rank 6 in world size 8 is assigned as DP rank 0, PP rank 1, PCP rank 0, TP rank 2, EP rank 2, EPLB rank N/A
(Worker pid=666) INFO 03-24 00:43:32 [parallel_state.py:1714] rank 4 in world size 8 is assigned as DP rank 0, PP rank 1, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=686) INFO 03-24 00:43:32 [parallel_state.py:1714] rank 5 in world size 8 is assigned as DP rank 0, PP rank 1, PCP rank 0, TP rank 1, EP rank 1, EPLB rank N/A
(Worker pid=726) INFO 03-24 00:43:32 [parallel_state.py:1714] rank 7 in world size 8 is assigned as DP rank 0, PP rank 1, PCP rank 0, TP rank 3, EP rank 3, EPLB rank N/A
(Worker pid=686) INFO 03-24 00:43:43 [base.py:106] Offloader set to NoopOffloader
(Worker pid=706) INFO 03-24 00:43:43 [base.py:106] Offloader set to NoopOffloader
(Worker pid=610) INFO 03-24 00:43:43 [base.py:106] Offloader set to NoopOffloader
(Worker pid=626) INFO 03-24 00:43:43 [base.py:106] Offloader set to NoopOffloader
(Worker pid=604) INFO 03-24 00:43:43 [base.py:106] Offloader set to NoopOffloader
(Worker pid=604) (Worker_PP0_TP0 pid=604) INFO 03-24 00:43:43 [gpu_model_runner.py:4202] Starting to load model /models/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/hash...
(Worker pid=666) INFO 03-24 00:43:43 [base.py:106] Offloader set to NoopOffloader
(Worker pid=726) INFO 03-24 00:43:43 [base.py:106] Offloader set to NoopOffloader
(Worker pid=686) (Worker_PP1_TP1 pid=686) INFO 03-24 00:43:44 [rocm.py:556] Using Flash Attention backend for ViT model.
(Worker pid=686) (Worker_PP1_TP1 pid=686) WARNING 03-24 00:43:44 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none').
(Worker pid=706) (Worker_PP1_TP2 pid=706) INFO 03-24 00:43:44 [rocm.py:556] Using Flash Attention backend for ViT model.
(Worker pid=706) (Worker_PP1_TP2 pid=706) WARNING 03-24 00:43:44 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none').
(Worker pid=686) (Worker_PP1_TP1 pid=686) INFO 03-24 00:43:44 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker pid=706) (Worker_PP1_TP2 pid=706) INFO 03-24 00:43:44 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker pid=610) (Worker_PP0_TP1 pid=610) INFO 03-24 00:43:44 [rocm.py:556] Using Flash Attention backend for ViT model.
(Worker pid=610) (Worker_PP0_TP1 pid=610) WARNING 03-24 00:43:44 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none').
(Worker pid=610) (Worker_PP0_TP1 pid=610) INFO 03-24 00:43:44 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker pid=626) (Worker_PP0_TP2 pid=626) INFO 03-24 00:43:44 [rocm.py:556] Using Flash Attention backend for ViT model.
(Worker pid=626) (Worker_PP0_TP2 pid=626) WARNING 03-24 00:43:44 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none').
(Worker pid=626) (Worker_PP0_TP2 pid=626) INFO 03-24 00:43:44 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker pid=604) (Worker_PP0_TP0 pid=604) INFO 03-24 00:43:44 [rocm.py:556] Using Flash Attention backend for ViT model.
(Worker pid=604) (Worker_PP0_TP0 pid=604) WARNING 03-24 00:43:44 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none').
(Worker pid=666) (Worker_PP1_TP0 pid=666) INFO 03-24 00:43:44 [rocm.py:556] Using Flash Attention backend for ViT model.
(Worker pid=666) (Worker_PP1_TP0 pid=666) WARNING 03-24 00:43:44 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none').
(Worker pid=604) (Worker_PP0_TP0 pid=604) INFO 03-24 00:43:44 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker pid=666) (Worker_PP1_TP0 pid=666) INFO 03-24 00:43:44 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker pid=646) INFO 03-24 00:43:44 [base.py:106] Offloader set to NoopOffloader
(Worker pid=726) (Worker_PP1_TP3 pid=726) INFO 03-24 00:43:45 [rocm.py:556] Using Flash Attention backend for ViT model.
(Worker pid=726) (Worker_PP1_TP3 pid=726) WARNING 03-24 00:43:45 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none').
(Worker pid=726) (Worker_PP1_TP3 pid=726) INFO 03-24 00:43:45 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker pid=646) (Worker_PP0_TP3 pid=646) INFO 03-24 00:43:45 [rocm.py:556] Using Flash Attention backend for ViT model.
(Worker pid=646) (Worker_PP0_TP3 pid=646) WARNING 03-24 00:43:45 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none').
(Worker pid=646) (Worker_PP0_TP3 pid=646) INFO 03-24 00:43:45 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker pid=706) (Worker_PP1_TP2 pid=706) INFO 03-24 00:43:47 [rocm.py:510] Using Triton Attention backend.
(Worker pid=686) (Worker_PP1_TP1 pid=686) INFO 03-24 00:43:47 [rocm.py:510] Using Triton Attention backend.
(Worker pid=610) (Worker_PP0_TP1 pid=610) INFO 03-24 00:43:47 [rocm.py:510] Using Triton Attention backend.
(Worker pid=666) (Worker_PP1_TP0 pid=666) INFO 03-24 00:43:47 [rocm.py:510] Using Triton Attention backend.
(Worker pid=626) (Worker_PP0_TP2 pid=626) INFO 03-24 00:43:47 [rocm.py:510] Using Triton Attention backend.
(Worker pid=604) (Worker_PP0_TP0 pid=604) INFO 03-24 00:43:47 [rocm.py:510] Using Triton Attention backend.
(Worker pid=726) (Worker_PP1_TP3 pid=726) INFO 03-24 00:43:48 [rocm.py:510] Using Triton Attention backend.
(Worker pid=646) (Worker_PP0_TP3 pid=646) INFO 03-24 00:43:49 [rocm.py:510] Using Triton Attention backend.
(Worker pid=610) (Worker_PP0_TP1 pid=610) WARNING 03-24 00:43:51 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect
(Worker pid=706) (Worker_PP1_TP2 pid=706) WARNING 03-24 00:43:51 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect
(Worker pid=604) (Worker_PP0_TP0 pid=604) WARNING 03-24 00:43:52 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect
(Worker pid=686) (Worker_PP1_TP1 pid=686) WARNING 03-24 00:43:52 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect
(Worker pid=626) (Worker_PP0_TP2 pid=626) WARNING 03-24 00:43:52 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect
(Worker pid=666) (Worker_PP1_TP0 pid=666) WARNING 03-24 00:43:52 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect
Loading safetensors checkpoint shards: 0% Completed | 0/94 [00:00<?, ?it/s]
(Worker pid=726) (Worker_PP1_TP3 pid=726) WARNING 03-24 00:43:53 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect
Loading safetensors checkpoint shards: 1% Completed | 1/94 [00:00<00:48, 1.92it/s]
Loading safetensors checkpoint shards: 2% Completed | 2/94 [00:01<00:47, 1.93it/s]
(Worker pid=646) (Worker_PP0_TP3 pid=646) WARNING 03-24 00:43:53 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect
Loading safetensors checkpoint shards: 3% Completed | 3/94 [00:02<01:29, 1.02it/s]
Loading safetensors checkpoint shards: 4% Completed | 4/94 [00:04<01:52, 1.25s/it]
Loading safetensors checkpoint shards: 5% Completed | 5/94 [00:05<02:05, 1.41s/it]
Loading safetensors checkpoint shards: 6% Completed | 6/94 [00:06<01:40, 1.15s/it]
Loading safetensors checkpoint shards: 7% Completed | 7/94 [00:08<01:51, 1.28s/it]
Loading safetensors checkpoint shards: 9% Completed | 8/94 [00:09<02:01, 1.42s/it]
Loading safetensors checkpoint shards: 10% Completed | 9/94 [00:11<02:08, 1.51s/it]
Loading safetensors checkpoint shards: 11% Completed | 10/94 [00:12<01:44, 1.24s/it]
Loading safetensors checkpoint shards: 12% Completed | 11/94 [00:13<01:51, 1.34s/it]
Loading safetensors checkpoint shards: 13% Completed | 12/94 [00:15<01:59, 1.46s/it]
Loading safetensors checkpoint shards: 14% Completed | 13/94 [00:17<02:04, 1.54s/it]
Loading safetensors checkpoint shards: 15% Completed | 14/94 [00:17<01:40, 1.26s/it]
Loading safetensors checkpoint shards: 16% Completed | 15/94 [00:19<01:47, 1.36s/it]
Loading safetensors checkpoint shards: 17% Completed | 16/94 [00:21<01:54, 1.47s/it]
Loading safetensors checkpoint shards: 18% Completed | 17/94 [00:21<01:33, 1.22s/it]
Loading safetensors checkpoint shards: 19% Completed | 18/94 [00:23<01:40, 1.32s/it]
Loading safetensors checkpoint shards: 20% Completed | 19/94 [00:25<01:47, 1.44s/it]
Loading safetensors checkpoint shards: 21% Completed | 20/94 [00:26<01:52, 1.52s/it]
Loading safetensors checkpoint shards: 22% Completed | 21/94 [00:27<01:31, 1.25s/it]
Loading safetensors checkpoint shards: 23% Completed | 22/94 [00:27<01:14, 1.03s/it]
Loading safetensors checkpoint shards: 24% Completed | 23/94 [00:28<01:02, 1.14it/s]
Loading safetensors checkpoint shards: 26% Completed | 24/94 [00:28<00:53, 1.30it/s]
Loading safetensors checkpoint shards: 27% Completed | 25/94 [00:30<01:09, 1.01s/it]
Loading safetensors checkpoint shards: 28% Completed | 26/94 [00:31<01:01, 1.11it/s]
Loading safetensors checkpoint shards: 29% Completed | 27/94 [00:32<01:13, 1.10s/it]
Loading safetensors checkpoint shards: 30% Completed | 28/94 [00:34<01:24, 1.28s/it]
Loading safetensors checkpoint shards: 31% Completed | 29/94 [00:36<01:30, 1.40s/it]
Loading safetensors checkpoint shards: 32% Completed | 30/94 [00:36<01:14, 1.17s/it]
Loading safetensors checkpoint shards: 33% Completed | 31/94 [00:38<01:20, 1.29s/it]
Loading safetensors checkpoint shards: 34% Completed | 32/94 [00:38<01:07, 1.09s/it]
Loading safetensors checkpoint shards: 35% Completed | 33/94 [00:39<00:56, 1.09it/s]
Loading safetensors checkpoint shards: 36% Completed | 34/94 [00:40<01:06, 1.11s/it]
Loading safetensors checkpoint shards: 37% Completed | 35/94 [00:42<01:15, 1.28s/it]
Loading safetensors checkpoint shards: 38% Completed | 36/94 [00:43<01:02, 1.08s/it]
Loading safetensors checkpoint shards: 39% Completed | 37/94 [00:44<01:09, 1.23s/it]
Loading safetensors checkpoint shards: 40% Completed | 38/94 [00:46<01:16, 1.37s/it]
Loading safetensors checkpoint shards: 41% Completed | 39/94 [00:47<01:03, 1.15s/it]
Loading safetensors checkpoint shards: 43% Completed | 40/94 [00:48<01:08, 1.28s/it]
Loading safetensors checkpoint shards: 44% Completed | 41/94 [00:50<01:14, 1.41s/it]
Loading safetensors checkpoint shards: 45% Completed | 42/94 [00:52<01:18, 1.50s/it]
Loading safetensors checkpoint shards: 46% Completed | 43/94 [00:53<01:20, 1.57s/it]
Loading safetensors checkpoint shards: 47% Completed | 44/94 [00:54<01:04, 1.29s/it]
Loading safetensors checkpoint shards: 48% Completed | 45/94 [00:56<01:07, 1.37s/it]
Loading safetensors checkpoint shards: 49% Completed | 46/94 [00:57<01:10, 1.48s/it]
Loading safetensors checkpoint shards: 50% Completed | 47/94 [00:59<01:13, 1.55s/it]
Loading safetensors checkpoint shards: 51% Completed | 48/94 [01:00<00:58, 1.28s/it]
Loading safetensors checkpoint shards: 52% Completed | 49/94 [01:00<00:47, 1.05s/it]
Loading safetensors checkpoint shards: 53% Completed | 50/94 [01:01<00:39, 1.12it/s]
Loading safetensors checkpoint shards: 54% Completed | 51/94 [01:01<00:33, 1.28it/s]
Loading safetensors checkpoint shards: 55% Completed | 52/94 [01:02<00:29, 1.42it/s]
Loading safetensors checkpoint shards: 56% Completed | 53/94 [01:02<00:26, 1.54it/s]
Loading safetensors checkpoint shards: 57% Completed | 54/94 [01:03<00:24, 1.63it/s]
Loading safetensors checkpoint shards: 59% Completed | 55/94 [01:03<00:22, 1.71it/s]
Loading safetensors checkpoint shards: 60% Completed | 56/94 [01:04<00:21, 1.76it/s]
Loading safetensors checkpoint shards: 61% Completed | 57/94 [01:04<00:20, 1.81it/s]
Loading safetensors checkpoint shards: 62% Completed | 58/94 [01:05<00:19, 1.84it/s]
Loading safetensors checkpoint shards: 63% Completed | 59/94 [01:05<00:18, 1.85it/s]
Loading safetensors checkpoint shards: 64% Completed | 60/94 [01:06<00:18, 1.87it/s]
Loading safetensors checkpoint shards: 65% Completed | 61/94 [01:07<00:17, 1.88it/s]
Loading safetensors checkpoint shards: 66% Completed | 62/94 [01:08<00:27, 1.18it/s]
Loading safetensors checkpoint shards: 67% Completed | 63/94 [01:09<00:29, 1.07it/s]
Loading safetensors checkpoint shards: 68% Completed | 64/94 [01:11<00:34, 1.15s/it]
Loading safetensors checkpoint shards: 69% Completed | 65/94 [01:12<00:33, 1.17s/it]
Loading safetensors checkpoint shards: 70% Completed | 66/94 [01:14<00:36, 1.31s/it]
Loading safetensors checkpoint shards: 71% Completed | 67/94 [01:15<00:34, 1.28s/it]
Loading safetensors checkpoint shards: 72% Completed | 68/94 [01:17<00:36, 1.39s/it]
Loading safetensors checkpoint shards: 73% Completed | 69/94 [01:18<00:33, 1.33s/it]
Loading safetensors checkpoint shards: 74% Completed | 70/94 [01:19<00:34, 1.42s/it]
Loading safetensors checkpoint shards: 76% Completed | 71/94 [01:20<00:27, 1.19s/it]
Loading safetensors checkpoint shards: 77% Completed | 72/94 [01:21<00:21, 1.01it/s]
Loading safetensors checkpoint shards: 78% Completed | 73/94 [01:22<00:21, 1.01s/it]
Loading safetensors checkpoint shards: 79% Completed | 74/94 [01:23<00:23, 1.20s/it]
Loading safetensors checkpoint shards: 80% Completed | 75/94 [01:24<00:22, 1.19s/it]
Loading safetensors checkpoint shards: 81% Completed | 76/94 [01:26<00:20, 1.16s/it]
Loading safetensors checkpoint shards: 82% Completed | 77/94 [01:27<00:19, 1.15s/it]
Loading safetensors checkpoint shards: 83% Completed | 78/94 [01:28<00:18, 1.14s/it]
Loading safetensors checkpoint shards: 84% Completed | 79/94 [01:29<00:19, 1.30s/it]
Loading safetensors checkpoint shards: 85% Completed | 80/94 [01:31<00:17, 1.27s/it]
Loading safetensors checkpoint shards: 86% Completed | 81/94 [01:32<00:18, 1.40s/it]
Loading safetensors checkpoint shards: 87% Completed | 82/94 [01:34<00:16, 1.34s/it]
Loading safetensors checkpoint shards: 88% Completed | 83/94 [01:35<00:15, 1.44s/it]
Loading safetensors checkpoint shards: 89% Completed | 84/94 [01:36<00:13, 1.38s/it]
Loading safetensors checkpoint shards: 90% Completed | 85/94 [01:37<00:10, 1.14s/it]
Loading safetensors checkpoint shards: 91% Completed | 86/94 [01:38<00:07, 1.05it/s]
Loading safetensors checkpoint shards: 93% Completed | 87/94 [01:38<00:05, 1.21it/s]
Loading safetensors checkpoint shards: 94% Completed | 88/94 [01:39<00:04, 1.35it/s]
Loading safetensors checkpoint shards: 95% Completed | 89/94 [01:39<00:03, 1.48it/s]
Loading safetensors checkpoint shards: 96% Completed | 90/94 [01:40<00:02, 1.58it/s]
Loading safetensors checkpoint shards: 97% Completed | 91/94 [01:40<00:02, 1.47it/s]
Loading safetensors checkpoint shards: 98% Completed | 92/94 [01:41<00:01, 1.45it/s]
Loading safetensors checkpoint shards: 100% Completed | 94/94 [01:42<00:00, 2.08it/s]
Loading safetensors checkpoint shards: 100% Completed | 94/94 [01:42<00:00, 1.09s/it]
(Worker pid=604) (Worker_PP0_TP0 pid=604)
(Worker pid=604) (Worker_PP0_TP0 pid=604) INFO 03-24 00:45:34 [default_loader.py:293] Loading weights took 102.19 seconds
(Worker pid=604) (Worker_PP0_TP0 pid=604) INFO 03-24 00:45:35 [gpu_model_runner.py:4285] Model loading took 25.99 GiB memory and 111.068924 seconds
(Worker pid=666) (Worker_PP1_TP0 pid=666) INFO 03-24 00:45:45 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache.
(Worker pid=626) (Worker_PP0_TP2 pid=626) INFO 03-24 00:45:45 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache.
(Worker pid=706) (Worker_PP1_TP2 pid=706) INFO 03-24 00:45:45 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache.
(Worker pid=604) (Worker_PP0_TP0 pid=604) INFO 03-24 00:45:45 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache.
(Worker pid=686) (Worker_PP1_TP1 pid=686) INFO 03-24 00:45:45 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache.
(Worker pid=726) (Worker_PP1_TP3 pid=726) INFO 03-24 00:45:45 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache.
(Worker pid=646) (Worker_PP0_TP3 pid=646) INFO 03-24 00:45:45 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache.
(Worker pid=610) (Worker_PP0_TP1 pid=610) INFO 03-24 00:45:45 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache.
(Worker pid=604) (Worker_PP0_TP0 pid=604) WARNING 03-24 00:45:48 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=256,device_name=Arcturus_GL-XL_[Instinct_MI100],dtype=int4_w4a16.json
(Worker pid=604) (Worker_PP0_TP0 pid=604) INFO 03-24 00:45:52 [gpu_worker.py:423] Available KV cache memory: 2.32 GiB
(EngineCore_DP0 pid=502) INFO 03-24 00:45:57 [kv_cache_utils.py:1314] GPU KV cache size: 25,344 tokens
(EngineCore_DP0 pid=502) INFO 03-24 00:45:57 [kv_cache_utils.py:1319] Maximum concurrency for 256,000 tokens per request: 0.39x
(EngineCore_DP0 pid=502) INFO 03-24 00:46:00 [core.py:282] init engine (profile, create kv cache, warmup model) took 15.46 seconds
(EngineCore_DP0 pid=502) INFO 03-24 00:46:12 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=502) WARNING 03-24 00:46:12 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore_DP0 pid=502) WARNING 03-24 00:46:12 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=502) INFO 03-24 00:46:12 [vllm.py:930] Cudagraph is disabled under eager mode
(APIServer pid=13) INFO 03-24 00:46:12 [api_server.py:495] Supported tasks: ['generate']
(APIServer pid=13) INFO 03-24 00:46:13 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=13) WARNING 03-24 00:46:13 [model.py:1354] Default vLLM sampling parameters have been overridden by the model's generation_config.json: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}. If this is not intended, please relaunch vLLM instance with --generation-config vllm.
(APIServer pid=13) INFO 03-24 00:46:13 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=13) INFO 03-24 00:46:13 [serving.py:185] Warming up chat template processing...
(APIServer pid=13) INFO 03-24 00:46:16 [hf.py:318] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(APIServer pid=13) INFO 03-24 00:46:16 [serving.py:210] Chat template warmup completed in 2870.8ms
(APIServer pid=13) INFO 03-24 00:46:16 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=13) INFO 03-24 00:46:16 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:38] Available routes are:
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=13) INFO 03-24 00:46:16 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=13) INFO: Started server process [13]
(APIServer pid=13) INFO: Waiting for application startup.
(APIServer pid=13) INFO: Application startup complete.
(APIServer pid=13) INFO: 172.17.0.2:48036 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=13) INFO: 172.17.0.2:48048 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=502) INFO 03-24 00:47:53 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker pid=604) (Worker_PP0_TP0 pid=604) /usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py:659: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /app/pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
(Worker pid=604) (Worker_PP0_TP0 pid=604) object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
[rank0]:[W324 00:48:00.134156939 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
[rank4]:[W324 00:48:00.134717897 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
(Worker pid=610) (Worker_PP0_TP1 pid=610) /usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py:659: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /app/pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
(Worker pid=610) (Worker_PP0_TP1 pid=610) object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
[rank1]:[W324 00:48:00.156147939 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
[rank5]:[W324 00:48:00.156511758 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
(Worker pid=626) (Worker_PP0_TP2 pid=626) /usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py:659: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /app/pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
(Worker pid=626) (Worker_PP0_TP2 pid=626) object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
[rank2]:[W324 00:48:00.159927777 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
[rank6]:[W324 00:48:00.160370112 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
(Worker pid=646) (Worker_PP0_TP3 pid=646) /usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py:659: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /app/pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
(Worker pid=646) (Worker_PP0_TP3 pid=646) object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
[rank3]:[W324 00:48:00.371576276 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
[rank7]:[W324 00:48:00.372057229 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
(EngineCore_DP0 pid=502) INFO 03-24 00:48:53 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(APIServer pid=13) INFO 03-24 00:49:06 [loggers.py:259] Engine 000: Avg prompt throughput: 17.9 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.2%, Prefix cache hit rate: 0.0%
(APIServer pid=13) INFO 03-24 00:49:16 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.2%, Prefix cache hit rate: 0.0%
(APIServer pid=13) INFO 03-24 00:49:26 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.2%, Prefix cache hit rate: 0.0%
(APIServer pid=13) INFO 03-24 00:49:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=13) INFO 03-24 00:49:46 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
^C(Worker pid=666) (Worker_PP1_TP0 pid=666) WARNING 03-24 00:51:42 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=666) (Worker_PP1_TP0 pid=666) INFO 03-24 00:51:42 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=706) (Worker_PP1_TP2 pid=706) WARNING 03-24 00:51:42 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=726) (Worker_PP1_TP3 pid=726) WARNING 03-24 00:51:42 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=604) (Worker_PP0_TP0 pid=604) INFO 03-24 00:51:42 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=646) (Worker_PP0_TP3 pid=646) WARNING 03-24 00:51:42 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=686) (Worker_PP1_TP1 pid=686) WARNING 03-24 00:51:42 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=626) (Worker_PP0_TP2 pid=626) WARNING 03-24 00:51:42 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=686) (Worker_PP1_TP1 pid=686) INFO 03-24 00:51:42 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=610) (Worker_PP0_TP1 pid=610) WARNING 03-24 00:51:42 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=726) (Worker_PP1_TP3 pid=726) INFO 03-24 00:51:42 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=706) (Worker_PP1_TP2 pid=706) INFO 03-24 00:51:42 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=646) (Worker_PP0_TP3 pid=646) INFO 03-24 00:51:42 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=626) (Worker_PP0_TP2 pid=626) INFO 03-24 00:51:42 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=610) (Worker_PP0_TP1 pid=610) INFO 03-24 00:51:42 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=604) (Worker_PP0_TP0 pid=604) WARNING 03-24 00:51:42 [multiproc_executor.py:814] WorkerProc was terminated
(APIServer pid=13) INFO 03-24 00:51:42 [launcher.py:122] Shutting down FastAPI HTTP server.
(APIServer pid=13) INFO: Shutting down
(APIServer pid=13) INFO: Waiting for application shutdown.
(APIServer pid=13) INFO: Application shutdown complete.

Do you think it could be a result of casting bf16 to float16?
Could float16 overflow for the larger model?

That all looks normal. Casting to float16 really shouldn't matter.

Can you tell me which quant you are using? Is it Qwen's own GPTQ quant? I notice they used a group size of 128. Models with this group size have given me issues in the past. I'm not entirely sure why, but I think it has to do with the tiling of the GPTQ dequantization kernels. Sometimes the larger group sizes seem to cause issues on MI100s.

I don't have the system memory to quantize the larger model, but if you have enough, I could help you with a quantization container/script. It's a bit of a bear getting these GPUs to behave when quantizing, and a model this size will probably take a few days, but with 256 GB of VRAM it should be possible if you have ~1 TB of system RAM.

Otherwise, I would suggest falling back to llama.cpp for very large models like this. It's going to be easier to set up and get running.
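As a rough sketch of what that could look like with a GGUF conversion of the model (the paths, context size, and flag values here are placeholders):

# spread the GGUF across all 8 GPUs, offloading every layer
llama-server \
  -m /models/<model>.gguf \
  --n-gpu-layers 999 \
  --split-mode layer \
  --ctx-size 32768 \
  --host 0.0.0.0 --port 8000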

Yes, I am using Qwen3.5's default GPTQ quant. The model was downloaded from the Hugging Face Qwen repo. I also noticed that the log says the gptq_gemm kernel is buggy. However, vLLM's gptq_marlin kernel does not support float16.

I considered switching to llama.cpp. Nonetheless, the PP+TP strategy on llama.cpp does not support pipelining, which halves the throughput.
