Working vLLM setup on RTX 5090 — 194-197 tok/s with image/video

#3 by 8055izham

Environment:

Windows 11 + WSL2 Ubuntu 24.04
RTX 5090 32GB, NVIDIA driver 581.57 (CUDA 13.0)
vLLM image: vllm/vllm-openai:cu130-nightly

docker run --gpus all \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/vllm:/root/.cache/vllm \
  vllm/vllm-openai:cu130-nightly \
  Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 \
  --quantization gptq_marlin \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 131072 \
  --max-num-batched-tokens 2096 \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

Notes:
--dtype bfloat16 is required; float16 crashes because this GPTQ release mixes float16 and bfloat16 weight dtypes
--quantization gptq_marlin selects the optimized Marlin kernel
--kv-cache-dtype fp8 is needed to fit the 131K context in 32 GB

Image and video input works out of the box (no extra flags needed)
Generation speed: ~194-197 tok/s sustained
VRAM usage: ~31.3 GB
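
For reference, here is a minimal sketch of an OpenAI-compatible chat request with an image part against this endpoint. The model name and port come from the command above; the image URL is a placeholder, and you would POST the body to http://localhost:8000/v1/chat/completions:

```python
import json

# Build an OpenAI-compatible chat-completions payload with a text part
# and an image_url part (vLLM accepts this multimodal content format).
payload = {
    "model": "Qwen/Qwen3.5-35B-A3B-GPTQ-Int4",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                # Placeholder URL -- replace with a real image.
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample.jpg"}},
            ],
        }
    ],
    "max_tokens": 512,
}
body = json.dumps(payload)
```

Send `body` with Content-Type application/json; the same shape works for video frames via additional content parts.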

Use case note:
Testing this for Clinical Decision Support. Early results are competitive with my existing Qwen3-235B-A22B-Instruct-2507 setup (on a different PC) — impressive for a 35B MoE at this speed. 🫡

Hey @8055izham, why did you set --max-num-batched-tokens 2096?

Does that mean every request can only contain 2096 tokens? Please help.

Hi @hdnh2006 ,

I couldn't get vLLM to start without it due to this error:
Assertion failed, In Mamba cache align mode, block_size (2096) must be <= max_num_batched_tokens (2048).
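
To be clear, --max-num-batched-tokens caps how many tokens the scheduler processes per engine step (chunked prefill splits longer prompts across steps); the per-request limit is --max-model-len. The startup failure is just the align-mode check below — a hypothetical sketch of the constraint, with the numbers from the error message:

```python
# Sketch of the Mamba cache "align" mode constraint from the error above:
# the derived attention block size must fit within the scheduler's
# per-step token budget, otherwise the engine refuses to start.
def check_mamba_align(block_size: int, max_num_batched_tokens: int) -> bool:
    """True if the engine can start under this budget."""
    return block_size <= max_num_batched_tokens

# The default budget of 2048 fails against the derived block size of 2096:
assert not check_mamba_align(2096, 2048)
# Raising --max-num-batched-tokens to 2096 satisfies it:
assert check_mamba_align(2096, 2096)
```

So requests can still be much longer than 2096 tokens; their prefill is simply chunked into at most 2096 tokens per step.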


Thank you @8055izham. On my RTX 5090 I can load the model, but as soon as I send a request I get a CUDA out-of-memory error. I reduced --max-model-len to 8192 and still hit the same error:

vllm serve Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 \
  --quantization gptq_marlin \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --max-num-batched-tokens 2096 \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
(APIServer pid=5323) INFO 03-05 10:39:19 [utils.py:302] 
(APIServer pid=5323) INFO 03-05 10:39:19 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=5323) INFO 03-05 10:39:19 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.16.1rc1.dev206+g097eb544e
(APIServer pid=5323) INFO 03-05 10:39:19 [utils.py:302]   █▄█▀ █     █     █     █  model   Qwen/Qwen3.5-35B-A3B-GPTQ-Int4
(APIServer pid=5323) INFO 03-05 10:39:19 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=5323) INFO 03-05 10:39:19 [utils.py:302] 
(APIServer pid=5323) INFO 03-05 10:39:19 [utils.py:238] non-default args:  {'model_tag': 'Qwen/Qwen3.5-35B-A3B-GPTQ-Int4', 'host': '0.0.0.0', 'model': 'Qwen/Qwen3.5-35B-A3B-GPTQ-Int4', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 8192, 'quantization': 'gptq_marlin', 'reasoning_parser': 'qwen3', 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'max_num_batched_tokens': 2096}
(APIServer pid=5323) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=5323) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=5323) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=5323) INFO 03-05 10:39:20 [model.py:530] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=5323) INFO 03-05 10:39:20 [model.py:1553] Using max model len 8192
(APIServer pid=5323) INFO 03-05 10:39:20 [gptq_marlin.py:229] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
(APIServer pid=5323) INFO 03-05 10:39:20 [cache.py:223] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=5323) INFO 03-05 10:39:20 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2096.
(APIServer pid=5323) WARNING 03-05 10:39:20 [config.py:381] Mamba cache mode is set to 'align' for Qwen3_5MoeForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=5323) INFO 03-05 10:39:20 [config.py:401] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=5323) INFO 03-05 10:39:20 [config.py:544] Setting attention block size to 2096 tokens to ensure that attention page size is >= mamba page size.
Parse safetensors files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:02<00:00,  6.84it/s]
(APIServer pid=5323) INFO 03-05 10:39:24 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=5546) INFO 03-05 10:39:42 [core.py:101] Initializing a V1 LLM engine (v0.16.1rc1.dev206+g097eb544e) with config: model='Qwen/Qwen3.5-35B-A3B-GPTQ-Int4', speculative_config=None, tokenizer='Qwen/Qwen3.5-35B-A3B-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3.5-35B-A3B-GPTQ-Int4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=5546) INFO 03-05 10:39:44 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.2:45553 backend=nccl
(EngineCore_DP0 pid=5546) INFO 03-05 10:39:44 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore_DP0 pid=5546) INFO 03-05 10:39:52 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=5546) INFO 03-05 10:39:53 [gpu_model_runner.py:4255] Starting to load model Qwen/Qwen3.5-35B-A3B-GPTQ-Int4...
(EngineCore_DP0 pid=5546) INFO 03-05 10:39:53 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=5546) INFO 03-05 10:39:53 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=5546) INFO 03-05 10:39:54 [cuda.py:405] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/14 [00:01<00:19,  1.49s/it]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [00:02<00:17,  1.48s/it]
Loading safetensors checkpoint shards:  21% Completed | 3/14 [00:04<00:16,  1.51s/it]
Loading safetensors checkpoint shards:  29% Completed | 4/14 [00:06<00:15,  1.58s/it]
Loading safetensors checkpoint shards:  36% Completed | 5/14 [00:07<00:14,  1.58s/it]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [00:09<00:12,  1.56s/it]
Loading safetensors checkpoint shards:  50% Completed | 7/14 [00:10<00:10,  1.55s/it]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:12<00:09,  1.53s/it]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [00:13<00:06,  1.40s/it]
Loading safetensors checkpoint shards:  71% Completed | 10/14 [00:14<00:05,  1.39s/it]
Loading safetensors checkpoint shards:  79% Completed | 11/14 [00:16<00:04,  1.36s/it]
Loading safetensors checkpoint shards:  86% Completed | 12/14 [00:17<00:02,  1.34s/it]
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:18<00:01,  1.25s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:18<00:00,  1.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:18<00:00,  1.34s/it]
(EngineCore_DP0 pid=5546) 
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:16 [default_loader.py:293] Loading weights took 18.83 seconds
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:19 [gpu_model_runner.py:4338] Model loading took 21.06 GiB memory and 25.522889 seconds
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:19 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:30 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/310f2de252/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:30 [backends.py:976] Dynamo bytecode transform time: 6.27 s
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:31 [backends.py:350] Cache the graph of compile range (1, 2096) for later use
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:33 [backends.py:366] Compiling a graph for compile range (1, 2096) takes 1.63 s
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:33 [monitor.py:35] torch.compile takes 9.16 s in total
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:33 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/3b6fbaf2a1f507f42c79b8c3c18421e58ef604c3df871b46d9d16e82616da06b/rank_0_0/model
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:34 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/3b6fbaf2a1f507f42c79b8c3c18421e58ef604c3df871b46d9d16e82616da06b/rank_0_0/model
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:35 [gpu_worker.py:424] Available KV cache memory: 5.16 GiB
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:35 [kv_cache_utils.py:1314] GPU KV cache size: 134,144 tokens
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:35 [kv_cache_utils.py:1319] Maximum concurrency for 8,192 tokens per request: 25.80x
(EngineCore_DP0 pid=5546) 2026-03-05 10:40:35,461 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=5546) 2026-03-05 10:40:35,504 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████| 51/51 [00:04<00:00, 11.11it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:04<00:00,  8.52it/s]
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:45 [gpu_model_runner.py:5360] Graph capturing finished in 10 secs, took 1.50 GiB
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:45 [core.py:282] init engine (profile, create kv cache, warmup model) took 25.64 seconds
(EngineCore_DP0 pid=5546) INFO 03-05 10:40:45 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=5323) INFO 03-05 10:40:45 [api_server.py:495] Supported tasks: ['generate']
(APIServer pid=5323) WARNING 03-05 10:40:45 [model.py:1354] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=5323) INFO 03-05 10:40:46 [serving.py:185] Warming up chat template processing...
(APIServer pid=5323) INFO 03-05 10:40:49 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=5323) INFO 03-05 10:40:49 [serving.py:210] Chat template warmup completed in 3536.5ms
(APIServer pid=5323) INFO 03-05 10:40:49 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:38] Available routes are:
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=5323) INFO 03-05 10:40:49 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=5323) INFO:     Started server process [5323]
(APIServer pid=5323) INFO:     Waiting for application startup.
(APIServer pid=5323) INFO:     Application startup complete.
(APIServer pid=5323) INFO:     159.26.107.27:30627 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=5323) INFO:     159.26.107.27:29110 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=5546) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (11) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=5546)   return fn(*args, **kwargs)
(EngineCore_DP0 pid=5546) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (11) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=5546)   return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.16.1rc1.dev206+g097eb544e) with config: model='Qwen/Qwen3.5-35B-A3B-GPTQ-Int4', speculative_config=None, tokenizer='Qwen/Qwen3.5-35B-A3B-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3.5-35B-A3B-GPTQ-Int4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/root/.cache/vllm/torch_compile_cache/310f2de252', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': '/root/.cache/vllm/torch_compile_cache/310f2de252/rank_0_0/backbone', 'fast_moe_cold_start': True, 'static_all_moe_layers': ['language_model.model.layers.0.mlp.experts', 'language_model.model.layers.1.mlp.experts', 'language_model.model.layers.2.mlp.experts', 'language_model.model.layers.3.mlp.experts', 'language_model.model.layers.4.mlp.experts', 'language_model.model.layers.5.mlp.experts', 'language_model.model.layers.6.mlp.experts', 'language_model.model.layers.7.mlp.experts', 'language_model.model.layers.8.mlp.experts', 'language_model.model.layers.9.mlp.experts', 'language_model.model.layers.10.mlp.experts', 'language_model.model.layers.11.mlp.experts', 'language_model.model.layers.12.mlp.experts', 'language_model.model.layers.13.mlp.experts', 'language_model.model.layers.14.mlp.experts', 
'language_model.model.layers.15.mlp.experts', 'language_model.model.layers.16.mlp.experts', 'language_model.model.layers.17.mlp.experts', 'language_model.model.layers.18.mlp.experts', 'language_model.model.layers.19.mlp.experts', 'language_model.model.layers.20.mlp.experts', 'language_model.model.layers.21.mlp.experts', 'language_model.model.layers.22.mlp.experts', 'language_model.model.layers.23.mlp.experts', 'language_model.model.layers.24.mlp.experts', 'language_model.model.layers.25.mlp.experts', 'language_model.model.layers.26.mlp.experts', 'language_model.model.layers.27.mlp.experts', 'language_model.model.layers.28.mlp.experts', 'language_model.model.layers.29.mlp.experts', 'language_model.model.layers.30.mlp.experts', 'language_model.model.layers.31.mlp.experts', 'language_model.model.layers.32.mlp.experts', 'language_model.model.layers.33.mlp.experts', 'language_model.model.layers.34.mlp.experts', 'language_model.model.layers.35.mlp.experts', 'language_model.model.layers.36.mlp.experts', 'language_model.model.layers.37.mlp.experts', 'language_model.model.layers.38.mlp.experts', 'language_model.model.layers.39.mlp.experts']}, 
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-95082e66b8f33fc1-82a0be60,prompt_token_ids_len=11,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=8181, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={chatcmpl-95082e66b8f33fc1-82a0be60: 11}, total_num_scheduled_tokens=11, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 1], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.015564202334630295, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=11, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     self._process_engine_step()
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 501, in step_with_batch_queue
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     exec_model_fut.result()
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self.__get_result()
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     raise self._exception
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return func(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 365, in execute_model
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self.worker.execute_model(scheduler_output)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return func(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 720, in execute_model
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     output = self.model_runner.execute_model(
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return func(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3613, in execute_model
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     model_output = self._model_forward(
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]                    ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3126, in _model_forward
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self.model(
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 223, in __call__
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 738, in forward
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     hidden_states = self.language_model.model(
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 402, in __call__
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self.fn(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1132, in forward
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     def forward(
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 198, in __call__
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     raise e
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "<eval_with_key>.83", line 256, in forward
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     submod_1 = self.submod_1(getitem, s59, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     raise e
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "<eval_with_key>.85", line 5, in forward
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     gdn_attention_core = torch.ops.vllm.gdn_attention_core(mixed_qkv, b_1, a_1, core_attn_out, 'language_model.model.layers.0.linear_attn');  mixed_qkv = b_1 = a_1 = core_attn_out = gdn_attention_core = None
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1209, in __call__
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1451, in gdn_attention_core
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     self._forward_core(
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 780, in _forward_core
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     ) = self.chunk_gated_delta_rule(
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py", line 129, in forward
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self._forward_method(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 200, in forward_native
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return fla_chunk_gated_delta_rule(
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return fn(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/chunk.py", line 207, in chunk_gated_delta_rule
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     o, final_state = ChunkGatedDeltaRuleFunction.apply(
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 583, in apply
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return super().apply(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py", line 113, in wrapper
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/torch/amp/autocast_mode.py", line 477, in decorate_fwd
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return fwd(*args, **kwargs)  # pyrefly: ignore [not-callable]
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/chunk.py", line 94, in forward
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     g, o, A, final_state, w, h, v_new = chunk_gated_delta_rule_fwd(
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/chunk.py", line 40, in chunk_gated_delta_rule_fwd
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     A = solve_tril(A=A, cu_seqlens=cu_seqlens, output_dtype=k.dtype)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py", line 113, in wrapper
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/solve_tril.py", line 545, in solve_tril
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     merge_fn[NT, B * H](
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 370, in <lambda>
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 459, in run
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self.fn.run(*args, **kwargs)
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 240, in run
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     benchmark()
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 229, in benchmark
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 164, in _bench
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/triton/testing.py", line 149, in do_bench
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     fn()
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 150, in kernel_call
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     self.fn.run(
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 744, in run
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/driver.py", line 713, in __call__
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102]     self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, self.launch_pdl,
(EngineCore_DP0 pid=5546) ERROR 03-05 10:41:09 [core.py:1102] RuntimeError: Triton Error [CUDA]: out of memory
(APIServer pid=5323) ERROR 03-05 10:41:09 [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=5323) ERROR 03-05 10:41:09 [async_llm.py:708] Traceback (most recent call last):
(APIServer pid=5323) ERROR 03-05 10:41:09 [async_llm.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
(APIServer pid=5323) ERROR 03-05 10:41:09 [async_llm.py:708]     outputs = await engine_core.get_output_async()
(APIServer pid=5323) ERROR 03-05 10:41:09 [async_llm.py:708]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=5323) ERROR 03-05 10:41:09 [async_llm.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
(APIServer pid=5323) ERROR 03-05 10:41:09 [async_llm.py:708]     raise self._format_exception(outputs) from None
(APIServer pid=5323) ERROR 03-05 10:41:09 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390] Error in chat completion stream generator.
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390] Traceback (most recent call last):
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 714, in chat_completion_stream_generator
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390]     async for res in result_generator:
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 583, in generate
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390]     out = q.get_nowait() or await q.get()
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390]                             ^^^^^^^^^^^^^
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 85, in get
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390]     raise output
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390]     outputs = await engine_core.get_output_async()
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390]     raise self._format_exception(outputs) from None
(APIServer pid=5323) ERROR 03-05 10:41:09 [serving.py:1390] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
[rank0]:[W305 10:41:09.102318031 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=5323) INFO:     159.26.107.27:53360 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=5323) INFO:     159.26.107.27:24749 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=5323) INFO:     159.26.107.27:45212 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=5323) INFO:     159.26.107.27:12680 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=5323) INFO:     Shutting down
(APIServer pid=5323) INFO:     Waiting for application shutdown.
(APIServer pid=5323) INFO:     Application shutdown complete.
(APIServer pid=5323) INFO:     Finished server process [5323]

Running 2x RTX 3090 here — reducing --gpu-memory-utilization to 0.8-0.85 fixed the OOM for me.
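
For reference, the suggestion above amounts to taking the failing invocation and lowering only the memory-utilization flag. This is a sketch, not a tuned config — 0.85 is a starting point, and you may need to go as low as 0.8 depending on how much VRAM the Triton autotuner and CUDA graphs grab at runtime:

```shell
# Same command as before, but leave ~15% of VRAM free for runtime
# allocations (Triton autotuning, activation buffers) instead of 10%.
vllm serve Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 \
  --quantization gptq_marlin \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --max-num-batched-tokens 2096 \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
```

The key point is that the OOM in the traceback happens inside a Triton kernel launch during a request, not at model load, so the weights and KV cache fitting at startup isn't enough — some headroom has to be left for kernel autotuning and intermediate buffers.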

I really would not try to use an A3B model for clinical decision support, IN PRODUCTION, if I were you. These tiny distilled models seem to parrot their teacher models without much understanding, or rather with serious misunderstandings. That's a serious liability in clinical decision support. My understanding is that nuance comes from the model width and capability from the model depth, so it's not SUPER straightforward, but generally I'd say you need 100B+ for that, maybe 200B+, for any real comprehension of nuance. This might be fine for fast testing, though.
