It is fast, but after running for a while it easily hits an error that causes vLLM to stop.

#9
by james0010 - opened

Any idea about the following issue?

/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [12,1,0], thread: [185,0,0] Assertion ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [12,1,0], thread: [186,0,0] Assertion ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [12,1,0], thread: [187,0,0] Assertion ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [12,1,0], thread: [188,0,0] Assertion ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [12,1,0], thread: [189,0,0] Assertion ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [12,1,0], thread: [190,0,0] Assertion ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [12,1,0], thread: [191,0,0] Assertion ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds" failed.
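For context, these assert lines are PyTorch's generic bounds check in its CUDA gather kernels: some index tensor handed to a gather/index op contains a value outside the indexed dimension. The same class of failure can be reproduced in isolation (a minimal sketch, unrelated to this model; on CPU the bounds check surfaces as a catchable RuntimeError instead of a context-poisoning device-side assert):

```python
import torch

# An index value >= the indexed dimension's size violates the
# `ind >= 0 && ind < ind_dim_size` check seen in the log above.
# On CUDA this fires a device-side assert and kills the context
# (hence the engine dying); on CPU the same check raises RuntimeError.
src = torch.arange(10)
idx = torch.tensor([3, 99])  # 99 is out of bounds for size 10

try:
    torch.gather(src, 0, idx)
except RuntimeError as e:
    print("out-of-bounds gather:", e)
```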
(EngineCore pid=152) ERROR 05-07 04:51:50 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.20.2rc1.dev93+g26c6e0ace.d20260507) with config: model='Qwen/Qwen3.6-27B-FP8', speculative_config=SpeculativeConfig(method='dflash', model='z-lab/Qwen3.6-27B-DFlash', num_spec_tokens=6), tokenizer='Qwen/Qwen3.6-27B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=fp8, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3.6-27B-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/root/.cache/vllm/torch_compile_cache/23d22236d4', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8', '+quant_fp8', '+quant_fp8', '+quant_fp8', '+quant_fp8', '+quant_fp8', '+quant_fp8'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 
'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [7, 14, 21, 28, 35, 42, 49, 56, 70, 77, 84, 91, 98, 105, 112, 126, 133, 140, 147, 154, 161, 168, 182, 189, 196, 203, 210, 217, 224, 238, 245, 252, 259, 273, 294, 308, 322, 336, 357, 371, 385, 406, 420, 434, 448, 469, 483, 497], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 497, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': '/root/.cache/vllm/torch_compile_cache/23d22236d4/rank_0_0/eagle_head', 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), 
enable_flashinfer_autotune=False, moe_backend='auto'),
(EngineCore pid=152) ERROR 05-07 04:51:50 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-5e392e53-1efe-487d-bafb-b718601e27f4-9b8d30bc', 'chatcmpl-6d535732-1ea4-48a6-a0b7-b572119c6bef-96cf96bc', 'chatcmpl-802e1226-3c08-4f95-b960-fd9dc9dc5055-831efb8d'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={'chatcmpl-802e1226-3c08-4f95-b960-fd9dc9dc5055-831efb8d': 44178},new_block_ids=[None, None, None],num_computed_tokens=[79460, 76601, 44177],num_output_tokens=[5461, 0, 117]), num_scheduled_tokens={chatcmpl-802e1226-3c08-4f95-b960-fd9dc9dc5055-831efb8d: 7, chatcmpl-5e392e53-1efe-487d-bafb-b718601e27f4-9b8d30bc: 7, chatcmpl-6d535732-1ea4-48a6-a0b7-b572119c6bef-96cf96bc: 17}, total_num_scheduled_tokens=31, scheduled_spec_decode_tokens={chatcmpl-802e1226-3c08-4f95-b960-fd9dc9dc5055-831efb8d: [-1, -1, -1, -1, -1, -1], chatcmpl-5e392e53-1efe-487d-bafb-b718601e27f4-9b8d30bc: [-1, -1, -1, -1, -1, -1]}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=null)
(EngineCore pid=152) ERROR 05-07 04:51:50 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=3, num_waiting_reqs=0, num_skipped_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.34732824427480913, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] EngineCore encountered a fatal error.
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] Traceback (most recent call last):
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1129, in run_engine_core
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] engine_core.run_busy_loop()
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1170, in run_busy_loop
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] self._process_engine_step()
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1209, in _process_engine_step
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] outputs, model_executed = self.step_fn()
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 473, in step_with_batch_queue
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] exec_future = self.model_executor.execute_model(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 114, in execute_model
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] output.result()
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self.__get_result()
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] raise self._exception
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 84, in collective_rpc
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return func(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 337, in execute_model
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self.worker.execute_model(scheduler_output)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return func(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 841, in execute_model
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] output = self.model_runner.execute_model(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return func(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4114, in execute_model
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] model_output = self._model_forward(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3587, in _model_forward
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self.model(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self.runnable(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self._call_impl(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return forward_call(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 695, in forward
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] hidden_states = self.language_model.model(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 520, in __call__
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self.fn(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 495, in forward
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] def forward(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 217, in __call__
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self.optimized_call(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "<string>", line 838, in execution_fn
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "<string>", line 188, in __vllm_inlined_submods__87
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self._op(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 40, in wrapper
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return func(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/attention.py", line 723, in unified_attention_with_output
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] self.impl.forward(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 809, in forward
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] flash_attn_varlen_func(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 300, in flash_attn_varlen_func
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] out, softmax_lse = torch.ops._vllm_fa2_C.varlen_fwd(
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] return self._op(*args, **kwargs)
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] torch.AcceleratorError: CUDA error: device-side assert triggered
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=152) ERROR 05-07 04:51:50 [core.py:1138]
(EngineCore pid=152) Process EngineCore:
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] AsyncLLM output_handler failed.
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] Traceback (most recent call last):
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 660, in output_handler
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] outputs = await engine_core.get_output_async()
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 998, in get_output_async
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] raise self._format_exception(outputs) from None
(APIServer pid=77) ERROR 05-07 04:51:50 [async_llm.py:704] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] Error in chat completion stream generator.
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] Traceback (most recent call last):
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 519, in chat_completion_stream_generator
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] async for res in result_generator:
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 579, in generate
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] out = q.get_nowait() or await q.get()
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] ^^^^^^^^^^^^^
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 85, in get
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] raise output
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 660, in output_handler
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] outputs = await engine_core.get_output_async()
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 998, in get_output_async
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] raise self._format_exception(outputs) from None
(APIServer pid=77) ERROR 05-07 04:51:50 [serving.py:1143] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore pid=152) Traceback (most recent call last):
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1140, in run_engine_core
(EngineCore pid=152) raise e
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1129, in run_engine_core
(EngineCore pid=152) engine_core.run_busy_loop()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1170, in run_busy_loop
(EngineCore pid=152) self._process_engine_step()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1209, in _process_engine_step
(EngineCore pid=152) outputs, model_executed = self.step_fn()
(EngineCore pid=152) ^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 473, in step_with_batch_queue
(EngineCore pid=152) exec_future = self.model_executor.execute_model(
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 114, in execute_model
(EngineCore pid=152) output.result()
(EngineCore pid=152) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=152) return self.__get_result()
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=152) raise self._exception
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 84, in collective_rpc
(EngineCore pid=152) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=152) return func(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 337, in execute_model
(EngineCore pid=152) return self.worker.execute_model(scheduler_output)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=152) return func(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 841, in execute_model
(EngineCore pid=152) output = self.model_runner.execute_model(
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=152) return func(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4114, in execute_model
(EngineCore pid=152) model_output = self._model_forward(
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3587, in _model_forward
(EngineCore pid=152) return self.model(
(EngineCore pid=152) ^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(EngineCore pid=152) return self.runnable(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=152) return self._call_impl(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=152) return forward_call(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 695, in forward
(EngineCore pid=152) hidden_states = self.language_model.model(
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 520, in __call__
(EngineCore pid=152) return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
(EngineCore pid=152) return self.fn(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 495, in forward
(EngineCore pid=152) def forward(
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 217, in __call__
(EngineCore pid=152) return self.optimized_call(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "<string>", line 838, in execution_fn
(EngineCore pid=152) File "<string>", line 188, in __vllm_inlined_submods__87
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=152) return self._op(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 40, in wrapper
(EngineCore pid=152) return func(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/attention.py", line 723, in unified_attention_with_output
(EngineCore pid=152) self.impl.forward(
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 809, in forward
(EngineCore pid=152) flash_attn_varlen_func(
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 300, in flash_attn_varlen_func
(EngineCore pid=152) out, softmax_lse = torch.ops._vllm_fa2_C.varlen_fwd(
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=152) return self._op(*args, **kwargs)
(EngineCore pid=152) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=152) torch.AcceleratorError: CUDA error: device-side assert triggered
(EngineCore pid=152) Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore pid=152) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=152) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=152) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore pid=152)
(EngineCore pid=152)
(EngineCore pid=152) During handling of the above exception, another exception occurred:
(EngineCore pid=152)
(EngineCore pid=152) Traceback (most recent call last):
(EngineCore pid=152) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=152) self.run()
(EngineCore pid=152) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=152) self._target(*self._args, **self._kwargs)
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1147, in run_engine_core
(EngineCore pid=152) engine_core.shutdown()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 574, in shutdown
(EngineCore pid=152) self.model_executor.shutdown()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 137, in shutdown
(EngineCore pid=152) worker.shutdown()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 212, in shutdown
(EngineCore pid=152) self.worker.shutdown()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 1049, in shutdown
(EngineCore pid=152) model_runner.shutdown()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5992, in shutdown
(EngineCore pid=152) self._cleanup_profiling_kv_cache()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6000, in _cleanup_profiling_kv_cache
(EngineCore pid=152) torch.accelerator.synchronize()
(EngineCore pid=152) File "/usr/local/lib/python3.12/dist-packages/torch/accelerator/__init__.py", line 263, in synchronize
(EngineCore pid=152) torch._C._accelerator_synchronizeDevice(device_index)
(EngineCore pid=152) torch.AcceleratorError: CUDA error: device-side assert triggered
(EngineCore pid=152) Search for `cudaErrorAssert` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore pid=152) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=152) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=152) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=152)
(APIServer pid=77) INFO: 172.20.0.3:51526 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=77) INFO: 172.20.0.3:51540 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=77) INFO: 172.20.0.3:51544 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=77) INFO: Shutting down
(APIServer pid=77) INFO: Waiting for application shutdown.
(APIServer pid=77) INFO: Application shutdown complete.
(APIServer pid=77) INFO: Finished server process [77]

I suspect that sending multiple queries simultaneously easily triggers this error.
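To test that hypothesis, here is a minimal load sketch that fires several identical chat-completion requests at the server at once and collects the HTTP status codes (the 500s in the log above would show up here). The URL, model name, and payload are assumptions; adjust them to match your deployment.

```python
# Concurrency repro sketch (hypothetical endpoint/payload; adjust to your setup).
import json
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def post_chat(url: str, payload: dict) -> int:
    """Send one chat-completion request; return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=300) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # a 500 here matches the errors in the log above
    except urllib.error.URLError:
        return -1  # server unreachable (e.g. engine already crashed)


def fire_concurrent(url: str, payload: dict, n: int = 8) -> list[int]:
    """Issue n identical requests concurrently and collect status codes."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda _: post_chat(url, payload), range(n)))


# Example against an assumed local vLLM server:
#   body = {
#       "model": "Qwen/Qwen3.6-27B-FP8",
#       "messages": [{"role": "user", "content": "Hello"}],
#       "max_tokens": 64,
#   }
#   fire_concurrent("http://localhost:8000/v1/chat/completions", body, n=8)
```

If the assert only fires when `n > 1`, that would support the concurrency theory; a single-request loop completing cleanly while the concurrent run crashes is the signal to report upstream.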
