Serving via vLLM or SGLang

#1
by BW-Projects - opened

Hey, thanks for the quant!

Did you get it working properly though?

Via vLLM with the vllm/vllm-openai:qwen3_5 image, it only works with --enforce-eager; otherwise it crashes at startup despite having enough VRAM. SGLang also just crashes at startup.
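For reference, a minimal sketch of the --enforce-eager workaround described above. The image tag is the one mentioned in this thread; the model id is a placeholder, and the port/memory flags are assumptions, not the exact command used here.

```shell
# Sketch: --enforce-eager skips CUDA graph capture, which avoids the
# startup crash at the cost of some throughput.
# <quantized-model-id> is a placeholder for the model in this repo.
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:qwen3_5 \
  --model <quantized-model-id> \
  --enforce-eager
```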

Other than that, it seems very slow compared to models such as GLM-4.5-Air, and there is also a lot of endless repetition and thinking.

Thanks!

cyankiwi org

Thank you for using the model! It has been tested in several environments before being published.

Could you share the error logs? I may be able to help with the vLLM and SGLang errors. I'm not sure about vLLM, but SGLang likely crashes because of a different model-initialization approach, which leads to a name/parameter mismatch.

In regards to slow speed, I would recommend using Multi-Token Prediction (MTP) to increase speed.
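A hedged sketch of what enabling MTP can look like in vLLM: the --speculative-config flag exists in recent vLLM releases, but the "method" value used below is an assumption for this model family; check `vllm serve --help` for your build.

```shell
# Speculative decoding via the model's MTP head. The "mtp" method name
# is an assumption (vLLM also accepts values like "ngram" and
# model-specific methods); <quantized-model-id> is a placeholder.
vllm serve <quantized-model-id> \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```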

And yes, I agree the model is somewhat verbose. Which use cases/languages do you use the model for? If it's not English, perhaps I can add more calibration data in that language to improve model quality.

Thanks for the quant!

I'm also running into this issue and need --enforce-eager, although I wonder if the CUDA graphs are just much larger with this model. Maybe because of the hybrid attention?

Via vLLM with the vllm/vllm-openai:qwen3_5 image, it only works with --enforce-eager; otherwise it crashes at startup despite having enough VRAM. SGLang also just crashes at startup.

Could the issue be that vLLM expects AWQ while this checkpoint is compressed-tensors instead?

"config_groups": {
"group_0": {
"format": "pack-quantized",
"input_activations": null,
"output_activations": null,
"targets": [
"Linear"
],
"weights": {
"actorder": null,
"block_structure": null,
"dynamic": false,
"group_size": 32,
"num_bits": 4,
"observer": "mse",
"observer_kwargs": {},
"scale_dtype": null,
"strategy": "group",
"symmetric": true,
"type": "int",
"zp_dtype": null
"kv_cache_scheme": null,
"quant_method": "compressed-tensors",
"quantization_status": "compressed",
"sparsity_config": {},
"transform_config": {},
"version": "0.13.1.a20260223"

Is it possible that when it loads, the weights get unpacked and decompressed to full BF16?
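One way to check which loader path is taken: vLLM dispatches on the quant_method field, so a compressed-tensors checkpoint does not go through the AWQ loader. A small sketch, assuming the snippet above sits under the standard quantization_config key of config.json (a stand-in file is created here for illustration):

```shell
# Minimal stand-in config.json with the field from the thread:
printf '{"quantization_config": {"quant_method": "compressed-tensors"}}' > config.json

# Extract the field vLLM dispatches on:
python3 -c 'import json; print(json.load(open("config.json"))["quantization_config"]["quant_method"])'
# prints: compressed-tensors
```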

I found an issue: when the output contains both Chinese and English, the model adds a space between them, even when instructed not to. It outputs extra spaces and believes it hasn't added any.
