Serving via vLLM or SGLang

#1
by BW-Projects - opened

Hey, thanks for the quant!

Did you get it working properly though?

Via vLLM with the vllm/vllm-openai:qwen3_5 image, it only works with --enforce-eager; otherwise it crashes at startup despite having enough VRAM. SGLang also just crashes at startup.
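For reference, a minimal sketch of the --enforce-eager workaround described above. The image tag is the one mentioned in this thread; the model id is a placeholder, and the port/memory flags are assumptions, not the exact command used here.

```shell
# Sketch: --enforce-eager skips CUDA graph capture, which avoids the
# startup crash at the cost of some throughput.
# <quantized-model-id> is a placeholder for the model in this repo.
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:qwen3_5 \
  --model <quantized-model-id> \
  --enforce-eager
```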

Other than that, it seems very slow compared to models such as GLM-4.5-Air, and there is also a lot of endless repetition and thinking.

Thanks!

cyankiwi org

Thank you for using the model! It has been tested in several environments before being published.

Could you share the error logs? I may be able to help with the vLLM and SGLang errors. I'm not sure about vLLM, but SGLang likely crashes because of a different model-initialization approach, which leads to a name/parameter mismatch.

In regards to slow speed, I would recommend using Multi-Token Prediction (MTP) to increase speed.
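A hedged sketch of what enabling MTP can look like in vLLM: the --speculative-config flag exists in recent vLLM releases, but the "method" value used below is an assumption for this model family; check `vllm serve --help` for your build.

```shell
# Speculative decoding via the model's MTP head. The "mtp" method name
# is an assumption (vLLM also accepts values like "ngram" and
# model-specific methods); <quantized-model-id> is a placeholder.
vllm serve <quantized-model-id> \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```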

And yes, I agree the model is somewhat verbose. Which use cases/languages do you use the model for? If it's not English, perhaps I can add more calibration data in that language to improve model quality.

Thanks for the quant!

I'm also running into this issue and need --enforce-eager, although I wonder if the CUDA graphs are just much larger with this model. Maybe because of the hybrid attention?

Via vLLM with the vllm/vllm-openai:qwen3_5 image, it only works with --enforce-eager; otherwise it crashes at startup despite having enough VRAM. SGLang also just crashes at startup.

Could the issue be that vLLM expects AWQ while this checkpoint is compressed-tensors instead?

"config_groups": {
"group_0": {
"format": "pack-quantized",
"input_activations": null,
"output_activations": null,
"targets": [
"Linear"
],
"weights": {
"actorder": null,
"block_structure": null,
"dynamic": false,
"group_size": 32,
"num_bits": 4,
"observer": "mse",
"observer_kwargs": {},
"scale_dtype": null,
"strategy": "group",
"symmetric": true,
"type": "int",
"zp_dtype": null
"kv_cache_scheme": null,
"quant_method": "compressed-tensors",
"quantization_status": "compressed",
"sparsity_config": {},
"transform_config": {},
"version": "0.13.1.a20260223"

Is it possible that when it loads, the weights get unpacked and decompressed to full BF16?
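One way to check which loader path is taken: vLLM dispatches on the quant_method field, so a compressed-tensors checkpoint does not go through the AWQ loader. A small sketch, assuming the snippet above sits under the standard quantization_config key of config.json (a stand-in file is created here for illustration):

```shell
# Minimal stand-in config.json with the field from the thread:
printf '{"quantization_config": {"quant_method": "compressed-tensors"}}' > config.json

# Extract the field vLLM dispatches on:
python3 -c 'import json; print(json.load(open("config.json"))["quantization_config"]["quant_method"])'
# prints: compressed-tensors
```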

I found an issue: when the output contains both Chinese and English, the model adds a space between them, even when instructed not to. It outputs extra spaces and believes it hasn't added any.
