These are NOT actual AWQ-quantized models.
Heads up! Despite the "AWQ" tag in the title, the config.json reveals these models are using standard compressed-tensors (W4A16) rather than the AWQ (Activation-aware Weight Quantization) method. Real AWQ requires an activation calibration process and specific scaling factors, which are missing here. This is misleading for users looking for actual AWQ kernels.
AWQ is the algorithm used to quantize this model, whereas compressed-tensors is the serialization format (i.e., the weight_packed, weight_scale, weight_zero_point, and weight_shape tensors) in which the model is saved after quantization.
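One way to see the algorithm/format distinction is to inspect quantization_config in config.json. A minimal sketch, assuming the field names follow the compressed-tensors layout (the JSON below is an illustrative fragment, not copied from these models):

```python
import json

# Illustrative config.json fragment (assumed structure, not taken
# verbatim from the models under discussion).
raw = """
{
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "format": "pack-quantized",
    "config_groups": {
      "group_0": {
        "weights": {"num_bits": 4, "type": "int", "group_size": 128}
      }
    }
  }
}
"""

qc = json.loads(raw)["quantization_config"]
# quant_method names the storage format here; an "AWQ" tag on the model
# card refers to the algorithm that computed the scales before packing.
print(qc["quant_method"])
```

So seeing "compressed-tensors" here does not by itself mean AWQ was skipped; it only tells you how the quantized weights are stored.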
As for the kernels used at inference time, vLLM uses the same Marlin kernel for both the compressed-tensors and AutoAWQ formats, just via different code paths.
https://github.com/vllm-project/llm-compressor/blob/main/examples/awq/README.md