These are NOT actual AWQ-quantized models.
Heads up! Despite the "AWQ" tag in the title, the config.json reveals these models are using standard compressed-tensors (W4A16) rather than the AWQ (Activation-aware Weight Quantization) method. Real AWQ requires an activation calibration process and specific scaling factors, which are missing here. This is misleading for users looking for actual AWQ kernels.
AWQ is the algorithm used to quantize this model, whereas compressed-tensors is the serialization format (i.e., the weight_packed, weight_scale, weight_zero_point, and weight_shape tensors) in which the model is saved after quantization.
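One way to see the algorithm/format distinction is to inspect quantization_config in config.json. A minimal sketch, assuming the field names follow the compressed-tensors layout (the JSON below is an illustrative fragment, not copied from these models):

```python
import json

# Illustrative config.json fragment (assumed structure, not taken
# verbatim from the models under discussion).
raw = """
{
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "format": "pack-quantized",
    "config_groups": {
      "group_0": {
        "weights": {"num_bits": 4, "type": "int", "group_size": 128}
      }
    }
  }
}
"""

qc = json.loads(raw)["quantization_config"]
# quant_method names the storage format here; an "AWQ" tag on the model
# card refers to the algorithm that computed the scales before packing.
print(qc["quant_method"])
```

So seeing "compressed-tensors" here does not by itself mean AWQ was skipped; it only tells you how the quantized weights are stored.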
As for the kernels used at inference time, vLLM uses the same Marlin kernel for both the compressed-tensors and AutoAWQ formats, just via different code paths.
https://github.com/vllm-project/llm-compressor/blob/main/examples/awq/README.md