# Huihui-Qwen3.5-27B-W8A8-INT8
This is a W8A8 INT8 quantized version of huihui-ai/Huihui-Qwen3.5-27B-abliterated.
## Quantization Details
- Quantization Method: W8A8 INT8 (compressed-tensors format)
- Base Model: Huihui-Qwen3.5-27B-abliterated (27B parameters)
- Model Size: ~28GB (down from ~55GB FP16)
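For reference, W8A8 INT8 checkpoints in the compressed-tensors format are typically produced with the llm-compressor library. The sketch below is illustrative only, not the exact recipe used for this checkpoint: the calibration dataset, sample count, and SmoothQuant strength are assumptions, and the exact import paths may vary across llm-compressor versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "huihui-ai/Huihui-Qwen3.5-27B-abliterated"
SAVE_DIR = "Huihui-Qwen3.5-27B-W8A8-INT8"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# SmoothQuant migrates activation outliers into the weights, then GPTQ
# quantizes weights and activations to INT8; lm_head stays unquantized.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset="open_platypus",        # assumed calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,    # assumed calibration budget
)

# save_compressed=True writes the compressed-tensors format that vLLM loads.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```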
## Usage with vLLM

```bash
vllm serve /path/to/Huihui-Qwen3.5-27B-W8A8-INT8 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 8000 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code \
  --enforce-eager
```
Note: `--enforce-eager` is required for this quantized model due to CUDA graph compatibility issues with the compressed-tensors format.
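Once the server is running, it exposes an OpenAI-compatible API on the chosen port. Below is a minimal client sketch using the `openai` Python package; the host and prompt are placeholders, and without `--served-model-name` vLLM uses the model path as the model name.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not check the API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/path/to/Huihui-Qwen3.5-27B-W8A8-INT8",  # must match the served name/path
    messages=[{"role": "user", "content": "Explain W8A8 INT8 quantization briefly."}],
    max_tokens=256,
    temperature=0.7,
)

# With --reasoning-parser qwen3, vLLM separates the model's reasoning
# trace from the final answer.
message = response.choices[0].message
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```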
## Performance
- Inference Speed: ~18-19 tokens/second (2x RTX 3090, TP=2)
- Memory Usage: ~22GB VRAM per GPU
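The throughput figure above can be roughly sanity-checked by timing a streamed completion against the running server. A sketch follows; it treats each streamed chunk as approximately one token and includes prefill time in the denominator, so it slightly understates pure decode speed.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
tokens = 0
stream = client.chat.completions.create(
    model="/path/to/Huihui-Qwen3.5-27B-W8A8-INT8",
    messages=[{"role": "user", "content": "Write a short story about a robot."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    # vLLM streams roughly one token per chunk.
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1
elapsed = time.perf_counter() - start
print(f"~{tokens / elapsed:.1f} tokens/s")
```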
## Limitations
- Requires the `--enforce-eager` flag in vLLM (no CUDA graphs; see the offline sketch after this list)
- Compatible with vLLM 0.16.1+
- Uses compressed-tensors quantization format
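The same constraint applies when loading the checkpoint offline through vLLM's Python API. A minimal sketch, assuming the same two-GPU setup as the serving example above:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/Huihui-Qwen3.5-27B-W8A8-INT8",
    tensor_parallel_size=2,   # assumed 2x GPU setup, as in the serving example
    max_model_len=8000,
    enforce_eager=True,       # required: CUDA graphs are disabled for this model
    trust_remote_code=True,
)

outputs = llm.generate(
    ["What does W8A8 quantization change at inference time?"],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```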
## Original Model
This model is quantized from [huihui-ai/Huihui-Qwen3.5-27B-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-27B-abliterated), which is itself a finetune of Qwen/Qwen3.5-27B.
Please refer to the original model card for more details about the base model's capabilities and license.