Huihui-Qwen3.5-27B-W8A8-INT8

This is a W8A8 INT8 quantized version of huihui-ai/Huihui-Qwen3.5-27B-abliterated.

Quantization Details

  • Quantization Method: W8A8 INT8 (compressed-tensors format)
  • Base Model: Huihui-Qwen3.5-27B-abliterated (27B parameters)
  • Model Size: ~28GB (down from ~55GB FP16)
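
The roughly 2× size reduction follows from per-parameter storage: 27B parameters at one byte each is about 27GB of INT8 weights, versus two bytes each (~54GB) in FP16. The card does not state how the checkpoint was produced; below is a minimal sketch of the usual way to build a W8A8 INT8 model in the compressed-tensors format with llm-compressor (SmoothQuant + GPTQ). The calibration dataset, sample count, and smoothing strength are assumptions, and import paths differ between llm-compressor releases, so treat this as illustrative rather than the author's actual pipeline.

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "huihui-ai/Huihui-Qwen3.5-27B-abliterated"
SAVE_DIR = "Huihui-Qwen3.5-27B-W8A8-INT8"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# SmoothQuant shifts activation outliers into the weights, then GPTQ
# quantizes weights to INT8; activations are quantized at runtime (W8A8).
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset="open_platypus",  # assumed calibration set, not confirmed by the card
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)  # writes the compressed-tensors config
tokenizer.save_pretrained(SAVE_DIR)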

Usage with vLLM

vllm serve /path/to/Huihui-Qwen3.5-27B-W8A8-INT8 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 8000 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code \
  --enforce-eager

Note: --enforce-eager is required for this quantized model due to CUDA graph compatibility issues with the compressed-tensors format.
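
Once running, the server exposes an OpenAI-compatible API, so any OpenAI client can query it. A minimal sketch, assuming the localhost port and model path from the serve command above (the api_key is a placeholder; vLLM ignores it unless one is configured):

from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/path/to/Huihui-Qwen3.5-27B-W8A8-INT8",  # same path passed to `vllm serve`
    messages=[{"role": "user", "content": "Summarize W8A8 INT8 quantization in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)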

Performance

  • Inference Speed: ~18-19 tokens/second (2x RTX 3090, TP=2)
  • Memory Usage: ~22GB VRAM per GPU
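
To sanity-check the quoted throughput on your own hardware, you can time a streamed completion against the running server; counting streamed content deltas is a rough proxy for generated tokens. A sketch, reusing the hypothetical endpoint and model path from above:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.time()
chunks = 0
stream = client.chat.completions.create(
    model="/path/to/Huihui-Qwen3.5-27B-W8A8-INT8",
    messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1  # each content delta is roughly one token
elapsed = time.time() - start
print(f"~{chunks / elapsed:.1f} tokens/s")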

Limitations

  • Requires the --enforce-eager flag in vLLM (CUDA graphs are disabled)
  • Requires vLLM 0.16.1 or later
  • Uses the compressed-tensors quantization format

Original Model

This model is quantized from huihui-ai/Huihui-Qwen3.5-27B-abliterated.

Please refer to the original model card for more details about the base model's capabilities and license.
