# Huihui-Qwen3.5-27B-abliterated — W4A16 (compressed-tensors)

4-bit weight quantization of huihui-ai/Huihui-Qwen3.5-27B-abliterated using the llm-compressor W4A16 scheme.

## Key specs

| Property | Value |
|---|---|
| Base model | Qwen3.5-27B (abliterated) |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Format | compressed-tensors (vLLM native) |
| Size on disk | 17.6 GB |
| GPU VRAM | ~16 GB (fits an RTX 5090 32 GB with MTP + KV cache) |
| Calibration | 128 samples from the Pile validation set |

## Usage with vLLM

```bash
python -m vllm.entrypoints.openai.api_server \
    --model j-a-a-a-y/Huihui-Qwen3.5-27B-abliterated-W4A16-compressed-tensors \
    --served-model-name qwen3.5-27b \
    --dtype float16 \
    --max-model-len 4096 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \
    --performance-mode interactivity
```
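
Once the server is running it exposes the standard OpenAI-compatible API. A minimal client sketch, assuming the `openai` Python package is installed and the server is on the default port 8000 (the model name matches `--served-model-name` above):

```python
# Minimal client sketch against the vLLM OpenAI-compatible server started above.
# Assumes the server is listening on http://localhost:8000 (vLLM's default).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="qwen3.5-27b",  # matches --served-model-name
    messages=[{"role": "user", "content": "Explain W4A16 quantization in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```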

## Benchmarks (RTX 5090, 32 GB)

Benchmarked with MTP=5 speculative decoding + CUDA graphs + interactivity mode:

| Metric | Value |
|---|---|
| Single request (256 tok) | ~149 tok/s |
| Single request (512 tok) | ~131 tok/s |
| Batch=4 aggregate | ~410 tok/s |
| MTP acceptance rate | 50% |
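
The single-stream numbers can be sanity-checked with a quick streaming measurement against the running server. This is only a rough sketch, not the harness used for the table above; it counts streamed chunks as a proxy for tokens, and with speculative decoding a chunk can carry more than one token, so treat the result as a lower bound:

```python
# Rough single-stream throughput check against the running server.
# NOT the exact benchmark harness behind the table above; it counts streamed
# chunks as an approximation of generated tokens.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
tokens = 0
stream = client.chat.completions.create(
    model="qwen3.5-27b",
    messages=[{"role": "user", "content": "Write a 300-word story about a lighthouse."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1
elapsed = time.perf_counter() - start
print(f"~{tokens / elapsed:.0f} tok/s (chunk-level approximation)")
```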

## Quantization details

Quantized with vLLM's llm-compressor using the W4A16 scheme (a reproduction sketch follows the list below):

- Per-group symmetric quantization (group_size=128)
- Activation-aware calibration (128 samples, max_length=512)
- lm_head kept at full precision
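
For reference, a oneshot llm-compressor pass along these lines would look roughly like the sketch below. This is an assumption-laden reconstruction, not the exact script used for this repo: the GPTQModifier-based calibration pass and the dataset identifier are illustrative (import paths also vary across llm-compressor versions), while the sample count, sequence length, and lm_head exclusion mirror the settings listed above.

```python
# Sketch of a W4A16 llm-compressor recipe mirroring the settings above.
# NOT the exact script used for this repo; the dataset string is a placeholder
# (this card's calibration used 128 samples from the Pile validation set).
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

oneshot(
    model="huihui-ai/Huihui-Qwen3.5-27B-abliterated",
    dataset="open_platypus",          # placeholder calibration dataset
    recipe=GPTQModifier(
        targets="Linear",
        scheme="W4A16",               # 4-bit weights, group_size=128, symmetric
        ignore=["lm_head"],           # keep lm_head at full precision
    ),
    num_calibration_samples=128,
    max_seq_length=512,
    output_dir="Huihui-Qwen3.5-27B-abliterated-W4A16",
)
```

The resulting checkpoint is saved in the compressed-tensors format, which vLLM loads natively.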

Compared to GPTQ W4A16: ~2 GB smaller on disk, ~1.5 GB less VRAM, and the same inference speed (both formats use the Marlin kernel).
