# Huihui-Qwen3.5-27B-abliterated – CT-W4A16 (text-only + MTP)

A 4-bit weight-only quantization of huihui-ai/Huihui-Qwen3.5-27B-abliterated.

- **Text-only:** the vision encoder is removed, freeing roughly 0.9 GB of VRAM.
- **MTP preserved:** the multi-token prediction (MTP) head is kept, so speculative decoding works out of the box.
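Stripping a vision tower amounts to dropping every tensor under the encoder's key prefix from the checkpoint. A minimal sketch; the `visual.` prefix is an assumption modeled on Qwen-VL-style checkpoints, so check the actual `state_dict` keys of the model you are converting.

```python
# Sketch: dropping vision-encoder weights from a checkpoint to produce a
# text-only model. The "visual." prefix is an ASSUMPTION (Qwen-VL-style
# naming); verify against the real state_dict before relying on it.

def strip_vision_tower(state_dict, prefix="visual."):
    """Return a copy of the state dict without vision-encoder tensors."""
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}

# Toy example with plain floats standing in for real weight tensors:
ckpt = {
    "model.layers.0.self_attn.q_proj.weight": 0.1,
    "visual.patch_embed.proj.weight": 0.2,
    "mtp.head.weight": 0.3,
}
text_only = strip_vision_tower(ckpt)
print(sorted(text_only))  # vision keys are gone, the MTP head survives
```

The MTP head deliberately does not match the prefix, which is how it stays in the exported checkpoint.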

## Specs

| Property | Value |
|---|---|
| Base model | Qwen3.5-27B (abliterated) |
| Quantization | W4A16 (compressed-tensors, group_size=128) |
| Size on disk | 18.5 GB |
| vs. GPTQ W4A16 | −1.0 GB (no vision encoder) |
| MTP head | Included (BF16, 0.85 GB) |
| Vision encoder | Removed |
| Calibration | 128 samples from the Pile validation set |
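To make the W4A16 scheme concrete: weights are split into groups of 128, each group gets its own 16-bit scale, and values are rounded to signed 4-bit integers while activations stay in 16-bit. A simplified symmetric-quantization sketch, not the actual compressed-tensors kernel:

```python
import numpy as np

# Illustration of group-wise 4-bit weight quantization (group_size=128).
# Each group of 128 weights shares one FP16 scale; values are rounded
# into the signed int4 range [-8, 7]. This is a sketch of the idea,
# not the compressed-tensors implementation.

GROUP_SIZE = 128

def quantize_w4(weights: np.ndarray):
    groups = weights.reshape(-1, GROUP_SIZE)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_w4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4 * GROUP_SIZE).astype(np.float32)
q, s = quantize_w4(w)
w_hat = dequantize_w4(q, s)
print(f"max abs reconstruction error: {np.abs(w - w_hat).max():.4f}")
```

Per-group scales are why group_size matters: smaller groups track local weight magnitudes more closely (lower error) at the cost of more scale storage.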

## Usage with vLLM (RTX 5090 / 32 GB)

```shell
python -m vllm.entrypoints.openai.api_server \
    --model j-a-a-a-y/Huihui-Qwen3.5-27B-abliterated-CT-W4A16 \
    --served-model-name qwen3.5-27b \
    --dtype float16 \
    --max-model-len 1024 \
    --max-num-seqs 4 \
    --gpu-memory-utilization 0.97 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \
    --performance-mode interactivity
```
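Once the server is up, any OpenAI-compatible client can talk to it. A minimal standard-library sketch, assuming vLLM's default endpoint `http://localhost:8000/v1/chat/completions`; the snippet only builds and prints the request body, and the commented lines show how to actually send it:

```python
import json
from urllib import request

# Build a chat-completions request for the server launched above.
# "qwen3.5-27b" matches --served-model-name; the endpoint path is the
# standard OpenAI-compatible route vLLM exposes.
payload = {
    "model": "qwen3.5-27b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
body = json.dumps(payload).encode()
print(body.decode())

# Uncomment to send the request to a running server:
# req = request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```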

## Benchmarks (RTX 5090, MTP=5 + interactivity)

| Metric | GPTQ W4A16 | This model |
|---|---|---|
| Single stream, 256 tok | 149 tok/s | TBD |
| Single stream, 512 tok | 131 tok/s | TBD |
| Batch=4, aggregate | 410 tok/s | TBD |
| MTP acceptance rate | 50% | TBD |
| Available KV cache | 6.35 GiB | ~7.3 GiB (est.) |
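To see what an acceptance rate buys: with `num_speculative_tokens = k` drafts and per-token acceptance probability `p`, a decoding step emits `1 + p + p² + … + pᵏ` tokens in expectation (the target model always contributes at least one). A back-of-the-envelope sketch, assuming independent per-token acceptances and reading the table's 50% as that per-token probability:

```python
# Expected tokens emitted per decoding step with k draft tokens and
# per-token acceptance probability p, assuming independent acceptances:
# 1 + p + p^2 + ... + p^k = (1 - p^(k+1)) / (1 - p).

def expected_tokens_per_step(p: float, k: int) -> float:
    return sum(p**i for i in range(k + 1))

# The table's 50% acceptance with num_speculative_tokens=5:
print(round(expected_tokens_per_step(0.5, 5), 3))  # → 1.969 tokens/step
```

So at 50% acceptance, MTP=5 roughly doubles tokens per forward pass of the target model, ignoring draft-head overhead; most of the gain comes from the first couple of drafts, since the series decays geometrically.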