# Huihui-Qwen3.5-27B-abliterated — W4A16 (compressed-tensors)

4-bit weight quantization of huihui-ai/Huihui-Qwen3.5-27B-abliterated using the llm-compressor W4A16 scheme.
## Key specs
| Property | Value |
|---|---|
| Base model | Qwen3.5-27B (abliterated) |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Format | compressed-tensors (vLLM native) |
| Size on disk | 17.6 GB |
| GPU VRAM | ~16 GB (fits RTX 5090 32GB with MTP + KV cache) |
| Calibration | 128 samples from Pile validation set |
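The footprint numbers above can be sanity-checked with back-of-the-envelope arithmetic. A sketch, under loud assumptions: the ~2B full-precision remainder (lm_head, embeddings, norms) is a guess, not taken from the model config, and runtime VRAM adds KV cache and CUDA-graph overhead on top of the weights.

```python
# Rough weight-footprint estimate for a 27B-parameter model at W4A16.
# Assumptions (NOT from the model config): ~27e9 total parameters, of which
# ~2e9 stay at 16-bit (lm_head etc.) and the rest are 4-bit with one fp16
# scale per group of 128 weights.
GiB = 1024**3

params_total = 27e9
params_fp16 = 2e9                     # assumed full-precision remainder
params_int4 = params_total - params_fp16

weights_bytes = params_int4 * 0.5 + params_fp16 * 2   # 4-bit = 0.5 B, fp16 = 2 B
scales_bytes = params_int4 / 128 * 2                  # per-group fp16 scales

total_gib = (weights_bytes + scales_bytes) / GiB
print(f"estimated weight footprint: {total_gib:.1f} GiB")
```

The estimate lands in the same ballpark as the 17.6 GB on disk and ~16 GB VRAM quoted above; the exact gap depends on metadata, tokenizer files, and how much of the model really stays at full precision.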
## Usage with vLLM

```bash
python -m vllm.entrypoints.openai.api_server \
  --model j-a-a-a-y/Huihui-Qwen3.5-27B-abliterated-W4A16-compressed-tensors \
  --served-model-name qwen3.5-27b \
  --dtype float16 \
  --max-model-len 4096 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \
  --performance-mode interactivity
```
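The server speaks the OpenAI-compatible chat completions API. A minimal stdlib-only client sketch — the `localhost:8000` endpoint assumes vLLM's default host/port, and the model name matches `--served-model-name` above:

```python
import json
import urllib.request

# Assumes the launch command above: default endpoint http://localhost:8000
# and served model name "qwen3.5-27b".
API_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "qwen3.5-27b",
    "messages": [
        {"role": "user", "content": "Explain W4A16 quantization in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

def chat(url: str = API_URL) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat() needs a running server; the request body can be inspected offline:
print(json.dumps(payload, indent=2))
```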
## Benchmarks (RTX 5090, 32GB)
Benchmarked with MTP=5 speculative decoding + CUDA graphs + interactivity mode:
| Metric | Value |
|---|---|
| Single request (256 tok) | ~149 tok/s |
| Single request (512 tok) | ~131 tok/s |
| Batch=4 aggregate | ~410 tok/s |
| MTP acceptance rate | 50% |
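The 50% acceptance rate can be related to throughput with standard speculative-decoding arithmetic. A sketch: this models acceptance as independent per draft position, which is a simplification of how vLLM actually reports the metric.

```python
# Expected tokens emitted per target-model verification step with k draft
# tokens and per-position acceptance probability a:
#   E = 1 + a + a^2 + ... + a^k
# (each term is the chance the prefix of drafts up to that position is
# accepted; the leading 1 is the verifier's own "bonus" token).
def expected_tokens_per_step(a: float, k: int) -> float:
    return sum(a**i for i in range(k + 1))

e = expected_tokens_per_step(a=0.5, k=5)
print(f"{e:.2f} tokens per target-model step")  # ~1.97 with a=0.5, k=5
```

Under this toy model, MTP=5 at 50% acceptance yields roughly 2x the tokens per target-model forward pass, consistent with speculative decoding being worthwhile at this acceptance rate.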
## Quantization details
Quantized with vLLM's llm-compressor using the W4A16 scheme:
- Per-group symmetric quantization (group_size=128)
- Activation-aware calibration (128 samples, max_length=512)
- lm_head kept at full precision
Compared to GPTQ W4A16: ~2 GB smaller on disk, ~1.5 GB less VRAM, and the same inference speed (both decode through the Marlin kernel).
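The per-group symmetric scheme listed above can be illustrated in a few lines. A toy sketch of the numerics only, not the llm-compressor implementation; the group is shrunk from 128 weights to 4 for readability:

```python
def quantize_group(weights, bits=4):
    """Symmetric per-group quantization: one fp scale per group,
    integer codes clamped to [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit symmetric
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, codes

def dequantize_group(scale, codes):
    """Recover approximate weights from integer codes and the group scale."""
    return [c * scale for c in codes]

group = [0.12, -0.45, 0.07, 0.31]               # one tiny group (real group_size=128)
scale, codes = quantize_group(group)
approx = dequantize_group(scale, codes)
print("codes :", codes)
print("approx:", [round(x, 3) for x in approx])
```

Each group stores only 4-bit integer codes plus one 16-bit scale, which is where the ~4x weight compression relative to fp16 comes from; activations stay at 16-bit, hence "W4A16".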