# lukey03/Qwen3.5-9B-abliterated GPTQ-Pro 4-bit g128
This is a GPTQ-Pro 4-bit / group-size-128 export of lukey03/Qwen3.5-9B-abliterated, produced with GPTQModel.
## Why this checkpoint
I compared the local BF16 base model, this GPTQ-Pro export, symmetric AWQ GEMM, and asymmetric AWQ GEMM on a single RTX 3090, using warmed generation runs for speed and WikiText-2 perplexity for quality:
| Variant | Perplexity | Warm speed |
|---|---|---|
| BF16 base | 10.2572 | 31.36 tok/s |
| GPTQ-Pro local runtime (`BACKEND.GPTQ_PRO`) | 10.6271 | 4.95 tok/s |
| GPTQ-Pro deployed with Marlin | quality preserved | 31.18 tok/s |
| AWQ GEMM (sym=True) | 37581.1150 | 14.35 tok/s |
| AWQ GEMM (sym=False) | 75868.4507 | 37.14 tok/s |
Result: GPTQ-Pro preserved quality, while AWQ was not quality-safe on this Qwen3.5 family.
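For context on the quality column, WikiText-2 perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation text. A minimal sketch of that reduction (the tokenizer/model plumbing is omitted, and the loss values below are purely illustrative, not from this benchmark):

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Illustrative per-token cross-entropy losses only; real values come from
# running the model over WikiText-2.
losses = [2.0, 2.2, 2.1]
print(round(perplexity(losses), 4))
```

This is why the AWQ numbers in the table are so damaging: a perplexity in the tens of thousands corresponds to a mean per-token loss of roughly 10+ nats, i.e. near-random next-token predictions.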
## Quantization settings
```
bits=4
group_size=128
sym=True
desc_act=False
format=gptq
quant_method=gptq
pack_dtype=int32
```
For Qwen3.5 text checkpoints, quantization should use `batch_size=1`.
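As a sketch, the settings above can be collected into a plain config mapping. The `QuantizeConfig` usage is shown only as a hypothetical comment, since exact field names and signatures may differ across GPTQModel versions; the dict itself just restates the card's values:

```python
# Quantization settings from this card, as a plain dict.
quant_settings = {
    "bits": 4,
    "group_size": 128,
    "sym": True,
    "desc_act": False,
    "format": "gptq",
    "quant_method": "gptq",
    "pack_dtype": "int32",
}

# Hypothetical GPTQModel usage (field names assumed, not verified):
# from gptqmodel import GPTQModel, QuantizeConfig
# cfg = QuantizeConfig(bits=4, group_size=128, sym=True, desc_act=False)
# model = GPTQModel.load("lukey03/Qwen3.5-9B-abliterated", cfg)
# model.quantize(calibration_data, batch_size=1)  # batch_size=1 for Qwen3.5 text

print(sorted(quant_settings))
```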
## Recommended runtime
For correctness or direct local validation, `BACKEND.GPTQ_PRO` works.
For actual deployment, the recommended path is:
- quantize with GPTQ-Pro
- serve with Marlin
- for highest throughput on this model family, use the patched Qwen3.5 vLLM wrapper from GPTQ-Pro
The repo-validated fast path is vLLM + `gptq_marlin` through `scripts/serve_vllm_qwen35.py`. That wrapper auto-detects `qwen3_5_text` from either a local folder or a Hub repo ID.
## Quickstart: GPTQModel + Marlin
```python
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, BACKEND

MODEL_ID = "groxaxo/lukey03-Qwen3.5-9B-abliterated-gptq-pro-w4g128"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = GPTQModel.load(
    MODEL_ID,
    backend=BACKEND.MARLIN,
    device="cuda:0",
    trust_remote_code=True,
)

prompt = "State one short fact about model quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Quickstart: patched vLLM serve path
Clone the matching repo branch first so you get the Qwen3.5 wrapper and patches:
```shell
git clone https://github.com/groxaxo/GPTQ-Pro.git
cd GPTQ-Pro
git checkout gptq-pro-cuda-kernel
```
Then launch the model:
```shell
CUDA_VISIBLE_DEVICES=0 \
python scripts/serve_vllm_qwen35.py \
  --model groxaxo/lukey03-Qwen3.5-9B-abliterated-gptq-pro-w4g128 \
  --served-model-name qwen35-9b-gptq-pro \
  --host 0.0.0.0 \
  --port 8011 \
  --tensor-parallel-size 1
```
Two-GPU shared-host launch:
```shell
CUDA_VISIBLE_DEVICES=0,1 \
python scripts/serve_vllm_qwen35.py \
  --model groxaxo/lukey03-Qwen3.5-9B-abliterated-gptq-pro-w4g128 \
  --served-model-name qwen35-9b-gptq-pro \
  --host 0.0.0.0 \
  --port 8012 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.4
```
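Once the server is up, it should expose vLLM's standard OpenAI-compatible API, assuming the wrapper keeps vLLM's default routes. A minimal stdlib client sketch that builds a chat-completion request for the served model name; the actual network call is left commented out so it only runs against a live server:

```python
import json
import urllib.request

def build_chat_request(base_url="http://localhost:8012/v1"):
    """Build an OpenAI-style chat completion request for the served model."""
    payload = {
        "model": "qwen35-9b-gptq-pro",  # matches --served-model-name above
        "messages": [
            {"role": "user", "content": "State one short fact about model quantization."}
        ],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request()
print(req.full_url)

# To actually query the running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```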
When the fast path is active, the logs should contain:
```
Using MarlinLinearKernel for GPTQMarlinLinearMethod
```
## Speed demon settings for future runs
If you want the same tradeoff in future Qwen3.5 quantization + deployment runs:
- quantize with GPTQ-Pro, `bits=4`, `group_size=128`, `sym=True`, `desc_act=False`
- keep `batch_size=1` during Qwen3.5 quantization
- deploy with `BACKEND.MARLIN` or the patched vLLM wrapper
- prefer `tensor_parallel_size=1` first, then scale to 2 GPUs if needed
- on shared hosts, start with `--gpu-memory-utilization 0.4`
- confirm the logs show `MarlinLinearKernel`
## Notes
- `BACKEND.GPTQ_PRO` is a functional runtime path, but it is not the throughput-optimized serving path.
- I did not publish the AWQ exports for this model family as the recommended release, because both symmetric and asymmetric AWQ failed the quality check on this benchmark setup.