lukey03/Qwen3.5-9B-abliterated GPTQ-Pro 4-bit g128

This is a GPTQ-Pro 4-bit / group-size-128 export of lukey03/Qwen3.5-9B-abliterated, produced with GPTQModel.

Why this checkpoint

I compared the local BF16 base model, this GPTQ-Pro export, symmetric AWQ GEMM, and asymmetric AWQ GEMM on a single RTX 3090, measuring WikiText-2 perplexity and warmed generation speed:

| Variant | Perplexity | Warm speed |
|---|---|---|
| BF16 base | 10.2572 | 31.36 tok/s |
| GPTQ-Pro, local runtime (BACKEND.GPTQ_PRO) | 10.6271 | 4.95 tok/s |
| GPTQ-Pro, deployed with Marlin | quality preserved | 31.18 tok/s |
| AWQ GEMM (sym=True) | 37581.1150 | 14.35 tok/s |
| AWQ GEMM (sym=False) | 75868.4507 | 37.14 tok/s |

Result: GPTQ-Pro preserved quality, while AWQ was not quality-safe on this Qwen3.5 family.

Quantization settings

  • bits=4
  • group_size=128
  • sym=True
  • desc_act=False
  • format=gptq
  • quant_method=gptq
  • pack_dtype=int32

For Qwen3.5 text checkpoints, quantization should use batch_size=1.
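The settings above can be reproduced with a GPTQModel-style quantization run. This is a sketch assuming the standard GPTQModel API (`QuantizeConfig`, `load`, `quantize`, `save`); the calibration texts are placeholders, and a real run needs a GPU plus a proper calibration corpus:

```python
# Sketch: reproducing this export's settings with a GPTQModel-style API.
# The two calibration strings below are placeholders -- use a real corpus.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    sym=True,
    desc_act=False,
)

model = GPTQModel.load("lukey03/Qwen3.5-9B-abliterated", quant_config)

calibration_dataset = [
    "Quantization maps model weights to low-bit integers.",
    "Group-wise scales limit the error introduced by rounding.",
]

# batch_size=1 is required for Qwen3.5 text checkpoints, as noted above.
model.quantize(calibration_dataset, batch_size=1)
model.save("Qwen3.5-9B-abliterated-gptq-pro-w4g128")
```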

Recommended runtime

For correctness or direct local validation, BACKEND.GPTQ_PRO works.

For actual deployment, the recommended path is:

  1. quantize with GPTQ-Pro
  2. serve with Marlin
  3. for highest throughput on this model family, use the patched Qwen3.5 vLLM wrapper from GPTQ-Pro

The repo-validated fast path is vLLM + gptq_marlin through:

  • scripts/serve_vllm_qwen35.py

That wrapper now auto-detects qwen3_5_text from either a local folder or a Hub repo ID.
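Detection of this kind typically keys off `model_type` in the checkpoint's config.json. The wrapper's actual logic lives in the repo; as an illustrative stdlib-only sketch of the local-folder case (the function name is made up, not the wrapper's API):

```python
import json
from pathlib import Path


def detect_model_type(model_source: str) -> str:
    """Read model_type from a local config.json if the path exists.

    Illustrative sketch only -- not the wrapper's actual code. For a Hub
    repo ID, huggingface_hub's hf_hub_download would fetch config.json;
    that branch is omitted to keep the sketch self-contained.
    """
    local_config = Path(model_source) / "config.json"
    if local_config.is_file():
        return json.loads(local_config.read_text())["model_type"]
    raise ValueError(f"{model_source} is not a local model folder")
```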

Quickstart: GPTQModel + Marlin

```python
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, BACKEND

MODEL_ID = "groxaxo/lukey03-Qwen3.5-9B-abliterated-gptq-pro-w4g128"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = GPTQModel.load(
    MODEL_ID,
    backend=BACKEND.MARLIN,
    device="cuda:0",
    trust_remote_code=True,
)

prompt = "State one short fact about model quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Quickstart: patched vLLM serve path

Clone the matching repo branch first so you get the Qwen3.5 wrapper and patches:

```shell
git clone https://github.com/groxaxo/GPTQ-Pro.git
cd GPTQ-Pro
git checkout gptq-pro-cuda-kernel
```

Then launch the model:

```shell
CUDA_VISIBLE_DEVICES=0 \
python scripts/serve_vllm_qwen35.py \
  --model groxaxo/lukey03-Qwen3.5-9B-abliterated-gptq-pro-w4g128 \
  --served-model-name qwen35-9b-gptq-pro \
  --host 0.0.0.0 \
  --port 8011 \
  --tensor-parallel-size 1
```

Two-GPU shared-host launch:

```shell
CUDA_VISIBLE_DEVICES=0,1 \
python scripts/serve_vllm_qwen35.py \
  --model groxaxo/lukey03-Qwen3.5-9B-abliterated-gptq-pro-w4g128 \
  --served-model-name qwen35-9b-gptq-pro \
  --host 0.0.0.0 \
  --port 8012 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.4
```
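Either launch exposes vLLM's OpenAI-compatible HTTP API. A stdlib-only sketch of building a chat request against it (the port and served model name match the single-GPU launch above; actually sending the request assumes the server is running):

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible /v1/chat/completions POST for the vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8011",
    "qwen35-9b-gptq-pro",
    "State one short fact about model quantization.",
)
# urllib.request.urlopen(req) would send it once the server is up.
```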

When the fast path is active, the logs should contain:

Using MarlinLinearKernel for GPTQMarlinLinearMethod
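A launch script can assert this automatically by scanning the captured server log for that exact line. A generic sketch (not part of the repo):

```python
# Marker line emitted by vLLM when the Marlin fast path is selected.
EXPECTED_KERNEL_LINE = "Using MarlinLinearKernel for GPTQMarlinLinearMethod"


def marlin_fast_path_active(log_text: str) -> bool:
    """Return True if the captured vLLM log shows the Marlin kernel in use."""
    return EXPECTED_KERNEL_LINE in log_text
```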

Speed demon settings for future runs

If you want the same tradeoff in future Qwen3.5 quantization + deployment runs:

  • quantize with GPTQ-Pro, bits=4, group_size=128, sym=True, desc_act=False
  • keep batch_size=1 during Qwen3.5 quantization
  • deploy with BACKEND.MARLIN or the patched vLLM wrapper
  • prefer tensor_parallel_size=1 first, then scale to 2 GPUs if needed
  • on shared hosts, start with --gpu-memory-utilization 0.4
  • confirm the logs show MarlinLinearKernel

Notes

  • BACKEND.GPTQ_PRO is a functional runtime path, but it is not the throughput-optimized serving path.
  • I did not publish the AWQ exports for this model family as the recommended release because both symmetric and asymmetric AWQ failed the quality check on this benchmark setup.