# lukey03/Qwen3.5-9B-abliterated GPTQ-Pro 4-bit g128
This is a GPTQ-Pro 4-bit / group-size-128 export of lukey03/Qwen3.5-9B-abliterated, produced with GPTQModel.
## Why this checkpoint
I compared the local BF16 base model, this GPTQ-Pro export, symmetric AWQ GEMM, and asymmetric AWQ GEMM on a single RTX 3090, using warmed generation runs for speed and WikiText-2 perplexity for quality:
| Variant | Perplexity | Warm speed |
|---|---|---|
| BF16 base | 10.2572 | 31.36 tok/s |
| GPTQ-Pro local runtime (`BACKEND.GPTQ_PRO`) | 10.6271 | 4.95 tok/s |
| GPTQ-Pro deployed with Marlin | quality preserved | 31.18 tok/s |
| AWQ GEMM (sym=True) | 37581.1150 | 14.35 tok/s |
| AWQ GEMM (sym=False) | 75868.4507 | 37.14 tok/s |
Result: GPTQ-Pro preserved quality, while AWQ was not quality-safe on this Qwen3.5 family.
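For context on the quality column, WikiText-2 perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation text. A minimal sketch of that reduction (the tokenizer/model plumbing is omitted, and the loss values below are purely illustrative, not from this benchmark):

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Illustrative per-token cross-entropy losses only; real values come from
# running the model over WikiText-2.
losses = [2.0, 2.2, 2.1]
print(round(perplexity(losses), 4))
```

This is why the AWQ numbers in the table are so damaging: a perplexity in the tens of thousands corresponds to a mean per-token loss of roughly 10+ nats, i.e. near-random next-token predictions.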
## Quantization settings
```
bits=4
group_size=128
sym=True
desc_act=False
format=gptq
quant_method=gptq
pack_dtype=int32
```
For Qwen3.5 text checkpoints, quantization should use `batch_size=1`.
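As a sketch, the settings above can be collected into a plain config mapping. The `QuantizeConfig` usage is shown only as a hypothetical comment, since exact field names and signatures may differ across GPTQModel versions; the dict itself just restates the card's values:

```python
# Quantization settings from this card, as a plain dict.
quant_settings = {
    "bits": 4,
    "group_size": 128,
    "sym": True,
    "desc_act": False,
    "format": "gptq",
    "quant_method": "gptq",
    "pack_dtype": "int32",
}

# Hypothetical GPTQModel usage (field names assumed, not verified):
# from gptqmodel import GPTQModel, QuantizeConfig
# cfg = QuantizeConfig(bits=4, group_size=128, sym=True, desc_act=False)
# model = GPTQModel.load("lukey03/Qwen3.5-9B-abliterated", cfg)
# model.quantize(calibration_data, batch_size=1)  # batch_size=1 for Qwen3.5 text

print(sorted(quant_settings))
```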
## Recommended runtime
For correctness or direct local validation, `BACKEND.GPTQ_PRO` works.
For actual deployment, the recommended path is:
- quantize with GPTQ-Pro
- serve with Marlin
- for highest throughput on this model family, use the patched Qwen3.5 vLLM wrapper from GPTQ-Pro
The repo-validated fast path is vLLM + `gptq_marlin` through `scripts/serve_vllm_qwen35.py`. That wrapper auto-detects `qwen3_5_text` from either a local folder or a Hub repo ID.
## Quickstart: GPTQModel + Marlin
```python
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, BACKEND

MODEL_ID = "groxaxo/lukey03-Qwen3.5-9B-abliterated-gptq-pro-w4g128"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = GPTQModel.load(
    MODEL_ID,
    backend=BACKEND.MARLIN,
    device="cuda:0",
    trust_remote_code=True,
)

prompt = "State one short fact about model quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Quickstart: patched vLLM serve path
Clone the matching repo branch first so you get the Qwen3.5 wrapper and patches:
```shell
git clone https://github.com/groxaxo/GPTQ-Pro.git
cd GPTQ-Pro
git checkout gptq-pro-cuda-kernel
```
Then launch the model:
```shell
CUDA_VISIBLE_DEVICES=0 \
python scripts/serve_vllm_qwen35.py \
  --model groxaxo/lukey03-Qwen3.5-9B-abliterated-gptq-pro-w4g128 \
  --served-model-name qwen35-9b-gptq-pro \
  --host 0.0.0.0 \
  --port 8011 \
  --tensor-parallel-size 1
```
Two-GPU shared-host launch:
```shell
CUDA_VISIBLE_DEVICES=0,1 \
python scripts/serve_vllm_qwen35.py \
  --model groxaxo/lukey03-Qwen3.5-9B-abliterated-gptq-pro-w4g128 \
  --served-model-name qwen35-9b-gptq-pro \
  --host 0.0.0.0 \
  --port 8012 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.4
```
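Once the server is up, it should expose vLLM's standard OpenAI-compatible API, assuming the wrapper keeps vLLM's default routes. A minimal stdlib client sketch that builds a chat-completion request for the served model name; the actual network call is left commented out so it only runs against a live server:

```python
import json
import urllib.request

def build_chat_request(base_url="http://localhost:8012/v1"):
    """Build an OpenAI-style chat completion request for the served model."""
    payload = {
        "model": "qwen35-9b-gptq-pro",  # matches --served-model-name above
        "messages": [
            {"role": "user", "content": "State one short fact about model quantization."}
        ],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request()
print(req.full_url)

# To actually query the running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```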
When the fast path is active, the logs should contain:
```
Using MarlinLinearKernel for GPTQMarlinLinearMethod
```
## Speed demon settings for future runs
If you want the same tradeoff in future Qwen3.5 quantization + deployment runs:
- quantize with GPTQ-Pro, `bits=4`, `group_size=128`, `sym=True`, `desc_act=False`
- keep `batch_size=1` during Qwen3.5 quantization
- deploy with `BACKEND.MARLIN` or the patched vLLM wrapper
- prefer `tensor_parallel_size=1` first, then scale to 2 GPUs if needed
- on shared hosts, start with `--gpu-memory-utilization 0.4`
- confirm the logs show `MarlinLinearKernel`
## Notes
- `BACKEND.GPTQ_PRO` is a functional runtime path, but it is not the throughput-optimized serving path.
- I did not publish the AWQ exports for this model family as the recommended release, because both symmetric and asymmetric AWQ failed the quality check on this benchmark setup.