SmoothMSE with ternary search (original: greedy search).
Quantized in 38 minutes (1.97x speedup) on an NVIDIA A100 (20GB VRAM). See https://github.com/ModelCloud/GPTQModel/pull/2419 for details.
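To illustrate the idea, here is a minimal sketch of replacing a greedy/grid scan with ternary search when minimizing quantization MSE over a scale parameter. This is not GPTQModel's actual implementation; `quant_mse` and `ternary_search_scale` are hypothetical names, and the objective is only approximately unimodal because of rounding, so treat this as a conceptual sketch.

```python
# Hedged sketch: ternary search over an (approximately) unimodal
# quantization MSE, instead of greedily scanning a grid of scales.
# Function names and the symmetric 4-bit grid are illustrative assumptions.
import numpy as np

def quant_mse(w: np.ndarray, scale: float, bits: int = 4) -> float:
    # Quantize-dequantize with a symmetric integer grid, return reconstruction MSE.
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return float(np.mean((w - q * scale) ** 2))

def ternary_search_scale(w: np.ndarray, lo: float, hi: float, iters: int = 60) -> float:
    # Each iteration discards a third of the bracket, so the search needs
    # O(log((hi - lo) / tol)) probes where a grid scan needs O(grid size).
    # Rounding makes the MSE only piecewise-smooth, so the result is approximate.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if quant_mse(w, m1) < quant_mse(w, m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2
```

The speedup in the PR comes from this logarithmic-vs-linear probe count, applied per quantized group.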
01. Usage
GPTQModel
```python
from gptqmodel import GPTQModel

model = GPTQModel.from_quantized("namgyu-youn/Qwen3-8B-TEST-org", device="cuda:0")
```
vLLM
```python
from vllm import LLM

llm = LLM(model="namgyu-youn/Qwen3-8B-TEST-org", dtype="float16")
```
02. Benchmark
result:
Accuracy (lm-eval, GSM8K exact match):
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8535|± |0.0156|
| | |strict-match | 5|exact_match|↑ |0.6270|± |0.0214|
Throughput: 2.31 requests/s, 2659.53 total tokens/s, 295.50 output tokens/s
repro:
```shell
# Accuracy (GSM8K)
lm_eval --model vllm \
  --model_args pretrained="namgyu-youn/Qwen3-8B-tenary",dtype=float16,gpu_memory_utilization=0.85,enable_thinking=False,max_gen_toks=2048,max_model_len=8192,enforce_eager=True \
  --tasks gsm8k \
  --limit 512 \
  --output_path results \
  --apply_chat_template \
  --batch_size auto
```
```shell
# Throughput
vllm bench throughput \
  --input-len 256 \
  --output-len 256 \
  --model namgyu-youn/Qwen3-8B-tenary \
  --num-prompts 100 \
  --max-model-len 4096 \
  --enforce-eager
```