SmoothMSE with ternary search (original: greedy).

Quantization takes 38 minutes (1.97× speedup over the greedy baseline) on an NVIDIA A100 (20 GB VRAM). See https://github.com/ModelCloud/GPTQModel/pull/2419 for details.
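The PR replaces a greedy scan over candidate smoothing values with a ternary search, which converges much faster when the MSE objective is unimodal in the smoothing parameter. A minimal sketch of ternary search over a scalar objective (the quadratic here is a toy stand-in, not GPTQModel's actual quantization loss):

```python
def ternary_search_min(f, lo, hi, iters=60):
    """Minimize a unimodal function f over [lo, hi].

    Each iteration evaluates f at the two interior third-points and
    discards the outer third on the worse side, shrinking the bracket
    by a factor of 2/3 per step.
    """
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2  # minimum cannot lie in (m2, hi]
        else:
            lo = m1  # minimum cannot lie in [lo, m1)
    return (lo + hi) / 2

# Toy unimodal objective standing in for the per-layer smoothing MSE.
best = ternary_search_min(lambda x: (x - 0.7) ** 2, 0.0, 1.0)
```

A greedy sweep over k candidate values needs k objective evaluations; ternary search reaches the same resolution with O(log k) evaluations, which is where the reported speedup comes from.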

01. Usage

GPTQModel

```python
from gptqmodel import GPTQModel

model = GPTQModel.from_quantized("namgyu-youn/Qwen3-8B-TEST-org", device="cuda:0")
```

vLLM

```python
from vllm import LLM

llm = LLM(model="namgyu-youn/Qwen3-8B-TEST-org", dtype="float16")
```

02. Benchmark

Result:

Accuracy (GSM8K, exact match):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8535|±  |0.0156|
|     |       |strict-match    |     5|exact_match|↑  |0.6270|±  |0.0214|

Throughput: 2.31 requests/s, 2659.53 total tokens/s, 295.50 output tokens/s

Reproduction:

```shell
# Accuracy (GSM8K)
lm_eval --model vllm \
    --model_args pretrained="namgyu-youn/Qwen3-8B-tenary",dtype=float16,gpu_memory_utilization=0.85,enable_thinking=False,max_gen_toks=2048,max_model_len=8192,enforce_eager=True \
    --tasks gsm8k \
    --limit 512 \
    --output_path results \
    --apply_chat_template \
    --batch_size auto
```

# Throughput
```shell
# Throughput
vllm bench throughput \
  --input-len 256 \
  --output-len 256 \
  --model namgyu-youn/Qwen3-8B-tenary \
  --num-prompts 100 \
  --max-model-len 4096 \
  --enforce-eager
```