Quantized using SmoothMSE with greedy search (see also: ternary). Quantization took 65 minutes on an NVIDIA A100 (20 GB VRAM). See https://github.com/ModelCloud/GPTQModel/pull/2419 for details.
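For reference, a minimal sketch of the GPTQModel quantization flow. The base model ID, calibration text, bit width, and group size are assumptions; the SmoothMSE greedy-search options added in the PR above are not shown here.

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Placeholder calibration set; real runs use a few hundred samples.
calibration = ["GPTQ calibrates each layer on sample text like this."]

config = QuantizeConfig(bits=4, group_size=128)  # assumed settings
model = GPTQModel.load("Qwen/Qwen3-8B", config)  # assumed base model
model.quantize(calibration)
model.save("Qwen3-8B-greedy")
```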
## 01. Usage
### GPTQModel

```python
from gptqmodel import GPTQModel

model = GPTQModel.from_quantized("namgyu-youn/Qwen3-8B-greedy", device="cuda:0")
```
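A quick generation check with the model loaded above; the prompt and the use of the repo's tokenizer via `transformers` are assumptions for illustration:

```python
from transformers import AutoTokenizer

# Tokenizer shipped with the quantized repo (assumed to match the base model).
tokenizer = AutoTokenizer.from_pretrained("namgyu-youn/Qwen3-8B-greedy")
inputs = tokenizer("What is 17 * 23?", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```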
### vLLM

```python
from vllm import LLM

llm = LLM(model="namgyu-youn/Qwen3-8B-greedy", dtype="float16")
```
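A minimal offline generation with the engine above (prompt and sampling settings are arbitrary):

```python
from vllm import SamplingParams

# temperature=0.0 gives greedy decoding; max_tokens is an arbitrary cap.
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What is 17 * 23?"], params)
print(outputs[0].outputs[0].text)
```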
## 02. Benchmark Results

Reproduction commands:
```bash
# Accuracy (GSM8K)
lm_eval --model vllm \
    --model_args pretrained="namgyu-youn/Qwen3-8B-greedy",dtype=float16,gpu_memory_utilization=0.85,enable_thinking=False,max_gen_toks=2048,max_model_len=8192,enforce_eager=True \
    --tasks gsm8k \
    --limit 512 \
    --output_path results \
    --apply_chat_template \
    --batch_size auto
```
```bash
# Throughput
vllm bench throughput \
    --input-len 256 \
    --output-len 256 \
    --model namgyu-youn/Qwen3-8B-greedy \
    --num-prompts 100 \
    --max-model-len 4096 \
    --enforce-eager
```
GSM8K accuracy (exact match):
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-------|--------:|--------|-------:|--------|---|------:|---|-------:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.8535 | ± | 0.0156 |
| | | strict-match | 5 | exact_match | ↑ | 0.6270 | ± | 0.0214 |
Throughput: 2.31 requests/s, 2659.84 total tokens/s, 295.54 output tokens/s