EXAONE-4.0-1.2B AWQ (MLP) + GPTQ (Attention) W4A16
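W4A16 means the weights are stored in 4 bits while activations stay in 16-bit; both AWQ and GPTQ keep per-group scales (and zero points) to dequantize weights on the fly. The sketch below illustrates asymmetric group-wise 4-bit quantization in NumPy; the group size of 128 and the unpacked uint8 storage are illustrative assumptions, not the exact packing layout of this checkpoint.

```python
import numpy as np

# Illustrative W4A16 group quantization (group size 128, asymmetric).
# Assumed group size; real AWQ/GPTQ checkpoints pack int4 values into int32.
GROUP = 128

def quantize_w4(w):
    """Quantize a (rows, cols) float weight matrix to int4 per column group."""
    rows, cols = w.shape
    wg = w.reshape(rows, cols // GROUP, GROUP)
    wmin = wg.min(axis=-1, keepdims=True)
    wmax = wg.max(axis=-1, keepdims=True)
    scale = (wmax - wmin) / 15.0                 # 4-bit range: 0..15
    zero = np.round(-wmin / scale)
    q = np.clip(np.round(wg / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_w4(q, scale, zero):
    """Recover a float matrix; this happens per-tile inside the W4A16 kernel."""
    return ((q.astype(np.float32) - zero) * scale).reshape(q.shape[0], -1)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, s, z = quantize_w4(w)
w_hat = dequantize_w4(q, s, z)
err = np.abs(w - w_hat).max()
print(f"max abs reconstruction error: {err:.4f}")
```

The maximum error per weight is bounded by half a quantization step (scale / 2), which is why smaller groups trade a little extra scale storage for lower error.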

01. Quick Start

from vllm import LLM

llm = LLM(model="namgyu-youn/EXAONE-4.0-1.2B-LLMC-AWQ-W4", dtype="bfloat16", kv_cache_dtype="fp8")

02. Benchmark

GSM8K accuracy (lm-eval, 5-shot, limit 512):
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.0098|±  |0.0044|
|     |       |strict-match    |     5|exact_match|↑  |0.0000|±  |0.0000|

Throughput (256-token input / 256-token output, 100 prompts): 5.01 requests/s, 5774.56 total tokens/s, 641.62 output tokens/s

Reproduction (first command: GSM8K accuracy, second: throughput):

lm_eval --model vllm \
    --model_args pretrained=namgyu-youn/EXAONE-4.0-1.2B-LLMC-AWQ-W4,dtype=float16,gpu_memory_utilization=0.85,enable_thinking=False,max_gen_toks=2048,max_model_len=8192,enforce_eager=True \
    --tasks gsm8k \
    --limit 512 \
    --output_path results \
    --apply_chat_template \
    --batch_size auto
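The lm_eval run above writes a results JSON under `--output_path`. A minimal sketch of pulling the GSM8K scores back out of that JSON follows; the `"metric,filter"` key naming follows lm-evaluation-harness output, and the inline sample dict (mirroring the table above) stands in for the real file.

```python
import json

# Sample payload mirroring lm-eval's results JSON; in practice you would
# json.load() the file written under --output_path.
sample = json.loads("""
{
  "results": {
    "gsm8k": {
      "exact_match,flexible-extract": 0.0098,
      "exact_match,strict-match": 0.0
    }
  }
}
""")

def gsm8k_scores(results: dict) -> dict:
    """Return the exact_match metrics for gsm8k, keyed as 'metric,filter'."""
    task = results["results"]["gsm8k"]
    return {k: v for k, v in task.items() if k.startswith("exact_match")}

print(gsm8k_scores(sample))
```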

# For gptqmodel checkpoints (native GPTQ format), omit --quantization
# (vLLM auto-selects gptq_marlin)
vllm bench throughput \
  --input-len 256 \
  --output-len 256 \
  --model namgyu-youn/EXAONE-4.0-1.2B-LLMC-AWQ-W4 \
  --num-prompts 100 \
  --max-model-len 4096 \
  --enforce-eager
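A back-of-envelope check of the memory saving: 1.2B parameters at 4 bits per weight is roughly a quarter of the BF16 footprint, plus a small overhead for the per-group scales and zero points. The group size of 128 and the 4-bytes-per-group overhead are assumptions for the estimate, not measured values.

```python
# Back-of-envelope weight memory for a 1.2B-parameter model (estimates only).
params = 1.2e9
group_size = 128  # assumed quantization group size

bf16_gb = params * 2 / 1e9        # 2 bytes per param
w4_gb = params * 0.5 / 1e9        # 4 bits = 0.5 byte per param
# assume an fp16 scale + zero pair (~4 bytes) per group of 128 params
overhead_gb = params / group_size * 4 / 1e9

print(f"BF16 weights: {bf16_gb:.2f} GB")
print(f"W4 weights  : {w4_gb:.2f} GB (+{overhead_gb:.3f} GB scales/zeros)")
```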
Safetensors model size: 0.6B params (tensor types: I64 · I32 · BF16)
