GovOn-EXAONE-AWQ-v2

Introduction

GovOn-EXAONE-AWQ-v2 is an optimized 4-bit quantized version of GovOn-EXAONE-Merged-v2, designed for on-device, low-latency deployment in civil service environments.

By applying AWQ (Activation-aware Weight Quantization) with a W4A16g128 configuration, we reduced the model size by 66.1% (from 14.56 GB to 4.94 GB) while preserving domain-specific performance. This enables high-quality Korean civil complaint assistance on consumer-grade GPUs with as little as 8 GB of VRAM.

Quickstart

We recommend using vLLM or AutoAWQ for optimized inference.

Using AutoAWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "umyunsang/GovOn-EXAONE-AWQ-v2"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True, trust_remote_code=True)

# Inference mirrors GovOn-EXAONE-Merged-v2; a minimal sketch (assumes a CUDA device):
messages = [{"role": "user", "content": "여권 재발급 절차를 알려주세요."}]  # "How do I renew my passport?"
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
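
Using vLLM

As recommended above, the model can also be served with vLLM. A hedged deployment sketch (flags may vary across vLLM versions; the AWQ method is normally auto-detected from the checkpoint config, but can be pinned explicitly):

```shell
# Serve the quantized model behind vLLM's OpenAI-compatible API (requires a CUDA GPU).
vllm serve umyunsang/GovOn-EXAONE-AWQ-v2 \
    --quantization awq \
    --trust-remote-code \
    --max-model-len 4096
```

By default the server listens on port 8000 and exposes an OpenAI-compatible /v1 endpoint.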

Specifications

Model Details

  • Source Model: umyunsang/GovOn-EXAONE-Merged-v2
  • Quantization Method: AWQ (Weight-only 4-bit)
  • Config: W4A16, Group Size 128, Zero Point True
  • Model Size: 4.94 GB (BF16 Original: 14.56 GB)
  • VRAM Required: ~6.5 GB (Inference)
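
The W4A16g128 configuration stores each weight in 4 bits plus one FP16 scale and one 4-bit zero point per group of 128 weights. A back-of-the-envelope estimate of the per-weight storage cost (exact packing varies by AWQ kernel version, so treat this as an approximation):

```python
# Per-weight storage under W4A16, group size 128, zero point enabled.
# Assumes one FP16 scale (16 bits) and one 4-bit zero point per 128-weight group.
w_bit = 4
group_size = 128
scale_bits = 16  # FP16 scale shared by each group
zp_bits = 4      # 4-bit zero point shared by each group

bits_per_weight = w_bit + (scale_bits + zp_bits) / group_size
print(f"{bits_per_weight:.3f} bits/weight")  # ~4.156, vs. 16 bits for BF16
```

The on-disk 4.94 GB is somewhat larger than a pure 4-bit estimate because some tensors (e.g., embeddings) are typically kept in higher precision.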

Efficiency

  • Compression Ratio: 2.95x
  • Size Reduction: 66.1%
  • Calibration: 512 domain-specific civil complaint samples
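
The compression figures follow directly from the two checkpoint sizes quoted above; a quick sanity check:

```python
# Verify the compression ratio and size reduction from the stated sizes.
original_gb = 14.56   # BF16 checkpoint
quantized_gb = 4.94   # AWQ W4A16g128 checkpoint

ratio = original_gb / quantized_gb
reduction_pct = (1 - quantized_gb / original_gb) * 100
print(f"{ratio:.2f}x compression, {reduction_pct:.1f}% size reduction")
# → 2.95x compression, 66.1% size reduction
```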

Limitations and Usage

  1. Quantization Loss: While AWQ minimizes performance degradation, slight deviations in chain-of-thought (<thought>) output or nuanced reasoning may occur compared to the BF16 version.
  2. Infrastructure: Optimized for NVIDIA GPUs (Ampere architecture or newer recommended).

License

This model is licensed under the Apache License 2.0. However, users must also comply with the EXAONE AI Model License Agreement of the base model.
