GovOn-EXAONE-AWQ-v2

Introduction

GovOn-EXAONE-AWQ-v2 is an optimized 4-bit quantized version of GovOn-EXAONE-Merged-v2, designed for on-device, low-latency deployment in civil service environments.

By applying AWQ (Activation-aware Weight Quantization) with a W4A16g128 configuration, we reduced the model size by 66.1% (from 14.56 GB to 4.94 GB) while preserving domain-specific performance. This enables high-quality Korean civil complaint assistance on consumer-grade GPUs with as little as 8 GB of VRAM.

Quickstart

We recommend using vLLM or AutoAWQ for optimized inference.

Using AutoAWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "umyunsang/GovOn-EXAONE-AWQ-v2"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True, trust_remote_code=True)

# Inference mirrors GovOn-EXAONE-Merged-v2; a minimal sketch (assumes a CUDA device):
messages = [{"role": "user", "content": "여권 재발급 절차를 알려주세요."}]  # "How do I renew my passport?"
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
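
Using vLLM

As recommended above, the model can also be served with vLLM. A hedged deployment sketch (flags may vary across vLLM versions; the AWQ method is normally auto-detected from the checkpoint config, but can be pinned explicitly):

```shell
# Serve the quantized model behind vLLM's OpenAI-compatible API (requires a CUDA GPU).
vllm serve umyunsang/GovOn-EXAONE-AWQ-v2 \
    --quantization awq \
    --trust-remote-code \
    --max-model-len 4096
```

By default the server listens on port 8000 and exposes an OpenAI-compatible /v1 endpoint.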

Specifications

Model Details

  • Source Model: umyunsang/GovOn-EXAONE-Merged-v2
  • Quantization Method: AWQ (Weight-only 4-bit)
  • Config: W4A16, Group Size 128, Zero Point True
  • Model Size: 4.94 GB (BF16 Original: 14.56 GB)
  • VRAM Required: ~6.5 GB (Inference)
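
The W4A16g128 configuration stores each weight in 4 bits plus one FP16 scale and one 4-bit zero point per group of 128 weights. A back-of-the-envelope estimate of the per-weight storage cost (exact packing varies by AWQ kernel version, so treat this as an approximation):

```python
# Per-weight storage under W4A16, group size 128, zero point enabled.
# Assumes one FP16 scale (16 bits) and one 4-bit zero point per 128-weight group.
w_bit = 4
group_size = 128
scale_bits = 16  # FP16 scale shared by each group
zp_bits = 4      # 4-bit zero point shared by each group

bits_per_weight = w_bit + (scale_bits + zp_bits) / group_size
print(f"{bits_per_weight:.3f} bits/weight")  # ~4.156, vs. 16 bits for BF16
```

The on-disk 4.94 GB is somewhat larger than a pure 4-bit estimate because some tensors (e.g., embeddings) are typically kept in higher precision.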

Efficiency

  • Compression Ratio: 2.95x
  • Size Reduction: 66.1%
  • Calibration: 512 domain-specific civil complaint samples
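
The compression figures follow directly from the two checkpoint sizes quoted above; a quick sanity check:

```python
# Verify the compression ratio and size reduction from the stated sizes.
original_gb = 14.56   # BF16 checkpoint
quantized_gb = 4.94   # AWQ W4A16g128 checkpoint

ratio = original_gb / quantized_gb
reduction_pct = (1 - quantized_gb / original_gb) * 100
print(f"{ratio:.2f}x compression, {reduction_pct:.1f}% size reduction")
# → 2.95x compression, 66.1% size reduction
```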

Limitations and Usage

  1. Quantization Loss: While AWQ minimizes performance degradation, slight deviations in chain-of-thought (<thought>) output or nuanced reasoning may occur compared to the BF16 version.
  2. Infrastructure: Optimized for NVIDIA GPUs (Ampere architecture or newer recommended).

License

This model is licensed under the Apache License 2.0. However, users must also comply with the EXAONE AI Model License Agreement of the base model.
