# civil-complaint-exaone-awq

An AWQ W4A16g128 4-bit quantized version of umyunsang/civil-complaint-exaone-merged, optimized for on-device AI deployment.
## Model Tree

```
LGAI-EXAONE/EXAONE-Deep-7.8B (base model)
  |
  |  + umyunsang/civil-complaint-exaone-lora (QLoRA adapter, rank=16)
  v
umyunsang/civil-complaint-exaone-merged (BF16, 14.56 GB)
  |
  |  AWQ W4A16g128 quantization (AutoAWQ)
  v
umyunsang/civil-complaint-exaone-awq (4-bit, 4.94 GB)  <- this model
```
## Model Description

| Item | Value |
|---|---|
| Source model | umyunsang/civil-complaint-exaone-merged |
| Quantization method | AWQ (Activation-aware Weight Quantization) |
| Quantization config | W4A16g128 (4-bit weights, 16-bit activations, group_size=128) |
| Model size | 4.94 GB (safetensors) |
| Compression ratio | 2.95x (BF16 14.56 GB → 4-bit 4.94 GB) |
| Size reduction | 66.1% |
| GPU VRAM | ~5-7 GB (at inference) |
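The compression figures in the table follow directly from the two checkpoint sizes; a quick sanity check:

```python
# Checkpoint sizes reported in the table above (GB).
bf16_gb = 14.56  # merged BF16 model
awq_gb = 4.94    # AWQ 4-bit model

compression_ratio = bf16_gb / awq_gb           # how many times smaller
size_reduction = (1 - awq_gb / bf16_gb) * 100  # percent saved

print(f"{compression_ratio:.2f}x compression, {size_reduction:.1f}% size reduction")
# -> 2.95x compression, 66.1% size reduction
```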
## Intended Use

Korean civil-complaint (minwon) handling support, for on-device/edge AI deployment:

- Complaint classification: environment, traffic, facilities, civil services, welfare, culture, economy, education, safety, other
- Complaint response drafting: polite, clear responses that follow the standard administrative format
- Lightweight deployment: runs on consumer GPUs (8 GB VRAM)
## Usage

### Using AutoAWQ

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "umyunsang/civil-complaint-exaone-awq"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    fuse_layers=True,
    trust_remote_code=True,
    safetensors=True,
)

# Korean instruction: "Analyze the following complaint step by step and write a
# polite, clear response that follows the standard format."
instruction = "다음 민원에 대해 단계적으로 분석하고, 표준 양식에 맞춰 공손하고 명확한 답변을 작성하세요."
# Example complaint: "A pothole has formed on our neighborhood road, making vehicle
# traffic dangerous. Please take prompt action."
complaint = "[Category: traffic]\nComplaint Content: 우리 동네 도로에 포트홀이 생겨서 차량 통행이 위험합니다. 빠른 조치 부탁드립니다."

messages = [{"role": "user", "content": f"{instruction}\n\n{complaint}"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        eos_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
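Because this is a reasoning model (see Limitations), the decoded `response` typically contains a `<thought>...</thought>` block before the final answer. A minimal post-processing sketch; the helper name `strip_thought` is illustrative, not part of the model's API:

```python
import re

def strip_thought(text: str) -> str:
    """Drop the <thought>...</thought> reasoning block and return the final answer."""
    return re.sub(r"<thought>.*?</thought>", "", text, flags=re.DOTALL).strip()

sample = "<thought>\npothole -> route to road maintenance\n</thought>\nAnswer text."
print(strip_thought(sample))  # -> Answer text.
```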
### Using vLLM (recommended)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="umyunsang/civil-complaint-exaone-awq",
    quantization="awq",
    trust_remote_code=True,
)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

# Korean instruction: "Analyze the following complaint step by step and write a
# polite, clear response that follows the standard format."
instruction = "다음 민원에 대해 단계적으로 분석하고, 표준 양식에 맞춰 공손하고 명확한 답변을 작성하세요."
# Example complaint: "A pothole has formed on our neighborhood road, making
# vehicle traffic dangerous."
complaint = "[Category: traffic]\nComplaint Content: 우리 동네 도로에 포트홀이 생겨서 차량 통행이 위험합니다."

prompt = f"[|system|]You are a helpful assistant.[|endofturn|]\n[|user|]{instruction}\n\n{complaint}[|endofturn|]\n[|assistant|]<thought>\n"
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
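The hand-written `prompt` above spells out the EXAONE chat-template markers. A small helper to assemble the same prompt for arbitrary complaints; the function name is illustrative, while the marker strings are taken verbatim from the prompt above:

```python
def build_exaone_prompt(instruction: str, complaint: str,
                        system: str = "You are a helpful assistant.") -> str:
    """Assemble an EXAONE-style chat prompt, pre-opening the <thought> block."""
    return (
        f"[|system|]{system}[|endofturn|]\n"
        f"[|user|]{instruction}\n\n{complaint}[|endofturn|]\n"
        f"[|assistant|]<thought>\n"
    )

prompt = build_exaone_prompt(
    "다음 민원에 대해 답변을 작성하세요.",  # "Write a response to the following complaint."
    "[Category: traffic]\nComplaint Content: ...",
)
```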
## Quantization Details

| Setting | Value |
|---|---|
| Algorithm | AWQ (Activation-aware Weight Quantization) |
| Weight bits | 4-bit |
| Activation bits | 16-bit (FP16) |
| Group size | 128 |
| Zero-point | True |
| Version | GEMM |
| Calibration data | 512 samples (civil-complaint training data) |
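The settings in the table correspond to AutoAWQ's `quant_config` dictionary. A sketch of the config presumably passed to `AutoAWQForCausalLM.quantize` during quantization (key names follow the AutoAWQ convention):

```python
# AutoAWQ quantization config matching the table above.
quant_config = {
    "zero_point": True,   # asymmetric quantization with per-group zero-points
    "q_group_size": 128,  # weights share a scale within groups of 128
    "w_bit": 4,           # 4-bit weight quantization (activations stay FP16)
    "version": "GEMM",    # packed-weight kernel variant
}
print(quant_config)
```

Calibration would then run over the 512 domain samples noted above before saving the packed weights with `save_quantized(...)`.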
## Evaluation Results

| Metric | BF16 (merged) | AWQ 4-bit | Change |
|---|---|---|---|
| Perplexity | - | 3.20 | - |
| BLEU score | - | 17.32 | - |
| ROUGE-L score | - | 18.28 | - |
| Mean inference latency | - | 9.29 s | - |
| Throughput | - | 13.8 tok/s | - |
| GPU VRAM | ~30 GB | ~5-7 GB | -76% |
| Model size | 14.56 GB | 4.94 GB | -66% |
## Hardware Requirements

| Item | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 6 GB | 8 GB |
| RAM | 8 GB | 16 GB |
| GPU | RTX 3060 or better | RTX 3080 or better |
## Limitations

- This is a reasoning model that uses EXAONE-Deep-7.8B's reasoning tags (`<thought>...</thought>`)
- 4-bit quantization may cause a slight quality drop relative to the BF16 original
- The model is specialized for the Korean civil-complaint domain; performance in other domains is not guaranteed
- Generated responses must be reviewed by a responsible officer before use
## License

The EXAONE model is governed by the EXAONE AI Model License Agreement 1.1.