# civil-complaint-exaone-awq

An AWQ W4A16g128 4-bit quantized version of umyunsang/civil-complaint-exaone-merged, optimized for on-device AI deployment.
## Model Tree

```
LGAI-EXAONE/EXAONE-Deep-7.8B (base model)
  |
  |  + umyunsang/civil-complaint-exaone-lora (QLoRA adapter, rank=16)
  v
umyunsang/civil-complaint-exaone-merged (BF16, 14.56 GB)
  |
  |  AWQ W4A16g128 quantization (AutoAWQ)
  v
umyunsang/civil-complaint-exaone-awq (4-bit, 4.94 GB)  <- this model
```
## Model Description

| Item | Value |
|---|---|
| Source model | umyunsang/civil-complaint-exaone-merged |
| Quantization method | AWQ (Activation-aware Weight Quantization) |
| Quantization config | W4A16g128 (4-bit weights, 16-bit activations, group_size=128) |
| Model size | 4.94 GB (safetensors) |
| Compression ratio | 2.95x (BF16 14.56 GB → 4-bit 4.94 GB) |
| Size reduction | 66.1% |
| GPU VRAM | ~5-7 GB (at inference) |
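The compression figures in the table follow directly from the two checkpoint sizes; a quick sanity check:

```python
# Checkpoint sizes reported in the table above (GB).
bf16_gb = 14.56  # merged BF16 model
awq_gb = 4.94    # AWQ 4-bit model

compression_ratio = bf16_gb / awq_gb           # how many times smaller
size_reduction = (1 - awq_gb / bf16_gb) * 100  # percent saved

print(f"{compression_ratio:.2f}x compression, {size_reduction:.1f}% size reduction")
# -> 2.95x compression, 66.1% size reduction
```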
## Intended Use

Korean civil-complaint (minwon) handling support, for on-device/edge AI deployment:

- Complaint classification: environment, traffic, facilities, civil services, welfare, culture, economy, education, safety, other
- Complaint response drafting: polite, clear responses that follow the standard administrative format
- Lightweight deployment: runs on consumer GPUs (8 GB VRAM)
## Usage

### Using AutoAWQ

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "umyunsang/civil-complaint-exaone-awq"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    fuse_layers=True,
    trust_remote_code=True,
    safetensors=True,
)

# Korean instruction: "Analyze the following complaint step by step and write a
# polite, clear response that follows the standard format."
instruction = "다음 민원에 대해 단계적으로 분석하고, 표준 양식에 맞춰 공손하고 명확한 답변을 작성하세요."
# Example complaint: "A pothole has formed on our neighborhood road, making vehicle
# traffic dangerous. Please take prompt action."
complaint = "[Category: traffic]\nComplaint Content: 우리 동네 도로에 포트홀이 생겨서 차량 통행이 위험합니다. 빠른 조치 부탁드립니다."

messages = [{"role": "user", "content": f"{instruction}\n\n{complaint}"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        eos_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
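Because this is a reasoning model (see Limitations), the decoded `response` typically contains a `<thought>...</thought>` block before the final answer. A minimal post-processing sketch; the helper name `strip_thought` is illustrative, not part of the model's API:

```python
import re

def strip_thought(text: str) -> str:
    """Drop the <thought>...</thought> reasoning block and return the final answer."""
    return re.sub(r"<thought>.*?</thought>", "", text, flags=re.DOTALL).strip()

sample = "<thought>\npothole -> route to road maintenance\n</thought>\nAnswer text."
print(strip_thought(sample))  # -> Answer text.
```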
### Using vLLM (recommended)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="umyunsang/civil-complaint-exaone-awq",
    quantization="awq",
    trust_remote_code=True,
)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

# Korean instruction: "Analyze the following complaint step by step and write a
# polite, clear response that follows the standard format."
instruction = "다음 민원에 대해 단계적으로 분석하고, 표준 양식에 맞춰 공손하고 명확한 답변을 작성하세요."
# Example complaint: "A pothole has formed on our neighborhood road, making
# vehicle traffic dangerous."
complaint = "[Category: traffic]\nComplaint Content: 우리 동네 도로에 포트홀이 생겨서 차량 통행이 위험합니다."

prompt = f"[|system|]You are a helpful assistant.[|endofturn|]\n[|user|]{instruction}\n\n{complaint}[|endofturn|]\n[|assistant|]<thought>\n"
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
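The hand-written `prompt` above spells out the EXAONE chat-template markers. A small helper to assemble the same prompt for arbitrary complaints; the function name is illustrative, while the marker strings are taken verbatim from the prompt above:

```python
def build_exaone_prompt(instruction: str, complaint: str,
                        system: str = "You are a helpful assistant.") -> str:
    """Assemble an EXAONE-style chat prompt, pre-opening the <thought> block."""
    return (
        f"[|system|]{system}[|endofturn|]\n"
        f"[|user|]{instruction}\n\n{complaint}[|endofturn|]\n"
        f"[|assistant|]<thought>\n"
    )

prompt = build_exaone_prompt(
    "다음 민원에 대해 답변을 작성하세요.",  # "Write a response to the following complaint."
    "[Category: traffic]\nComplaint Content: ...",
)
```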
## Quantization Details

| Setting | Value |
|---|---|
| Algorithm | AWQ (Activation-aware Weight Quantization) |
| Weight bits | 4-bit |
| Activation bits | 16-bit (FP16) |
| Group size | 128 |
| Zero-point | True |
| Version | GEMM |
| Calibration data | 512 samples (civil-complaint training data) |
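The settings in the table correspond to AutoAWQ's `quant_config` dictionary. A sketch of the config presumably passed to `AutoAWQForCausalLM.quantize` during quantization (key names follow the AutoAWQ convention):

```python
# AutoAWQ quantization config matching the table above.
quant_config = {
    "zero_point": True,   # asymmetric quantization with per-group zero-points
    "q_group_size": 128,  # weights share a scale within groups of 128
    "w_bit": 4,           # 4-bit weight quantization (activations stay FP16)
    "version": "GEMM",    # packed-weight kernel variant
}
print(quant_config)
```

Calibration would then run over the 512 domain samples noted above before saving the packed weights with `save_quantized(...)`.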
## Evaluation Results

| Metric | BF16 (merged) | AWQ 4-bit | Change |
|---|---|---|---|
| Perplexity | - | 3.20 | - |
| BLEU score | - | 17.32 | - |
| ROUGE-L score | - | 18.28 | - |
| Mean inference latency | - | 9.29 s | - |
| Throughput | - | 13.8 tok/s | - |
| GPU VRAM | ~30 GB | ~5-7 GB | -76% |
| Model size | 14.56 GB | 4.94 GB | -66% |
## Hardware Requirements

| Item | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 6 GB | 8 GB |
| RAM | 8 GB | 16 GB |
| GPU | RTX 3060 or better | RTX 3080 or better |
## Limitations

- This is a reasoning model that uses EXAONE-Deep-7.8B's reasoning tags (`<thought>...</thought>`)
- 4-bit quantization may cause a slight quality drop relative to the BF16 original
- The model is specialized for the Korean civil-complaint domain; performance in other domains is not guaranteed
- Generated responses must be reviewed by a responsible officer before use
## License

The EXAONE model is governed by the EXAONE AI Model License Agreement 1.1.