AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (paper: arXiv 2306.00978)
Lightweight Korean fortune-telling model - Yeji-4B-rsLoRA-v8, AWQ W4A16 quantized
Yeji-4B-rsLoRA-v8-AWQ is the AWQ W4A16 quantized version of **yeji-4b-rslora-v8**. 4-bit weight quantization cuts memory use to roughly a quarter while keeping the accuracy loss under 2%.
| Property | Value |
|---|---|
| Base model | tellang/yeji-4b-rslora-v8 |
| Quantization scheme | AWQ W4A16, symmetric |
| Quantization tool | llmcompressor 0.9.0+ |
| Model size | ~1.5 GB |
| VRAM requirement | ~3-4 GB |
| Accuracy loss | < 2% |
| License | Apache-2.0 |
AWQ is a weight-quantization technique that takes activation distributions into account: the few weight channels that are multiplied by large activations matter most for output quality, so AWQ rescales those channels before quantizing to 4 bits instead of treating all channels equally.
Advantages:
- Post-training: no retraining or backpropagation required, only a small calibration set
- Roughly 4x smaller weights with minimal accuracy loss
- W4A16 (4-bit weights, 16-bit activations) runs on well-supported GEMM kernels, e.g. in vLLM
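The intuition can be sketched in a few lines of pure Python. This is a toy illustration, not the paper's method: the numbers are made up, and the scaling exponent is fixed at 0.5 whereas real AWQ searches for per-channel scales using calibration data.

```python
# Toy sketch of the AWQ intuition: scale up weight channels that meet large
# activations before int4 quantization, then fold the scale back out.
def quantize_row(weights, act_mags, alpha=0.5):
    # per-channel scale s_i = act_mag_i ** alpha (alpha=0 disables scaling)
    s = [m ** alpha for m in act_mags]
    scaled = [w * si for w, si in zip(weights, s)]
    step = max(abs(v) for v in scaled) / 7  # one group, symmetric int4
    deq = []
    for v, si in zip(scaled, s):
        q = max(-8, min(7, round(v / step)))  # int4 levels are [-8, 7]
        deq.append(q * step / si)             # dequantize, unfold the scale
    return deq

weights  = [0.02, -0.5, 1.2, 0.03, -0.04, 0.9]
act_mags = [9.0, 0.2, 0.1, 8.0, 7.5, 0.3]  # channels 0, 3, 4 are salient

def output_error(deq):
    # activation-weighted weight error: a proxy for the error in W @ x
    return sum(abs(d - w) * m for d, w, m in zip(deq, weights, act_mags))

err_plain = output_error(quantize_row(weights, act_mags, alpha=0.0))
err_awq   = output_error(quantize_row(weights, act_mags, alpha=0.5))
print(err_awq < err_plain)  # scaling protects the salient channels
```

With these toy values, plain round-to-nearest zeroes out the small-but-salient weights (channels 0, 3, 4), while the activation-aware scaling keeps them representable, so the weighted error drops.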
# Python 3.11+ recommended
pip install "vllm>=0.13.0"  # includes AWQ support
vllm serve tellang/yeji-4b-rslora-v8-AWQ \
--host 0.0.0.0 \
--port 8001 \
--dtype auto \
--quantization awq \
--max-model-len 4096 \
--gpu-memory-utilization 0.9
OpenAI-compatible API call:
import openai
client = openai.OpenAI(
    base_url="http://localhost:8001/v1",
    api_key="EMPTY",
)
completion = client.chat.completions.create(
    model="tellang/yeji-4b-rslora-v8-AWQ",
    messages=[
        {"role": "system", "content": "You are an AI that tells fortunes."},
        {"role": "user", "content": "Tell me today's fortune."}
    ],
    temperature=0.7,
    max_tokens=2048,
)
print(completion.choices[0].message.content)
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
import torch

model_id = "tellang/yeji-4b-rslora-v8-AWQ"

# AWQ settings (optional for a pre-quantized checkpoint, since the config is
# read from the model; shown here to pin the GEMM kernel version)
quantization_config = AwqConfig(
    bits=4,
    group_size=128,
    version="gemm",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype=torch.float16,  # AWQ kernels compute in fp16
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an AI that tells fortunes."},
    {"role": "user", "content": "Tell me today's love fortune."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.7,
    do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
| Model | Model size | VRAM (inference) | Reduction |
|---|---|---|---|
| Full Precision | ~8 GB | ~12 GB | - |
| AWQ W4A16 | ~1.5 GB | ~3-4 GB | 81% ⬇️ |
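The reduction is consistent with back-of-the-envelope math. The sketch below assumes 4e9 parameters and one fp16 scale per 128-weight group; the actual checkpoint size also depends on embeddings, unquantized layers, and file metadata, which this ignores.

```python
# Back-of-the-envelope memory math for W4A16 with group_size=128.
# Assumes 4e9 parameters; ignores embeddings, lm_head, and metadata.
params = 4e9
GiB = 1024 ** 3

fp16_bytes = params * 2              # 16 bits per weight
bits_per_weight = 4 + 16 / 128       # int4 + one fp16 scale per 128 weights
awq_bytes = params * bits_per_weight / 8

print(round(fp16_bytes / GiB, 1))    # ~7.5 GiB full precision
print(round(awq_bytes / GiB, 1))     # ~1.9 GiB quantized
print(round(fp16_bytes / awq_bytes, 2))  # ~3.88x compression
```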
| Model | Throughput | Latency (P50) | Latency (P99) |
|---|---|---|---|
| Full Precision | 28 tok/s | 1.2s | 1.8s |
| AWQ W4A16 | 35 tok/s | 1.0s | 1.5s |
Measured in the same test environment:
| Metric | Full Precision | AWQ W4A16 | Delta |
|---|---|---|---|
| JSON parsing success rate | 99.8% | 99.6% | -0.2% |
| Schema validation success rate | 99.5% | 99.3% | -0.2% |
| Average response length | 1,200 tokens | 1,195 tokens | -0.4% |
# Install llmcompressor
pip install "llmcompressor>=0.9.0"

# Quantization script
python - <<'PY'
from llmcompressor.transformers import oneshot

MODEL_ID = 'tellang/yeji-4b-rslora-v8'
OUTPUT_DIR = './yeji-4b-rslora-v8-AWQ'

# AWQ recipe (group_size requires the "group" strategy)
recipe = '''
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ['lm_head']
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: int
            symmetric: true
            strategy: group
            group_size: 128
          targets: ['Linear']
'''

# Run one-shot quantization
oneshot(
    model=MODEL_ID,
    dataset='open_platypus',
    recipe=recipe,
    output_dir=OUTPUT_DIR,
    max_seq_length=4096,
    num_calibration_samples=512,
)
PY
# Can run on low-end GPUs (4 GB VRAM)
vllm serve tellang/yeji-4b-rslora-v8-AWQ \
--gpu-memory-utilization 0.8 \
--max-model-len 2048
# Run two instances concurrently on a single GPU
# Instance 1
vllm serve tellang/yeji-4b-rslora-v8-AWQ \
--port 8001 --tensor-parallel-size 1
# Instance 2
vllm serve tellang/yeji-4b-rslora-v8-AWQ \
--port 8002 --tensor-parallel-size 1
# Quick test in a local development environment
from transformers import pipeline
pipe = pipeline(
"text-generation",
model="tellang/yeji-4b-rslora-v8-AWQ",
device_map="auto",
)
result = pipe("Today's fortune:", max_new_tokens=100)
print(result[0]["generated_text"])
| Model | Description | Size | Status |
|---|---|---|---|
| yeji-4b-rslora-v8 | Full precision (previous version) | ~8 GB | ⚠️ Deprecated |
| yeji-4b-rslora-v8-AWQ | Current model (AWQ W4A16) | ~1.5 GB | ✅ Active |
| yeji-4b-rslora-v8-AWQ-fixed | vLLM-compatibility fix | ~1.5 GB | 🔄 Migration |
| yeji-4b-rslora-v8.1 | Latest full precision | ~8 GB | ✅ Recommended |
Migration recommendation: yeji-4b-rslora-v8.1 (latest full precision)

Symptom: segmentation fault when loading with vLLM
Cause: bug in compressed-tensors
Fix: use yeji-4b-rslora-v8-AWQ-fixed
# Make sure transformers is recent enough
pip install "transformers>=4.50.0" "accelerate>=0.20.0"
# Lower GPU memory utilization
vllm serve tellang/yeji-4b-rslora-v8-AWQ \
--gpu-memory-utilization 0.6 \
--max-model-len 2048
Apache-2.0 License
Base Model License: Qwen3-4B-Instruct (Tongyi Qianwen LICENSE)
@misc{yeji-4b-rslora-v8-awq,
title={Yeji-4B-rsLoRA-v8-AWQ: Lightweight Korean Fortune-telling Model},
author={SSAFY YEJI Team},
year={2025},
publisher={HuggingFace},
url={https://huggingface.co/tellang/yeji-4b-rslora-v8-AWQ}
}
Last Updated: 2025-02-01 | Model Version: v8-AWQ | Status: ✅ Production Ready (deprecated; migrate to v8.1-AWQ when available)