AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Paper: arXiv 2306.00978
# vLLM Compatibility Fix: AWQ Model with the Segfault Resolved

Yeji-4B-rsLoRA-v8-AWQ-fixed is an AWQ-quantized model that resolves a vLLM 0.13.0+ compatibility issue. It fixes the compressed-tensors segfault bug, so the model loads reliably for production deployment.
| Property | Value |
|---|---|
| Base model | tellang/yeji-4b-rslora-v8 |
| Quantization scheme | AWQ W4A16, symmetric |
| Quantization tool | llmcompressor 0.9.0+ |
| Model size | ~1.5 GB |
| VRAM required | ~3-4 GB |
| vLLM version | 0.13.0+ |
| License | Apache-2.0 |
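As an illustration of what "W4A16, symmetric, group size 128" means for the weight grid: each group of 128 weights shares one floating-point scale, and weights are rounded to 4-bit integer codes. This is a minimal NumPy sketch of round-to-nearest group quantization only; real AWQ additionally rescales salient channels using activation statistics (the "activation-aware" part), which this sketch omits.

```python
import numpy as np

def quantize_w4_symmetric(w: np.ndarray, group_size: int = 128) -> np.ndarray:
    """Round-to-nearest symmetric int4 quantization with one scale per
    group of `group_size` weights, then dequantize back for comparison."""
    groups = w.reshape(-1, group_size)
    # One scale per group; symmetric int4 codes live in [-8, 7], and we
    # map the group's max magnitude onto +/-7 for a symmetric grid.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                           # avoid divide-by-zero
    codes = np.clip(np.round(groups / scales), -8, 7)   # 4-bit integer codes
    return (codes * scales).reshape(w.shape)            # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
wq = quantize_w4_symmetric(w)
max_err = float(np.abs(w - wq).max())
print(f"max abs quantization error: {max_err:.4f}")
```

The per-weight error is bounded by half a quantization step (the group's scale divided by two), which is why outlier channels matter so much: one large weight inflates the scale for its whole group.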
## Problem: Segmentation Fault When Loading in vLLM

```
Segmentation fault (core dumped)
```
**Cause:** a bug in compressed-tensors 0.x (the `compressed_tensors_format` field was missing from the model config).

## Serving with vLLM

```bash
# Python 3.11+ recommended
pip install "vllm>=0.13.0"

vllm serve tellang/yeji-4b-rslora-v8-AWQ-fixed \
  --host 0.0.0.0 \
  --port 8001 \
  --dtype auto \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9
```
Example of a successful startup log:

```
INFO 01-15 12:00:00 llm_engine.py:98] Initializing an LLM engine
INFO 01-15 12:00:01 weight_utils.py:193] Using model weights format awq
INFO 01-15 12:00:05 model_runner.py:146] Loading model weights took 1.2 GB
INFO 01-15 12:00:06 gpu_executor.py:83] # GPU blocks: 8192, # CPU blocks: 2048
INFO 01-15 12:00:06 api_server.py:210] vLLM API server started at http://0.0.0.0:8001
```
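The `# GPU blocks: 8192` log line bounds how many tokens of KV cache fit on the GPU. Assuming vLLM's default paged-attention block size of 16 tokens (an assumption; it is configurable with `--block-size`), the capacity works out as:

```python
# Back-of-the-envelope KV-cache capacity from the startup log above.
gpu_blocks = 8192        # from the "# GPU blocks: 8192" log line
block_size = 16          # vLLM default tokens per block (--block-size)
max_model_len = 4096     # from the serve command above

kv_token_capacity = gpu_blocks * block_size
concurrent_full_sequences = kv_token_capacity // max_model_len

print(f"KV cache capacity: {kv_token_capacity} tokens")
print(f"~{concurrent_full_sequences} concurrent sequences at max length")
```

So at these settings the server can hold roughly 32 maximum-length sequences in the KV cache at once; shorter requests allow proportionally more concurrency.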
Query the server with the OpenAI-compatible client:

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8001/v1",
    api_key="EMPTY",  # vLLM does not check the key by default
)

completion = client.chat.completions.create(
    model="tellang/yeji-4b-rslora-v8-AWQ-fixed",
    messages=[
        {"role": "system", "content": "You are an AI that tells fortunes."},
        {"role": "user", "content": "Tell me today's fortune."},
    ],
    temperature=0.7,
    max_tokens=2048,
)
print(completion.choices[0].message.content)
```
Loading directly with transformers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "tellang/yeji-4b-rslora-v8-AWQ-fixed"

# AWQ settings
quantization_config = AwqConfig(
    bits=4,
    group_size=128,
    version="gemm",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype=torch.float16,  # AWQ GEMM kernels run in fp16
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an AI that tells fortunes."},
    {"role": "user", "content": "Tell me today's love fortune."},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.7,
    do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
| Metric | Previous AWQ | fixed version |
|---|---|---|
| Loading success rate | ❌ Segfault | ✅ 100% |
| vLLM 0.13.0 compatibility | ❌ No | ✅ Yes |
| compressed-tensors | ⚠️ Bug | ✅ Fixed |
| Batch size | Throughput | Latency (P50) | Latency (P99) |
|---|---|---|---|
| 1 | 35 tok/s | 1.0s | 1.5s |
| 4 | 110 tok/s | 1.3s | 2.0s |
| 8 | 180 tok/s | 1.7s | 2.8s |
Test environment:
| Batch size | VRAM usage | GPU memory utilization |
|---|---|---|
| 1 | 3.2 GB | 32% |
| 4 | 5.8 GB | 58% |
| 8 | 8.5 GB | 85% |
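One way to read the throughput table: aggregate tok/s grows sub-linearly with batch size, so each individual request gets slower as batching increases. A small sketch using the table's numbers:

```python
# Aggregate throughput per batch size, taken from the benchmark table above.
throughput = {1: 35, 4: 110, 8: 180}   # batch size -> total tok/s

per_request = {b: t / b for b, t in throughput.items()}
for b, tps in per_request.items():
    print(f"batch {b}: {tps:.1f} tok/s per request")
```

Per-request decode speed drops from 35 tok/s at batch 1 to 22.5 tok/s at batch 8, while total throughput roughly quintuples, the usual throughput/latency trade-off when sizing a deployment.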
To reproduce the quantization:

```python
from llmcompressor.transformers import oneshot

MODEL_ID = "tellang/yeji-4b-rslora-v8"
OUTPUT_DIR = "./yeji-4b-rslora-v8-AWQ-fixed"

# AWQ recipe (llmcompressor 0.9.0+)
recipe = """
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 128
            strategy: group  # group_size only applies with the group strategy
          targets: ["Linear"]
"""

# Run one-shot quantization with a calibration dataset
oneshot(
    model=MODEL_ID,
    dataset="open_platypus",
    recipe=recipe,
    output_dir=OUTPUT_DIR,
    max_seq_length=4096,
    num_calibration_samples=512,
)
print(f"Quantized model saved to {OUTPUT_DIR}")
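Before spending GPU time on calibration, the recipe string can be parsed and sanity-checked. This is a hedged sketch using PyYAML; note that in llmcompressor's quantization config, `group_size` takes effect with `strategy: group`, which is what the check below assumes:

```python
import yaml  # PyYAML

# The same recipe string that is passed to oneshot() above.
recipe = """
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 128
            strategy: group
          targets: ["Linear"]
"""

cfg = yaml.safe_load(recipe)
weights = cfg["quant_stage"]["quant_modifiers"]["QuantizationModifier"][
    "config_groups"]["group_0"]["weights"]

# Catch recipe typos (wrong bit width, missing group size) before calibrating.
assert weights["num_bits"] == 4 and weights["symmetric"] is True
assert weights["group_size"] == 128 and weights["strategy"] == "group"
print("recipe OK:", weights)
```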
```bash
# vLLM compatibility test
python -c "
from vllm import LLM, SamplingParams

llm = LLM(
    model='tellang/yeji-4b-rslora-v8-AWQ-fixed',
    quantization='awq',
    max_model_len=4096,
)
outputs = llm.generate(['Fortune for today:'], SamplingParams(max_tokens=100))
print(outputs[0].outputs[0].text)
"
```
| Model | Description | Size | Segfault | Status |
|---|---|---|---|---|
| yeji-4b-rslora-v8 | Full precision (old version) | ~8 GB | - | ⚠️ Deprecated |
| yeji-4b-rslora-v8-AWQ | AWQ (bug present) | ~1.5 GB | ❌ Yes | 🔄 Migrate |
| yeji-4b-rslora-v8-AWQ-fixed | This model (fixed) | ~1.5 GB | ✅ Fixed | ✅ Active |
| yeji-4b-rslora-v8.1 | Latest full precision | ~8 GB | - | ✅ Recommended |
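From the size column above, the AWQ build is roughly a 5x size reduction over full precision; a quick back-of-the-envelope check:

```python
full_gb = 8.0   # full-precision size from the table (~8 GB)
awq_gb = 1.5    # AWQ size from the table (~1.5 GB)

ratio = full_gb / awq_gb
print(f"compression ratio: ~{ratio:.1f}x")
```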
Migration guide:

- `yeji-4b-rslora-v8-AWQ`: replace it with this model
- For full precision, `yeji-4b-rslora-v8.1` (the latest full-precision release) is recommended

Troubleshooting:

```bash
# vLLM 0.13.0+ is required
pip install --upgrade "vllm>=0.13.0"

# Lower the GPU memory utilization if you run out of memory
vllm serve tellang/yeji-4b-rslora-v8-AWQ-fixed \
  --gpu-memory-utilization 0.7 \
  --max-model-len 2048

# Check the transformers version
pip install --upgrade "transformers>=4.50.0" "accelerate>=0.20.0"
```
Apache-2.0 License
Base Model License: Qwen3-4B-Instruct (Tongyi Qianwen LICENSE)
```bibtex
@misc{yeji-4b-rslora-v8-awq-fixed,
  title={Yeji-4B-rsLoRA-v8-AWQ-fixed: vLLM-Compatible AWQ Model},
  author={SSAFY YEJI Team},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/tellang/yeji-4b-rslora-v8-AWQ-fixed}
}
```
Last Updated: 2025-02-01 | Model Version: v8-AWQ-fixed | Status: ✅ Production Ready (stable on vLLM 0.13.0+)