Yeji-4B-rsLoRA-v8-AWQ 🔮⚡

Lightweight fortune-telling model: Yeji-4B-rsLoRA-v8 with AWQ W4A16 quantization

License: Apache-2.0 Base: Yeji-4B-v8 Quantization: AWQ

🎯 Overview

Yeji-4B-rsLoRA-v8-AWQ is the AWQ W4A16 quantized version of **yeji-4b-rslora-v8**. 4-bit weight quantization cuts memory usage to roughly a quarter of the original while maintaining high accuracy.

Key Features

  • ⚡ Lightweight: ~1.5GB (81% smaller than the ~8GB original)
  • 🚀 Fast inference: AWQ symmetric quantization
  • 💾 Low VRAM: runs inference in ~3-4GB
  • ✅ vLLM compatible: verified for production deployment
  • 🎴 Eastern/Western fortune-telling: all features retained

📊 Model Information

| Attribute | Value |
|---|---|
| Base model | tellang/yeji-4b-rslora-v8 |
| Quantization scheme | AWQ W4A16 Symmetric |
| Quantization tool | llmcompressor 0.9.0+ |
| Model size | ~1.5GB |
| VRAM requirement | ~3-4GB |
| Accuracy loss | < 2% |
| License | Apache-2.0 |

AWQ (Activation-aware Weight Quantization)

AWQ is a weight-quantization technique that takes activation distributions into account:

  • W4A16: 4-bit weights, 16-bit activations
  • Symmetric: symmetric quantization around zero (int4 range -8 to 7)
  • Channel-wise: each channel is quantized independently
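To make the symmetric option concrete, here is a toy per-channel int4 round-trip. This is illustrative only: real AWQ additionally rescales salient channels using activation statistics before quantizing, which this sketch omits.

```python
# Minimal sketch of symmetric int4 quantization for one weight channel.
# Illustrative only -- AWQ proper also applies activation-aware scaling.

def quantize_symmetric_int4(weights):
    """Map floats to integers in [-8, 7] with a single per-channel scale."""
    scale = max(abs(w) for w in weights) / 7  # 7 = largest positive int4 value
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

channel = [0.31, -0.70, 0.06, 0.22, -0.14]
q, scale = quantize_symmetric_int4(channel)
recovered = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(channel, recovered))

print(q)        # integers in [-8, 7]
print(max_err)  # reconstruction error bounded by scale / 2
```

The per-channel scale is what keeps accuracy loss small: each channel uses the full int4 range regardless of its magnitude.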

Advantages:

  • ✅ High accuracy retained (< 2% loss)
  • ✅ Fast inference
  • ✅ Low memory usage

🔧 Installation and Usage

1. Environment Setup

# Python 3.11+ recommended
pip install "vllm>=0.13.0"  # includes AWQ support

2. Serving with vLLM (Recommended)

vllm serve tellang/yeji-4b-rslora-v8-AWQ \
    --host 0.0.0.0 \
    --port 8001 \
    --dtype auto \
    --quantization awq \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9

OpenAI-compatible API call:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8001/v1",
    api_key="EMPTY",
)

completion = client.chat.completions.create(
    model="tellang/yeji-4b-rslora-v8-AWQ",
    messages=[
        {"role": "system", "content": "당신은 운세를 알려주는 AI입니다."},  # "You are a fortune-telling AI."
        {"role": "user", "content": "오늘의 운세를 알려주세요."}  # "Tell me today's fortune."
    ],
    temperature=0.7,
    max_tokens=2048,
)

print(completion.choices[0].message.content)

3. Direct Inference with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tellang/yeji-4b-rslora-v8-AWQ"

# The quantization config is stored in the checkpoint, so no explicit
# quantization_config is needed when loading the pre-quantized model.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "당신은 운세를 알려주는 AI입니다."},  # "You are a fortune-telling AI."
    {"role": "user", "content": "오늘의 연애운을 알려주세요."}  # "Tell me today's love fortune."
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.7,
    do_sample=True,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

📈 Performance Comparison

Memory Usage

๋ชจ๋ธ ๋ชจ๋ธ ํฌ๊ธฐ VRAM (์ถ”๋ก ) ์ ˆ๊ฐ์œจ
Full Precision ~8GB ~12GB -
AWQ W4A16 ~1.5GB ~3-4GB 81% โฌ‡๏ธ
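The 81% figure follows directly from the two checkpoint sizes in the table; a quick arithmetic check:

```python
# Sanity check of the reported memory reduction (sizes from the table above).
full_gb = 8.0   # full-precision checkpoint, ~8 GB
awq_gb = 1.5    # AWQ W4A16 checkpoint, ~1.5 GB

reduction = (full_gb - awq_gb) / full_gb
print(f"{reduction:.0%}")  # -> 81%
```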

Inference Speed (vLLM)

๋ชจ๋ธ Throughput Latency (P50) Latency (P99)
Full Precision 28 tok/s 1.2s 1.8s
AWQ W4A16 35 tok/s 1.0s 1.5s

Test environment:

  • GPU: NVIDIA A100 (40GB)
  • vLLM: 0.13.0
  • Batch size: 1
  • Max model len: 4096

์ •ํ™•๋„

| Metric | Full Precision | AWQ W4A16 | Delta |
|---|---|---|---|
| JSON parsing success rate | 99.8% | 99.6% | -0.2% |
| Schema validation success rate | 99.5% | 99.3% | -0.2% |
| Average response length | 1,200 tokens | 1,195 tokens | -0.4% |
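The JSON metrics above are simple to compute from model outputs; a hypothetical harness is sketched below. The `REQUIRED_KEYS` schema and sample outputs are illustrative, not the model's actual output format.

```python
import json

# Hypothetical harness for the "JSON parsing" / "schema validation" metrics.
# REQUIRED_KEYS is an assumption for illustration; the real schema differs.
REQUIRED_KEYS = {"fortune", "score"}

def json_parse_ok(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def schema_ok(text):
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

outputs = [
    '{"fortune": "good day ahead", "score": 87}',  # valid
    '{"fortune": "missing score"}',                # parses, fails schema
    '{"fortune": truncated',                       # parse failure
]

parse_rate = sum(map(json_parse_ok, outputs)) / len(outputs)
schema_rate = sum(map(schema_ok, outputs)) / len(outputs)
print(f"parse: {parse_rate:.1%}, schema: {schema_rate:.1%}")  # parse: 66.7%, schema: 33.3%
```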

🛠️ Quantization Method

Reproduction

# Install llmcompressor
pip install "llmcompressor>=0.9.0"

# Quantization script (save as quantize.py and run with: python quantize.py)
from llmcompressor.transformers import oneshot

MODEL_ID = 'tellang/yeji-4b-rslora-v8'
OUTPUT_DIR = './yeji-4b-rslora-v8-AWQ'

# AWQ recipe
recipe = '''
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ['lm_head']
            config_groups:
                group_0:
                    weights:
                        num_bits: 4
                        type: int
                        symmetric: true
                        group_size: 128
                        strategy: channel
                    targets: ['Linear']
'''

# Run quantization
oneshot(
    model=MODEL_ID,
    dataset='open_platypus',
    recipe=recipe,
    output_dir=OUTPUT_DIR,
    max_seq_length=4096,
    num_calibration_samples=512,
)

์ฃผ์˜์‚ฌํ•ญ

  1. Segfault ์ˆ˜์ •: llmcompressor 0.9.0+์—์„œ compressed-tensors Segfault ํ•ด๊ฒฐ
  2. Symmetric ์–‘์žํ™”: Asymmetric๋ณด๋‹ค vLLM ํ˜ธํ™˜์„ฑ ์šฐ์ˆ˜
  3. Calibration ๋ฐ์ดํ„ฐ: 512 ์ƒ˜ํ”Œ๋กœ ์ถฉ๋ถ„

🚀 Use Cases

1. Edge Device Deployment

# Runs on a low-end GPU (4GB VRAM)
vllm serve tellang/yeji-4b-rslora-v8-AWQ \
    --gpu-memory-utilization 0.8 \
    --max-model-len 2048

2. Multi-Instance Serving

# ๋‹จ์ผ GPU์—์„œ 2๊ฐœ ์ธ์Šคํ„ด์Šค ๋™์‹œ ์‹คํ–‰
# Instance 1
vllm serve tellang/yeji-4b-rslora-v8-AWQ \
    --port 8001 --tensor-parallel-size 1

# Instance 2
vllm serve tellang/yeji-4b-rslora-v8-AWQ \
    --port 8002 --tensor-parallel-size 1
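Client-side, requests can be spread across the two instances with simple round-robin; a minimal sketch (the ports match the commands above, and the chosen URL would be passed as `base_url` to the OpenAI client shown earlier):

```python
from itertools import cycle

# Round-robin across the two vLLM instances started above.
BACKENDS = cycle([
    "http://localhost:8001/v1",
    "http://localhost:8002/v1",
])

def next_backend() -> str:
    """Return the next instance's base URL (use as base_url for openai.OpenAI)."""
    return next(BACKENDS)

picks = [next_backend() for _ in range(4)]
print(picks)  # alternates between ports 8001 and 8002
```

For production traffic, a real load balancer (nginx, HAProxy) would be a more robust choice than this in-process cycle.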

3. Rapid Prototyping

# Quick tests in a local development environment
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="tellang/yeji-4b-rslora-v8-AWQ",
    device_map="auto",
)

result = pipe("오늘의 운세:", max_new_tokens=100)  # prompt: "Today's fortune:"
print(result[0]["generated_text"])

🔗 Related Models

๋ชจ๋ธ ์„ค๋ช… ํฌ๊ธฐ ์ƒํƒœ
yeji-4b-rslora-v8 Full precision (๊ตฌ๋ฒ„์ „) ~8GB โš ๏ธ Deprecated
yeji-4b-rslora-v8-AWQ ํ˜„์žฌ ๋ชจ๋ธ (AWQ W4A16) ~1.5GB โœ… Active
yeji-4b-rslora-v8-AWQ-fixed vLLM ํ˜ธํ™˜ ์ˆ˜์ •ํŒ ~1.5GB ๐Ÿ”„ Migration
yeji-4b-rslora-v8.1 ์ตœ์‹  Full precision ~8GB โœ… Recommended

๋งˆ์ด๊ทธ๋ ˆ์ด์…˜ ๊ถŒ์žฅ:

  • ํ”„๋กœ๋•์…˜: yeji-4b-rslora-v8.1 (์ตœ์‹  full precision)
  • ๊ฒฝ๋Ÿ‰ ๋ฐฐํฌ: ์ด ๋ชจ๋ธ (v8 ๊ธฐ๋ฐ˜ AWQ)
  • ํ–ฅํ›„: v8.1 ๊ธฐ๋ฐ˜ AWQ ๋ฒ„์ „ ์ถœ์‹œ ์˜ˆ์ •

🛠️ Troubleshooting

1. vLLM Segfault

Symptom: segmentation fault when loading in vLLM

Cause: compressed-tensors bug

Fix: use yeji-4b-rslora-v8-AWQ-fixed

2. Quantization Errors

# Check the transformers version
pip install "transformers>=4.50.0" "accelerate>=0.20.0"

3. OOM Errors

# Lower the GPU memory utilization
vllm serve tellang/yeji-4b-rslora-v8-AWQ \
    --gpu-memory-utilization 0.6 \
    --max-model-len 2048
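Why lowering `--gpu-memory-utilization` helps: vLLM pre-allocates that fraction of total VRAM for weights plus KV cache, so the remaining KV-cache budget can be estimated with back-of-envelope arithmetic (a sketch using the ~1.5 GB weight size reported above; real overheads such as activations and CUDA context are ignored here):

```python
# Rough KV-cache budget after loading the ~1.5 GB AWQ weights.
# vLLM reserves roughly (total VRAM) * (--gpu-memory-utilization) in total.
def kv_cache_budget_gb(total_vram_gb: float, utilization: float,
                       weights_gb: float = 1.5) -> float:
    return total_vram_gb * utilization - weights_gb

print(round(kv_cache_budget_gb(8.0, 0.6), 2))  # 8 GB GPU -> ~3.3 GB for KV cache
print(round(kv_cache_budget_gb(4.0, 0.6), 2))  # 4 GB GPU -> ~0.9 GB, hence the shorter --max-model-len
```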

📜 License

Apache-2.0 License

Base Model License: Qwen3-4B-Instruct (Tongyi Qianwen LICENSE)



📧 Contact

  • Team: SSAFY YEJI Team
  • Issues: GitHub Issues
  • Email: [project email]

📊 Citation

@misc{yeji-4b-rslora-v8-awq,
  title={Yeji-4B-rsLoRA-v8-AWQ: Lightweight Korean Fortune-telling Model},
  author={SSAFY YEJI Team},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/tellang/yeji-4b-rslora-v8-AWQ}
}

Last Updated: 2025-02-01 Model Version: v8-AWQ Status: ✅ Production Ready (Deprecated, migrate to v8.1-AWQ when available)
