Yeji-4B-rsLoRA-v8-AWQ-fixed 🔮⚡🔧

vLLM-compatible fixed release: an AWQ model with the segfault resolved

License: Apache-2.0 · Base: Yeji-4B-v8 · Quantization: AWQ · vLLM: 0.13.0+

🎯 Overview

Yeji-4B-rsLoRA-v8-AWQ-fixed is an AWQ-quantized model that resolves a compatibility problem with vLLM 0.13.0+. It fixes the compressed-tensors segfault bug, enabling stable production deployment.

Key features

  • ✅ vLLM segfault resolved: quantized with llmcompressor 0.9.0+
  • ⚡ Lightweight: ~1.5GB (an 81% reduction from the ~8GB original)
  • 🚀 Fast inference: AWQ W4A16 symmetric quantization
  • 💾 Low VRAM: inference fits in ~3-4GB
  • 🔧 Production-verified: fully compatible with vLLM 0.13.0

📊 Model information

| Attribute | Value |
|---|---|
| Base model | tellang/yeji-4b-rslora-v8 |
| Quantization method | AWQ W4A16 symmetric |
| Quantization tool | llmcompressor 0.9.0+ |
| Model size | ~1.5GB |
| VRAM requirement | ~3-4GB |
| vLLM version | 0.13.0+ |
| License | Apache-2.0 |

🔧 What was fixed

Problems with the previous model (yeji-4b-rslora-v8-AWQ)

  1. Segmentation fault: crash when loading in vLLM

    Segmentation fault (core dumped)

  2. Cause: a compressed-tensors 0.x bug

    • Missing compressed_tensors_format field
    • Incompatible with vLLM 0.13.0
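A quick way to tell whether a given checkpoint is affected is to inspect its config.json for the field. A minimal sketch; the exact placement of the field inside quantization_config, and the example values, are assumptions and may differ between compressed-tensors versions:

```python
def has_ct_format(config: dict) -> bool:
    """Check for the compressed_tensors_format field that newer vLLM expects."""
    return "compressed_tensors_format" in config.get("quantization_config", {})

# In practice, pass json.load(open("config.json")). The dicts below mimic a
# broken export vs. a fixed one (illustrative only).
broken = {"quantization_config": {"quant_method": "compressed-tensors"}}
fixed = {"quantization_config": {"quant_method": "compressed-tensors",
                                 "compressed_tensors_format": "pack-quantized"}}
print(has_ct_format(broken), has_ct_format(fixed))  # False True
```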

Fixes

  1. Quantized with llmcompressor 0.9.0+: includes the compressed-tensors bug fix
  2. Symmetric quantization: better vLLM compatibility
  3. Verified: confirmed stable in a production environment
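To illustrate what "symmetric" means here: weights in each group share one scale derived from the group's absolute maximum, with no zero-point. A toy NumPy round-trip (a sketch only; llmcompressor's real kernels, weight packing, and group size of 128 differ):

```python
import numpy as np

def quantize_sym_int4(w: np.ndarray, group_size: int = 4):
    """Symmetric per-group int4 quantization: one scale per group, no zero-point."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # map |max| to the int4 extreme
    scale[scale == 0] = 1.0                                  # avoid division by zero
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

w = np.array([0.5, -0.2, 0.1, 0.7, -1.0, 0.3, 0.0, 0.9])
q, scale = quantize_sym_int4(w)
dequantized = (q * scale).reshape(-1)  # reconstruction error stays small
```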

🚀 Installation and usage

1. Environment setup

# Python 3.11+ is recommended
pip install "vllm>=0.13.0"

2. Serving with vLLM (recommended)

vllm serve tellang/yeji-4b-rslora-v8-AWQ-fixed \
    --host 0.0.0.0 \
    --port 8001 \
    --dtype auto \
    --quantization awq \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9

Example startup log on success:

INFO 01-15 12:00:00 llm_engine.py:98] Initializing an LLM engine
INFO 01-15 12:00:01 weight_utils.py:193] Using model weights format awq
INFO 01-15 12:00:05 model_runner.py:146] Loading model weights took 1.2 GB
INFO 01-15 12:00:06 gpu_executor.py:83] # GPU blocks: 8192, # CPU blocks: 2048
INFO 01-15 12:00:06 api_server.py:210] vLLM API server started at http://0.0.0.0:8001

3. Calling the OpenAI-compatible API

import openai

client = openai.OpenAI(
    base_url="http://localhost:8001/v1",
    api_key="EMPTY",
)

completion = client.chat.completions.create(
    model="tellang/yeji-4b-rslora-v8-AWQ-fixed",
    messages=[
        {"role": "system", "content": "You are an AI that tells fortunes."},
        {"role": "user", "content": "Tell me today's fortune."}
    ],
    temperature=0.7,
    max_tokens=2048,
)

print(completion.choices[0].message.content)

4. Direct inference with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
import torch

model_id = "tellang/yeji-4b-rslora-v8-AWQ-fixed"

# AWQ settings
quantization_config = AwqConfig(
    bits=4,
    group_size=128,
    version="gemm",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an AI that tells fortunes."},
    {"role": "user", "content": "Tell me today's love fortune."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.7,
    do_sample=True,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

📈 Performance

vLLM stability

| Metric | Previous AWQ | Fixed version |
|---|---|---|
| Loading success rate | ❌ Segfault | ✅ 100% |
| vLLM 0.13.0 compatibility | ❌ No | ✅ Full |
| compressed-tensors | ⚠️ Buggy | ✅ Fixed |

Inference speed (vLLM)

| Batch size | Throughput | Latency (P50) | Latency (P99) |
|---|---|---|---|
| 1 | 35 tok/s | 1.0s | 1.5s |
| 4 | 110 tok/s | 1.3s | 2.0s |
| 8 | 180 tok/s | 1.7s | 2.8s |
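The aggregate numbers above can be broken down per request by dividing by the batch size (plain arithmetic on the table): total throughput grows with batching while per-request speed drops.

```python
# Batch size -> aggregate throughput in tok/s, taken from the table above.
aggregate = {1: 35, 4: 110, 8: 180}

# Per-request throughput = aggregate throughput / batch size.
per_request = {batch: tok_s / batch for batch, tok_s in aggregate.items()}
print(per_request)  # {1: 35.0, 4: 27.5, 8: 22.5}
```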

Test environment:

  • GPU: NVIDIA A100 (40GB)
  • vLLM: 0.13.0
  • Max model len: 4096
  • GPU memory utilization: 0.9

Memory usage

| Batch size | VRAM usage | GPU memory utilization |
|---|---|---|
| 1 | 3.2GB | 32% |
| 4 | 5.8GB | 58% |
| 8 | 8.5GB | 85% |

🛠️ Reproduction

AWQ quantization script

from llmcompressor.transformers import oneshot
from transformers import AutoTokenizer

MODEL_ID = "tellang/yeji-4b-rslora-v8"
OUTPUT_DIR = "./yeji-4b-rslora-v8-AWQ-fixed"

# AWQ recipe (llmcompressor 0.9.0+)
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 4
                        type: int
                        symmetric: true
                        group_size: 128
                        strategy: group
                    targets: ["Linear"]
"""

# Run quantization
oneshot(
    model=MODEL_ID,
    dataset="open_platypus",
    recipe=recipe,
    output_dir=OUTPUT_DIR,
    max_seq_length=4096,
    num_calibration_samples=512,
)

print(f"Quantized model saved to {OUTPUT_DIR}")

Verification script

# vLLM compatibility test
python -c "
from vllm import LLM, SamplingParams

llm = LLM(
    model='tellang/yeji-4b-rslora-v8-AWQ-fixed',
    quantization='awq',
    max_model_len=4096,
)

outputs = llm.generate(['Today\'s fortune:'], SamplingParams(max_tokens=100))
print(outputs[0].outputs[0].text)
"

🔗 Related models

| Model | Description | Size | Segfault | Status |
|---|---|---|---|---|
| yeji-4b-rslora-v8 | Full precision (old) | ~8GB | - | ⚠️ Deprecated |
| yeji-4b-rslora-v8-AWQ | AWQ (buggy) | ~1.5GB | ❌ Yes | 🔄 Migrate |
| yeji-4b-rslora-v8-AWQ-fixed | This model (fixed) | ~1.5GB | ✅ Fixed | ✅ Active |
| yeji-4b-rslora-v8.1 | Latest full precision | ~8GB | - | ✅ Recommended |

๋งˆ์ด๊ทธ๋ ˆ์ด์…˜ ๊ฐ€์ด๋“œ:

  • ๊ธฐ์กด AWQ ์‚ฌ์šฉ์ž: yeji-4b-rslora-v8-AWQ โ†’ ์ด ๋ชจ๋ธ๋กœ ๊ต์ฒด
  • ํ”„๋กœ๋•์…˜: yeji-4b-rslora-v8.1 (์ตœ์‹  full precision) ๊ถŒ์žฅ
  • ๊ฒฝ๋Ÿ‰ ๋ฐฐํฌ: ์ด ๋ชจ๋ธ ์‚ฌ์šฉ

🛠️ Troubleshooting

1. Check the vLLM version

# vLLM 0.13.0+ is required
pip install --upgrade "vllm>=0.13.0"
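When in doubt about which version is actually importable, a small guard using only the standard library can be run before serving (a sketch; non-numeric version suffixes are stripped before comparison):

```python
from importlib.metadata import PackageNotFoundError, version

def parse_version(s: str) -> tuple:
    """'0.13.0' -> (0, 13, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in s.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def vllm_is_new_enough(minimum: tuple = (0, 13, 0)) -> bool:
    """True only if an importable vLLM meets the minimum version."""
    try:
        return parse_version(version("vllm")) >= minimum
    except PackageNotFoundError:
        return False
```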

2. OOM errors

# Lower the GPU memory utilization
vllm serve tellang/yeji-4b-rslora-v8-AWQ-fixed \
    --gpu-memory-utilization 0.7 \
    --max-model-len 2048

3. Quantization fails to load

# Check the transformers / accelerate versions
pip install --upgrade "transformers>=4.50.0" "accelerate>=0.20.0"

📜 License

Apache-2.0 License

Base model license: Qwen3-4B-Instruct (Tongyi Qianwen LICENSE)


🙏 Acknowledgments

  • Base model: Qwen Team, for Qwen3-4B-Instruct
  • Quantization: llmcompressor v0.9.0+
  • Inference: vLLM v0.13.0
  • Bug fix: vLLM Team, for the compressed-tensors fix

📧 Contact

  • Team: SSAFY YEJI Team
  • Issues: GitHub Issues
  • Email: [project email]

📊 Citation

@misc{yeji-4b-rslora-v8-awq-fixed,
  title={Yeji-4B-rsLoRA-v8-AWQ-fixed: vLLM-Compatible AWQ Model},
  author={SSAFY YEJI Team},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/tellang/yeji-4b-rslora-v8-AWQ-fixed}
}

Last Updated: 2025-02-01 · Model Version: v8-AWQ-fixed · Status: ✅ Production Ready (stable on vLLM 0.13.0+)
