AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (paper: arXiv 2306.00978)
Lightweight Korean fortune-telling model - Yeji-4B-rsLoRA-v8, AWQ W4A16 quantized
Yeji-4B-rsLoRA-v8-AWQ is the AWQ W4A16 quantized version of **yeji-4b-rslora-v8**. 4-bit weight quantization cuts memory use to roughly a quarter while keeping the accuracy loss under 2%.
| Property | Value |
|---|---|
| Base model | tellang/yeji-4b-rslora-v8 |
| Quantization scheme | AWQ W4A16, symmetric |
| Quantization tool | llmcompressor 0.9.0+ |
| Model size | ~1.5 GB |
| VRAM requirement | ~3-4 GB |
| Accuracy loss | < 2% |
| License | Apache-2.0 |
AWQ is a weight-quantization technique that takes activation distributions into account: the few weight channels that are multiplied by large activations matter most for output quality, so AWQ rescales those channels before quantizing to 4 bits instead of treating all channels equally.
Advantages:
- Post-training: no retraining or backpropagation required, only a small calibration set
- Roughly 4x smaller weights with minimal accuracy loss
- W4A16 (4-bit weights, 16-bit activations) runs on well-supported GEMM kernels, e.g. in vLLM
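The intuition can be sketched in a few lines of pure Python. This is a toy illustration, not the paper's method: the numbers are made up, and the scaling exponent is fixed at 0.5 whereas real AWQ searches for per-channel scales using calibration data.

```python
# Toy sketch of the AWQ intuition: scale up weight channels that meet large
# activations before int4 quantization, then fold the scale back out.
def quantize_row(weights, act_mags, alpha=0.5):
    # per-channel scale s_i = act_mag_i ** alpha (alpha=0 disables scaling)
    s = [m ** alpha for m in act_mags]
    scaled = [w * si for w, si in zip(weights, s)]
    step = max(abs(v) for v in scaled) / 7  # one group, symmetric int4
    deq = []
    for v, si in zip(scaled, s):
        q = max(-8, min(7, round(v / step)))  # int4 levels are [-8, 7]
        deq.append(q * step / si)             # dequantize, unfold the scale
    return deq

weights  = [0.02, -0.5, 1.2, 0.03, -0.04, 0.9]
act_mags = [9.0, 0.2, 0.1, 8.0, 7.5, 0.3]  # channels 0, 3, 4 are salient

def output_error(deq):
    # activation-weighted weight error: a proxy for the error in W @ x
    return sum(abs(d - w) * m for d, w, m in zip(deq, weights, act_mags))

err_plain = output_error(quantize_row(weights, act_mags, alpha=0.0))
err_awq   = output_error(quantize_row(weights, act_mags, alpha=0.5))
print(err_awq < err_plain)  # scaling protects the salient channels
```

With these toy values, plain round-to-nearest zeroes out the small-but-salient weights (channels 0, 3, 4), while the activation-aware scaling keeps them representable, so the weighted error drops.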
# Python 3.11+ recommended
pip install "vllm>=0.13.0"  # includes AWQ support
vllm serve tellang/yeji-4b-rslora-v8-AWQ \
--host 0.0.0.0 \
--port 8001 \
--dtype auto \
--quantization awq \
--max-model-len 4096 \
--gpu-memory-utilization 0.9
OpenAI-compatible API call:
import openai
client = openai.OpenAI(
    base_url="http://localhost:8001/v1",
    api_key="EMPTY",
)
completion = client.chat.completions.create(
    model="tellang/yeji-4b-rslora-v8-AWQ",
    messages=[
        {"role": "system", "content": "You are an AI that tells fortunes."},
        {"role": "user", "content": "Tell me today's fortune."}
    ],
    temperature=0.7,
    max_tokens=2048,
)
print(completion.choices[0].message.content)
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
import torch

model_id = "tellang/yeji-4b-rslora-v8-AWQ"

# AWQ settings (optional for a pre-quantized checkpoint, since the config is
# read from the model; shown here to pin the GEMM kernel version)
quantization_config = AwqConfig(
    bits=4,
    group_size=128,
    version="gemm",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype=torch.float16,  # AWQ kernels compute in fp16
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an AI that tells fortunes."},
    {"role": "user", "content": "Tell me today's love fortune."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.7,
    do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
| Model | Model size | VRAM (inference) | Reduction |
|---|---|---|---|
| Full Precision | ~8 GB | ~12 GB | - |
| AWQ W4A16 | ~1.5 GB | ~3-4 GB | 81% ⬇️ |
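The reduction is consistent with back-of-the-envelope math. The sketch below assumes 4e9 parameters and one fp16 scale per 128-weight group; the actual checkpoint size also depends on embeddings, unquantized layers, and file metadata, which this ignores.

```python
# Back-of-the-envelope memory math for W4A16 with group_size=128.
# Assumes 4e9 parameters; ignores embeddings, lm_head, and metadata.
params = 4e9
GiB = 1024 ** 3

fp16_bytes = params * 2              # 16 bits per weight
bits_per_weight = 4 + 16 / 128       # int4 + one fp16 scale per 128 weights
awq_bytes = params * bits_per_weight / 8

print(round(fp16_bytes / GiB, 1))    # ~7.5 GiB full precision
print(round(awq_bytes / GiB, 1))     # ~1.9 GiB quantized
print(round(fp16_bytes / awq_bytes, 2))  # ~3.88x compression
```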
| Model | Throughput | Latency (P50) | Latency (P99) |
|---|---|---|---|
| Full Precision | 28 tok/s | 1.2s | 1.8s |
| AWQ W4A16 | 35 tok/s | 1.0s | 1.5s |
Measured in the same test environment:
| Metric | Full Precision | AWQ W4A16 | Delta |
|---|---|---|---|
| JSON parsing success rate | 99.8% | 99.6% | -0.2% |
| Schema validation success rate | 99.5% | 99.3% | -0.2% |
| Average response length | 1,200 tokens | 1,195 tokens | -0.4% |
# Install llmcompressor
pip install "llmcompressor>=0.9.0"

# Quantization script
python - <<'PY'
from llmcompressor.transformers import oneshot

MODEL_ID = 'tellang/yeji-4b-rslora-v8'
OUTPUT_DIR = './yeji-4b-rslora-v8-AWQ'

# AWQ recipe (group_size requires the "group" strategy)
recipe = '''
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ['lm_head']
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: int
            symmetric: true
            strategy: group
            group_size: 128
          targets: ['Linear']
'''

# Run one-shot quantization
oneshot(
    model=MODEL_ID,
    dataset='open_platypus',
    recipe=recipe,
    output_dir=OUTPUT_DIR,
    max_seq_length=4096,
    num_calibration_samples=512,
)
PY
# Can run on low-end GPUs (4 GB VRAM)
vllm serve tellang/yeji-4b-rslora-v8-AWQ \
--gpu-memory-utilization 0.8 \
--max-model-len 2048
# Run two instances concurrently on a single GPU
# Instance 1
vllm serve tellang/yeji-4b-rslora-v8-AWQ \
--port 8001 --tensor-parallel-size 1
# Instance 2
vllm serve tellang/yeji-4b-rslora-v8-AWQ \
--port 8002 --tensor-parallel-size 1
# Quick test in a local development environment
from transformers import pipeline
pipe = pipeline(
"text-generation",
model="tellang/yeji-4b-rslora-v8-AWQ",
device_map="auto",
)
result = pipe("Today's fortune:", max_new_tokens=100)
print(result[0]["generated_text"])
| Model | Description | Size | Status |
|---|---|---|---|
| yeji-4b-rslora-v8 | Full precision (previous version) | ~8 GB | ⚠️ Deprecated |
| yeji-4b-rslora-v8-AWQ | Current model (AWQ W4A16) | ~1.5 GB | ✅ Active |
| yeji-4b-rslora-v8-AWQ-fixed | vLLM-compatibility fix | ~1.5 GB | 🔄 Migration |
| yeji-4b-rslora-v8.1 | Latest full precision | ~8 GB | ✅ Recommended |
Migration recommendation: yeji-4b-rslora-v8.1 (latest full precision)

Symptom: segmentation fault when loading with vLLM
Cause: bug in compressed-tensors
Fix: use yeji-4b-rslora-v8-AWQ-fixed
# Make sure transformers is recent enough
pip install "transformers>=4.50.0" "accelerate>=0.20.0"
# Lower GPU memory utilization
vllm serve tellang/yeji-4b-rslora-v8-AWQ \
--gpu-memory-utilization 0.6 \
--max-model-len 2048
Apache-2.0 License
Base Model License: Qwen3-4B-Instruct (Tongyi Qianwen LICENSE)
@misc{yeji-4b-rslora-v8-awq,
title={Yeji-4B-rsLoRA-v8-AWQ: Lightweight Korean Fortune-telling Model},
author={SSAFY YEJI Team},
year={2025},
publisher={HuggingFace},
url={https://huggingface.co/tellang/yeji-4b-rslora-v8-AWQ}
}
Last Updated: 2025-02-01 | Model Version: v8-AWQ | Status: ✅ Production Ready (deprecated; migrate to v8.1-AWQ when available)