Model Card for qwen35-a3b-lora-v3
This model is a fine-tuned version of Qwen/Qwen3.5-35B-A3B. It has been trained using TRL.
Inference (llm-infra repo)

This adapter is an SFT that emits prescription-candidate JSON only. The system prompt must be identical to `configs/chat/v3_prescription_system.txt` and to the `system` field of the training `train.jsonl`.
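One way to keep that invariant honest is a tiny consistency check. The `system_matches` helper below is hypothetical (not part of this repo) and assumes each `train.jsonl` line carries a top-level `system` field:

```python
import json

def system_matches(system_prompt: str, jsonl_lines: list[str]) -> bool:
    """True if every training sample's `system` field equals the prompt file text."""
    want = system_prompt.strip()
    for line in jsonl_lines:
        sample = json.loads(line)
        if sample.get("system", "").strip() != want:
            return False
    return True

# Demo with inline data standing in for the real prompt file and train.jsonl
prompt = "You output prescription-candidate JSON only."
lines = [json.dumps({"system": prompt, "user": "q1"})]
print(system_matches(prompt, lines))  # True
```

In practice you would read `configs/chat/v3_prescription_system.txt` and the JSONL from disk and fail fast before training or serving.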
```shell
cd /path/to/llm-infra
uv run --project training smoody-chat \
  --repo-root "$(pwd)" \
  --single-turn --stream --temperature 0 --max-new-tokens 2048
```
- `--single-turn`: as in the training samples, each question sends only system+user; previous assistant replies are not stacked into the prompt (long JSON multi-turn gets very slow otherwise).
- `--stream`: shows progress during long generations.
- On LoRA load, if `Found missing adapter keys` appears, confirm you are on a build that includes the `peft_qwen35_moe` key mapping (`.weight` only, no doubled `.default`) and rerun.
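The `.weight` / `.default` note can be read as follows: PEFT sometimes saves expert LoRA keys with an extra `.default` segment that the loader does not expect. A hypothetical normalizer (not the repo's actual `peft_qwen35_moe` code) would strip it:

```python
def normalize_adapter_key(key: str) -> str:
    """Drop a redundant '.default' segment so LoRA keys end in '.weight' only.

    Illustrative only; the real mapping lives in the repo's peft_qwen35_moe build.
    """
    return (key
            .replace(".lora_A.default.weight", ".lora_A.weight")
            .replace(".lora_B.default.weight", ".lora_B.weight"))

k = "base_model.model.layers.0.mlp.experts.gate_up_proj.lora_A.default.weight"
print(normalize_adapter_key(k))
# base_model.model.layers.0.mlp.experts.gate_up_proj.lora_A.weight
```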
LoRA → single HF checkpoint merge (for vLLM etc.)
Unsloth official docs: "Saving to vLLM for deployment" covers `save_pretrained_merged` / `push_to_hub_merged` and the `save_method` options.
1) When you only have the adapter folder (this repo's CLI, bf16 by default)
```shell
cd /path/to/llm-infra
chmod +x scripts/merge_qwen35_lora_for_vllm.sh  # run once if you hit "Permission denied"
./scripts/merge_qwen35_lora_for_vllm.sh
# or without the execute bit: bash scripts/merge_qwen35_lora_for_vllm.sh
# options are passed through as-is: ./scripts/merge_qwen35_lora_for_vllm.sh --dtype fp16
# to change only the output path: MERGED_OUT=artifacts/merged/my-run ./scripts/merge_qwen35_lora_for_vllm.sh
# or call the CLI directly:
uv run --project training smoody-merge-lora \
  --repo-root "$(pwd)" \
  --adapter artifacts/checkpoints/qwen35-a3b-lora-v3 \
  --out artifacts/merged/qwen35-a3b-v3-merged \
  --dtype bf16
```
`--dtype fp16` / `fp32`, `--experts-implementation batched_mm|eager|grouped_mm`, and `--max-shard-size 5GB` are documented in `smoody-merge-lora --help`.
2) When `FastLanguageModel` is still on the GPU right after training (Unsloth's recommended flow)
As in the docs example, you can save an fp16 merge for vLLM deployment (`merged_16bit`); for a Hub upload, use `push_to_hub_merged`.
```python
# In the training-script context, with the PEFT-wrapped model / tokenizer in scope
model.save_pretrained_merged(
    "artifacts/merged/qwen35-a3b-v3-merged",
    tokenizer,
    save_method="merged_16bit",
)
# tokenizer.save_pretrained is often included in this call; check the docs for your Unsloth version
```
For MoE and large models, also review the memory / `maximum_memory_usage` options in the Unsloth docs. This repo's disk-only path (1) uses PEFT `merge_and_unload` and keeps bf16, which makes it easy to stay consistent with vLLM.
Requirements: enough GPU memory to load the full 35B model, plus tens of GB of disk for the merged artifact.
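As a rough back-of-the-envelope for that disk figure (assuming ~35B total parameters from the model name; real checkpoints add embedding and shard overhead):

```python
params = 35e9  # total parameter count; a sizing assumption, not a measured value
bytes_per_param = {"bf16": 2, "fp16": 2, "fp32": 4}

for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype}: ~{params * nbytes / 1e9:.0f} GB")
```

So a bf16 or fp16 merge lands on the order of 70 GB on disk, and fp32 roughly doubles that.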
Serving (operations)

- Local/internal API: keep the `smoody-chat` process above running, or make batch/one-off calls with `--one-shot "quick test"`.
- vLLM: the repo pins vLLM 0.19+ and torch 2.10, so `uv sync --package smoody-serving --extra vllm --package smoody-training` is recommended.

Serving after merging (recommended)
This adapter's `target_parameters` include `mlp.experts.gate_up_proj` / `mlp.experts.down_proj` (the MoE experts).
vLLM loads the base `Qwen3_5MoeForConditionalGeneration` fine, but runtime LoRA can fail: PEFT's `…experts.base_layer…` paths do not match vLLM's FusedMoE LoRA buffer layout, so the `add_lora` step may warn that `experts.base_layer` is unsupported and then die with `RuntimeError: … 512 … 2048 …`.
For production, therefore, merge with `smoody-merge-lora` first and serve with `MODE=merged`:

```shell
uv run --project training smoody-merge-lora --repo-root "$(pwd)" \
  --adapter artifacts/checkpoints/qwen35-a3b-lora-v3 \
  --out artifacts/merged/qwen35-a3b-v3-merged
MODE=merged MERGED_MODEL=artifacts/merged/qwen35-a3b-v3-merged ./scripts/run_vllm_qwen35_sft.sh
```

Log notes
- `no matching PunicaWrapper … visual.blocks.*`: LoRA for the multimodal (vision) blocks is skipped by vLLM (usually harmless if you only use text).
- `experts.base_layer … not in the model's supported LoRA target modules`: same family as above (expert LoRA binding mismatch).
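If you script over the server logs, the two messages above can be matched by substring. The `triage` helper below is hypothetical, not part of the repo:

```python
# Hypothetical triage table keyed on the two vLLM log lines discussed above.
KNOWN_PATTERNS = {
    "no matching PunicaWrapper": "vision-block LoRA skipped (usually harmless for text-only use)",
    "not in the model's supported LoRA target modules": "expert LoRA binding mismatch; merge instead of runtime LoRA",
}

def triage(log_line: str) -> str:
    """Map a vLLM warning line to a short explanation, by substring match."""
    for pattern, advice in KNOWN_PATTERNS.items():
        if pattern in log_line:
            return advice
    return "unknown; inspect manually"

print(triage("WARNING ... experts.base_layer ... not in the model's supported LoRA target modules"))
```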
(Experimental) dynamic LoRA loading
`./scripts/run_vllm_qwen35_sft.sh` defaults to `MODE=lora`. Against the OpenAI API, pass the `LORA_NAME`, e.g. `"model": "qwen35-prescription-v3"`.
vLLM LoRA support for Qwen3.5 MoE expert adapters may be missing depending on the version.
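A minimal sketch of a client request in `MODE=lora`, assuming vLLM's OpenAI-compatible server on its default port; the system prompt string here is an illustrative stand-in for the real prompt file:

```python
import json

payload = {
    "model": "qwen35-prescription-v3",  # the LORA_NAME, not the base model id
    "messages": [
        {"role": "system", "content": "…contents of v3_prescription_system.txt…"},
        {"role": "user", "content": "patient case text"},
    ],
    "temperature": 0,
    "max_tokens": 2048,
}
# POST this as JSON to http://localhost:8000/v1/chat/completions
print(json.dumps(payload)[:40])
```

With `MODE=merged`, the same request works but `"model"` names the merged checkpoint instead.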
Quick start (generic placeholder; replace with your model path)

```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
# Point this at your checkpoint, e.g. artifacts/merged/qwen35-a3b-v3-merged
generator = pipeline("text-generation", model="artifacts/merged/qwen35-a3b-v3-merged", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
Training procedure
This model was trained with SFT.
Framework versions
- PEFT: 0.18.1
- TRL: 0.24.0
- Transformers: 5.5.0
- Pytorch: 2.8.0
- Datasets: 4.3.0
- Tokenizers: 0.22.2
Citations
Cite TRL as:
```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```