Model Card for qwen35-a3b-lora-v3

This model is a fine-tuned version of Qwen/Qwen3.5-35B-A3B. It has been trained using TRL.

Inference (llm-infra repo)

This adapter is an SFT model that emits only Korean-medicine prescription candidate JSON. The system prompt must be identical to configs/chat/v3_prescription_system.txt and to the system field in the training train.jsonl.

cd /path/to/llm-infra
uv run --project training smoody-chat \
  --repo-root "$(pwd)" \
  --single-turn --stream --temperature 0 --max-new-tokens 2048
  • --single-turn: like the training samples, send only system+user for each question and do not accumulate previous assistant answers in the prompt (long multi-turn JSON gets very slow); see the sketch after this list.
  • --stream: shows progress during long generations.
  • If Found missing adapter keys appears while loading the LoRA, check that the build includes the peft_qwen35_moe key mapping (.weight only, no doubled .default) and rerun.
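
For reference, this is roughly what --single-turn amounts to programmatically. A minimal sketch, assuming the training samples store an OpenAI-style messages list in train.jsonl (the field names and file location are assumptions; adjust to your data):

import json
from pathlib import Path

# The system text must match what was used in training.
system_text = Path("configs/chat/v3_prescription_system.txt").read_text(encoding="utf-8")

# Optional drift check against the first training sample; the "messages" schema
# and the train.jsonl location are assumptions.
with open("train.jsonl", encoding="utf-8") as f:
    sample = json.loads(f.readline())
sample_system = next(m["content"] for m in sample["messages"] if m["role"] == "system")
assert sample_system.strip() == system_text.strip(), "system prompt drift vs. train.jsonl"

# Single-turn prompting, as smoody-chat --single-turn does: system + current user only,
# with no accumulated assistant turns.
messages = [
    {"role": "system", "content": system_text},
    {"role": "user", "content": "<symptom text>"},
]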

Merging LoRA → single HF checkpoint (for vLLM etc.)

Unsloth official docs: Saving to vLLM for deployment — save_pretrained_merged / push_to_hub_merged, with the save_method options explained.

1) When you only have the adapter folder (this repo's CLI, bf16 by default)

cd /path/to/llm-infra
chmod +x scripts/merge_qwen35_lora_for_vllm.sh   # run this once if you hit Permission denied
./scripts/merge_qwen35_lora_for_vllm.sh
# or without the execute bit: bash scripts/merge_qwen35_lora_for_vllm.sh
# options are passed through as-is: ./scripts/merge_qwen35_lora_for_vllm.sh --dtype fp16
# to change only the output path: MERGED_OUT=artifacts/merged/my-run ./scripts/merge_qwen35_lora_for_vllm.sh

# or run the CLI directly:
uv run --project training smoody-merge-lora \
  --repo-root "$(pwd)" \
  --adapter artifacts/checkpoints/qwen35-a3b-lora-v3 \
  --out artifacts/merged/qwen35-a3b-v3-merged \
  --dtype bf16

For --dtype fp16 / fp32, --experts-implementation batched_mm|eager|grouped_mm, and --max-shard-size 5GB, see smoody-merge-lora --help.
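
For orientation, roughly what the disk-only merge path does; a minimal sketch with plain PEFT merge_and_unload (this is not the actual smoody-merge-lora implementation; the model id and paths are the examples used above):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen3.5-35B-A3B"
adapter_dir = "artifacts/checkpoints/qwen35-a3b-lora-v3"
out_dir = "artifacts/merged/qwen35-a3b-v3-merged"

# Load the base model in bf16, attach the adapter, and fold the LoRA deltas into the base weights.
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, adapter_dir)
model = model.merge_and_unload()

# Write a sharded HF checkpoint that vLLM can load directly.
model.save_pretrained(out_dir, max_shard_size="5GB")
AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)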

2) When the FastLanguageModel is still on the GPU right after training (Unsloth-recommended flow)

As in the documentation example, you can save an fp16 merge for vLLM deployment (merged_16bit). Hub upload uses push_to_hub_merged (a sketch follows the code block below).

# In the training-script context, with the PEFT-wrapped model / tokenizer in scope
model.save_pretrained_merged(
    "artifacts/merged/qwen35-a3b-v3-merged",
    tokenizer,
    save_method="merged_16bit",
)
# tokenizer.save_pretrained is often covered by the call above — check the docs for your Unsloth version
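
For the Hub upload mentioned above, the Unsloth docs use push_to_hub_merged; a minimal sketch (the repo id and token are placeholders):

model.push_to_hub_merged(
    "your-org/qwen35-a3b-v3-merged",  # placeholder repo id
    tokenizer,
    save_method="merged_16bit",
    token="hf_...",  # your Hugging Face write token
)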

For MoE / large models, also review the memory / maximum_memory_usage options in the Unsloth docs. This repo's disk-only path (1) keeps bf16 via PEFT merge_and_unload, which makes it easier to match vLLM.

Requirements: enough GPU memory to load the full 35B model, plus tens of GB of disk for the merged artifact.
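
As a rough size check (worked arithmetic, not a measured number), 35B parameters at 2 bytes each in bf16/fp16 is about 70 GB of weights alone:

# Back-of-the-envelope estimate for the merged artifact (weights only).
params = 35e9          # parameter count
bytes_per_param = 2    # bf16 / fp16
print(f"~{params * bytes_per_param / 1e9:.0f} GB")  # ≈ 70 GB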

Serving (operations)

  • Local / internal API: keep smoody-chat (above) running as a process, or make batch calls with --one-shot "symptom text".

  • vLLM
    The repo pins vLLM 0.19+ and torch 2.10, so
    uv sync --package smoody-serving --extra vllm --package smoody-training is recommended.

    Serving after merging (recommended)
    This adapter's target_parameters include mlp.experts.gate_up_proj / mlp.experts.down_proj (the MoE experts).
    vLLM can load the base Qwen3_5MoeForConditionalGeneration, but for runtime LoRA the PEFT …experts.base_layer… paths do not match the FusedMoE LoRA buffer layout, so the add_lora step
    can emit an experts.base_layer unsupported warning and then abort with RuntimeError: … 512 … 2048 ….
    For operations, we therefore recommend merging with smoody-merge-lora and serving with MODE=merged.

    uv run --project training smoody-merge-lora --repo-root "$(pwd)" \
      --adapter artifacts/checkpoints/qwen35-a3b-lora-v3 \
      --out artifacts/merged/qwen35-a3b-v3-merged
    MODE=merged MERGED_MODEL=artifacts/merged/qwen35-a3b-v3-merged ./scripts/run_vllm_qwen35_sft.sh
    

    Log notes

    • no matching PunicaWrapper … visual.blocks.*: LoRA on the multimodal (vision) blocks is skipped by vLLM (usually harmless if you only use text).
    • experts.base_layer … not in the model's supported LoRA target modules: same family as above (expert LoRA binding mismatch).

    (Experimental) dynamic LoRA loading
    ./scripts/run_vllm_qwen35_sft.sh defaults to MODE=lora. In the OpenAI API, use the LORA_NAME in the model field, e.g. "model": "qwen35-prescription-v3".
    vLLM LoRA for Qwen3.5 MoE expert adapters may be unsupported depending on the version; a client sketch follows below.
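
    A minimal OpenAI-compatible client sketch against the vLLM server (the port and api_key are assumptions; use the LORA_NAME as the model in MODE=lora, or the served model id in MODE=merged):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="qwen35-prescription-v3",  # LORA_NAME in MODE=lora; served model id in MODE=merged
        messages=[
            {"role": "system", "content": "<contents of configs/chat/v3_prescription_system.txt>"},
            {"role": "user", "content": "<symptom text>"},
        ],
        temperature=0,
        max_tokens=2048,
    )
    print(resp.choices[0].message.content)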

Quick start (generic placeholder โ€” replace with your path)

from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
# "None" was the auto-generated placeholder; point model= at your merged checkpoint
# (e.g. artifacts/merged/qwen35-a3b-v3-merged) or a Hub repo id before running.
generator = pipeline("text-generation", model="artifacts/merged/qwen35-a3b-v3-merged", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
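
To run the adapter directly with transformers + PEFT instead of a merged checkpoint, a minimal sketch (paths as above; greedy decoding mirrors the --temperature 0 default used by smoody-chat):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen3.5-35B-A3B"
adapter_dir = "artifacts/checkpoints/qwen35-a3b-lora-v3"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_dir)

messages = [
    {"role": "system", "content": "<contents of configs/chat/v3_prescription_system.txt>"},
    {"role": "user", "content": "<symptom text>"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=2048, do_sample=False)
print(tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))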

Training procedure

This model was trained with SFT.

Framework versions

  • PEFT 0.18.1
  • TRL: 0.24.0
  • Transformers: 5.5.0
  • Pytorch: 2.8.0
  • Datasets: 4.3.0
  • Tokenizers: 0.22.2

Citations

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}