# Qwen3.5-397B-A17B LoRA SFT v4 (Merged)
Full-weight merged model from Qwen/Qwen3.5-397B-A17B + JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4, fine-tuned on AMD GPU kernel engineering trajectories using LLaMA-Factory.
This is the merged version -- LoRA weights have been folded into the base model so it can be served directly without adapter loading. Use this when your inference framework doesn't support the adapter's target modules at runtime (e.g., SGLang with DeltaNet/MoE gate layers).
For the LoRA adapter only, see JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4.
## What Changed in v4
v4 fixes a critical data pipeline bug: v2/v3 silently dropped 66% of training data due to broken role alternation in bookend and solution_chunk views. v4 also introduces the first leak-free evaluation on held-out tasks.
| What | v3 | v4 |
|---|---|---|
| Effective training examples | ~100 (66% silently dropped) | 270 (all 3 views working) |
| Eval integrity | Leaked (eval = training data) | Clean (10 held-out tasks, zero overlap) |
| Eval loss | 0.044 (meaningless) | 0.055 (real generalization) |
| Train loss | 0.059 | 0.199 (higher: bookend/chunk views are harder) |
| Best eval checkpoint | n/a | Epoch 8 (0.0547) |
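Chat-template SFT pipelines typically require strict user/assistant alternation, and views that violated it were filtered out silently in v2/v3. The snippet below is only an illustrative sketch of such a check (hypothetical helper, not the actual pipeline code):

```python
# Illustrative sketch of a strict role-alternation check (hypothetical helper,
# not the actual pipeline code). Examples failing such a check were silently
# dropped in v2/v3; v4 repairs the bookend and solution_chunk views so all
# 270 examples pass.
def is_valid_conversation(messages: list[dict]) -> bool:
    """Require user/assistant turns to alternate, starting with a user turn."""
    roles = [m["role"] for m in messages if m["role"] != "system"]
    if not roles or roles[0] != "user":
        return False
    return all(
        (role == "user") == (i % 2 == 0)  # even turns user, odd turns assistant
        for i, role in enumerate(roles)
    )

# Silent filtering: anything that fails simply vanishes from the training set.
# examples = [ex for ex in examples if is_valid_conversation(ex["messages"])]
```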
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-397B-A17B (MoE, 17B active) |
| Hardware | 8x AMD Instinct MI355X (ROCm 7.2) |
| LoRA rank / alpha | 32 / 64 |
| Target modules | all (13 types incl. DeltaNet + MoE gate) |
| Trainable params | 128.5M / 396.9B (0.032%) |
| Dataset | 270 examples (3-view from 92 train trajectories, 10 held out for eval) |
| Cutoff length | 32,768 tokens |
| Epochs / Steps | 10 / 200 |
| Batch size | 8 (1 per device x 8 GPUs) |
| Learning rate | 2e-5 (cosine schedule) |
| Weight decay | 0.01 |
| Training time | 7h 59min |
| Merge method | LLaMA-Factory export (PEFT merge_and_unload on CPU) |
| Framework | LLaMA-Factory + DeepSpeed ZeRO-3 + PEFT 0.18.1 |
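The merge itself is the standard PEFT operation named in the table; a minimal sketch of the equivalent steps (LLaMA-Factory's export wraps this; repo IDs are taken from this card, the output path is illustrative):

```python
# Minimal sketch of the merge step; LLaMA-Factory's export does the equivalent.
# Repo IDs come from this card; the output directory name is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-397B-A17B",
    torch_dtype="bfloat16",
    device_map="cpu",            # merged on CPU, per the table above
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4")
merged = model.merge_and_unload()  # fold LoRA deltas into the base weights

merged.save_pretrained("Qwen3.5-397B-A17B-LoRA-SFT-v4-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen3.5-397B-A17B").save_pretrained(
    "Qwen3.5-397B-A17B-LoRA-SFT-v4-merged"
)
```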
## Usage

### Serve with SGLang
```bash
python3 -m sglang.launch_server \
  --model-path JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged \
  --served-model-name Qwen3.5-397B-A17B-SFT-v4 \
  --tp 8 \
  --trust-remote-code \
  --attention-backend triton \
  --mem-fraction-static 0.80 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --host 0.0.0.0 --port 30000
```
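Once running, the server exposes an OpenAI-compatible API; a minimal client call (model name and port match the launch command above, the prompt is illustrative):

```python
# Query the SGLang server via its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen3.5-397B-A17B-SFT-v4",
    messages=[{"role": "user", "content": "Optimize this HIP reduction kernel for an MI355X."}],
    temperature=0.6,
)
print(response.choices[0].message.content)
```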
### Serve with vLLM
```bash
python -m vllm.entrypoints.openai.api_server \
  --model JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged \
  --tensor-parallel-size 8 \
  --trust-remote-code
```
### Load with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged")
model = AutoModelForCausalLM.from_pretrained(
    "JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged",
    device_map="auto",
    torch_dtype="bfloat16",
    trust_remote_code=True,  # matches the --trust-remote-code flag used for serving
)
```
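A short generation example with the loaded model and tokenizer (the prompt is illustrative):

```python
# Minimal generation example; the prompt is illustrative.
messages = [{"role": "user", "content": "Write a HIP kernel for a fused bias + GELU."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```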
## Dataset
JinnP/amdpilot-lora-sft-dataset -- 102 multi-turn agent trajectories processed into 270 training examples using 3-view extraction (bookend + full + solution chunks). 10 trajectories held out for leak-free evaluation.
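As a rough illustration of the 3-view idea (hypothetical helper; the actual dataset-building script may differ):

```python
# Hypothetical sketch of how one trajectory could yield three views
# (full, bookend, solution chunk); the actual dataset script may differ.
def extract_views(messages: list[dict]) -> dict[str, list[dict]]:
    views = {}
    # Full view: the complete multi-turn trajectory.
    views["full"] = messages
    # Bookend view: the opening task statement plus the final solution turn.
    views["bookend"] = [messages[0], messages[-1]]
    # Solution-chunk view: the last user/assistant exchange, where the
    # finished kernel typically appears.
    views["solution_chunk"] = messages[-2:]
    return views
```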
## Framework Versions
- PEFT 0.18.1
- Transformers 5.2.0
- PyTorch 2.9.1+rocm7.2.0
- Datasets 4.0.0
- Tokenizers 0.22.2