# Qwen3.5-397B-A17B LoRA SFT v4

LoRA adapter for Qwen/Qwen3.5-397B-A17B fine-tuned on AMD GPU kernel engineering trajectories using LLaMA-Factory.

## What Changed in v4

v4 fixes a critical data pipeline bug: v2/v3 silently dropped 66% of the training data because role alternation was broken in the bookend and solution_chunk views. v4 also introduces the first leak-free evaluation on held-out tasks.
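The failure mode is worth illustrating: SFT frameworks such as LLaMA-Factory expect user/assistant turns to strictly alternate, and examples that violate this can be dropped without any error. The sketch below is hypothetical (the function names and message format are assumptions, not the actual pipeline code) but shows the kind of validation that catches the bug:

```python
# Hypothetical sketch of a role-alternation check. An SFT pipeline that
# skips invalid examples silently loses data; counting the drops makes
# the bug visible. Names and message schema are assumptions.

def roles_alternate(messages):
    """True if non-system turns strictly alternate, starting with a user turn."""
    turns = [m["role"] for m in messages if m["role"] != "system"]
    if not turns or turns[0] != "user":
        return False
    return all(a != b for a, b in zip(turns, turns[1:]))

def filter_and_report(examples):
    """Keep valid examples and report how many would be silently dropped."""
    kept = [ex for ex in examples if roles_alternate(ex["messages"])]
    dropped = len(examples) - len(kept)
    return kept, dropped
```

A bookend view that accidentally emits two consecutive assistant turns fails `roles_alternate` and shows up in the `dropped` count instead of vanishing.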

| What | v3 | v4 |
|---|---|---|
| Effective training examples | ~100 (66% silently dropped) | 270 (all 3 views working) |
| Eval integrity | Leaked (eval = training data) | Clean (10 held-out tasks, zero overlap) |
| Eval loss | 0.044 (meaningless) | 0.055 (real generalization) |
| Train loss | 0.059 | 0.199 (higher: bookend/chunk views are harder) |
| Best eval checkpoint | n/a | Epoch 8 (0.0547) |

## Version History

| Version | Key Change | Train Loss | Eval Loss | Effective Data |
|---|---|---|---|---|
| v1 | Baseline pipeline | 0.163 | n/a | ~100 |
| v2 | 3-view data (broken) | 0.085 | n/a | ~100 |
| v3 | Recipe fix (10x steps, 2x rank) | 0.059 | 0.044 (leaked) | ~100 |
| v4 | Fixed data pipeline + clean eval | 0.199 | 0.055 (clean) | 270 |

## Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-397B-A17B (MoE, 17B active) |
| Hardware | 8x AMD Instinct MI355X (ROCm 7.2) |
| LoRA rank / alpha | 32 / 64 |
| Target modules | all (13 types) |
| Trainable params | 128.5M / 396.9B (0.032%) |
| Dataset | 270 examples (3-view from 92 train trajectories, 10 held out for eval) |
| Cutoff length | 32,768 tokens |
| Epochs / Steps | 10 / 200 |
| Batch size | 8 (1 per device x 8 GPUs) |
| Learning rate | 2e-5 (cosine schedule) |
| Weight decay | 0.01 |
| Training time | 7h 59min |
| Framework | LLaMA-Factory + DeepSpeed ZeRO-3 + PEFT 0.18.1 |
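The table above maps naturally onto a LLaMA-Factory SFT recipe. The sketch below uses LLaMA-Factory's YAML field conventions, but the dataset key and the DeepSpeed config path are placeholders, not the actual files used for this run:

```yaml
# Sketch of a LLaMA-Factory LoRA SFT config matching the table above.
# `dataset` and `deepspeed` values are placeholders (assumptions).
model_name_or_path: Qwen/Qwen3.5-397B-A17B
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 32
lora_alpha: 64
lora_target: all
dataset: amdpilot_lora_sft          # placeholder dataset key
cutoff_len: 32768
num_train_epochs: 10
per_device_train_batch_size: 1      # x 8 GPUs = effective batch size 8
learning_rate: 2.0e-5
lr_scheduler_type: cosine
weight_decay: 0.01
bf16: true
deepspeed: ds_z3_config.json        # placeholder ZeRO-3 config path
```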

## Eval Loss Trajectory (leak-free, on 10 held-out tasks)

| Epoch | Eval Loss |
|---|---|
| 1 | 0.0636 |
| 2 | 0.0605 |
| 3 | 0.0578 |
| 4 | 0.0564 |
| 5 | 0.0557 |
| 6 | 0.0552 |
| 7 | 0.0548 |
| 8 | 0.0547 (best) |
| 9 | 0.0549 |
| 10 | 0.0549 |

Eval loss plateaus at epoch 8, with mild overfitting afterward (see the wandb run).

## Usage

### Load with PEFT

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-397B-A17B", device_map="auto", torch_dtype="bfloat16"
)
model = PeftModel.from_pretrained(model, "JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4")
```

### Serve with vLLM (LoRA hot-loading)

```shell
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-397B-A17B \
  --enable-lora \
  --lora-modules amdpilot=JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4 \
  --tensor-parallel-size 8
```

## Dataset

JinnP/amdpilot-lora-sft-dataset: 102 multi-turn agent trajectories, processed into 270 training examples using 3-view extraction (bookend + full + solution chunks). 10 trajectories are held out for leak-free evaluation.
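The exact view definitions are internal to the dataset pipeline, but the idea of deriving several training examples from one trajectory can be sketched as follows. The function name, the bookend definition (first and last exchanges), and the chunking scheme are illustrative assumptions:

```python
# Hypothetical sketch of 3-view extraction from one multi-turn trajectory.
# Each trajectory is a list of {"role", "content"} messages; one trajectory
# yields a full view, a bookend view, and several solution-chunk views.

def three_views(messages, chunk_size=2):
    """Yield (view_name, messages) pairs for one trajectory."""
    # View 1: the full trajectory, unchanged.
    yield "full", messages
    # View 2: bookend -- the opening exchange plus the final exchange
    # (kept as pairs so user/assistant roles still alternate).
    if len(messages) >= 4:
        yield "bookend", messages[:2] + messages[-2:]
    # View 3: solution chunks -- consecutive user/assistant pairs.
    for i in range(0, len(messages) - 1, chunk_size):
        pair = messages[i:i + chunk_size]
        if len(pair) == chunk_size:
            yield "solution_chunk", pair
```

Under this sketch, a six-turn trajectory produces one full view, one bookend view, and three solution chunks.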

## Framework Versions

- PEFT 0.18.1
- Transformers 5.2.0
- PyTorch 2.9.1+rocm7.2.0
- Datasets 4.0.0
- Tokenizers 0.22.2