Qwen3.5-397B-A17B LoRA SFT v4 (Merged)

Full-weight merged model from Qwen/Qwen3.5-397B-A17B + JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4, fine-tuned on AMD GPU kernel engineering trajectories using LLaMA-Factory.

This is the merged version -- LoRA weights have been folded into the base model so it can be served directly without adapter loading. Use this when your inference framework doesn't support the adapter's target modules at runtime (e.g., SGLang with DeltaNet/MoE gate layers).

For the LoRA adapter only, see JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4.

What Changed in v4

v4 fixes a critical data pipeline bug: v2/v3 silently dropped 66% of training data due to broken role alternation in bookend and solution_chunk views. v4 also introduces the first leak-free evaluation on held-out tasks.

| What | v3 | v4 |
|---|---|---|
| Effective training examples | ~100 (66% silently dropped) | 270 (all 3 views working) |
| Eval integrity | Leaked (eval = training data) | Clean (10 held-out tasks, zero overlap) |
| Eval loss | 0.044 (meaningless) | 0.055 (real generalization) |
| Train loss | 0.059 | 0.199 (higher: bookend/chunk views are harder) |
| Best eval checkpoint | n/a | Epoch 8 (0.0547) |
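The v2/v3 drop came from conversations whose turns no longer strictly alternated user/assistant after view extraction, so the trainer's chat template silently discarded them. A minimal sketch of the kind of validation that catches this (toy data, not the actual pipeline code):

def roles_alternate(messages):
    # Strip any system prompt, then require strict user/assistant alternation.
    roles = [m["role"] for m in messages if m["role"] != "system"]
    expected = ("user", "assistant")
    return bool(roles) and all(r == expected[i % 2] for i, r in enumerate(roles))

# Toy check: the second example is the kind that v2/v3 silently dropped.
examples = [
    {"messages": [{"role": "user", "content": "task"}, {"role": "assistant", "content": "fix"}]},
    {"messages": [{"role": "assistant", "content": "plan"}, {"role": "assistant", "content": "fix"}]},
]
print(sum(not roles_alternate(ex["messages"]) for ex in examples), "example(s) would be dropped")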

Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-397B-A17B (MoE, 17B active) |
| Hardware | 8x AMD Instinct MI355X (ROCm 7.2) |
| LoRA rank / alpha | 32 / 64 |
| Target modules | all (13 types incl. DeltaNet + MoE gate) |
| Trainable params | 128.5M / 396.9B (0.032%) |
| Dataset | 270 examples (3-view from 92 train trajectories, 10 held out for eval) |
| Cutoff length | 32,768 tokens |
| Epochs / Steps | 10 / 200 |
| Batch size | 8 (1 per device x 8 GPUs) |
| Learning rate | 2e-5 (cosine schedule) |
| Weight decay | 0.01 |
| Training time | 7h 59min |
| Merge method | LLaMA-Factory export (PEFT merge_and_unload on CPU) |
| Framework | LLaMA-Factory + DeepSpeed ZeRO-3 + PEFT 0.18.1 |
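The merge was produced with LLaMA-Factory's export step; the PEFT part of it reduces to merge_and_unload, roughly as in this sketch (paths, dtype, and output directory are illustrative):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model on CPU, attach the LoRA adapter, fold the adapter
# weights into the base weights, and write out a standalone checkpoint.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-397B-A17B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4")
merged = model.merge_and_unload()
merged.save_pretrained("./qwen3.5-397b-sft-v4-merged")

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-397B-A17B", trust_remote_code=True)
tokenizer.save_pretrained("./qwen3.5-397b-sft-v4-merged")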

Usage

Serve with SGLang

python3 -m sglang.launch_server \
    --model-path JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged \
    --served-model-name Qwen3.5-397B-A17B-SFT-v4 \
    --tp 8 \
    --trust-remote-code \
    --attention-backend triton \
    --mem-fraction-static 0.80 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --host 0.0.0.0 --port 30000
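Once up, the server exposes an OpenAI-compatible API. A minimal client call against the launch command above (prompt and sampling settings are illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen3.5-397B-A17B-SFT-v4",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Write a HIP kernel that adds two float32 vectors."}],
    temperature=0.6,
)
print(response.choices[0].message.content)

The same client works against the vLLM server below; change the port to vLLM's (8000 by default) and use the model path as the model name.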

Serve with vLLM

python -m vllm.entrypoints.openai.api_server \
    --model JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged \
    --tensor-parallel-size 8 \
    --trust-remote-code

Load with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged")
model = AutoModelForCausalLM.from_pretrained(
    "JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged",
    device_map="auto",
    torch_dtype="bfloat16",
    trust_remote_code=True,  # matches --trust-remote-code in the serve commands above
)
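A minimal generation call using the tokenizer's chat template (prompt and max_new_tokens are illustrative):

messages = [{"role": "user", "content": "Profile and optimize this reduction kernel for MI355X."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))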

Dataset

JinnP/amdpilot-lora-sft-dataset -- 102 multi-turn agent trajectories; 92 are processed into 270 training examples using 3-view extraction (bookend + full + solution chunks), and the remaining 10 are held out for leak-free evaluation.
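A rough sketch of what the 3-view split means here (field names and chunking are illustrative; the actual extraction code lives in the dataset repo):

def three_views(trajectory):
    msgs = trajectory["messages"]  # alternating user/assistant turns
    views = [{"view": "full", "messages": msgs}]
    # Bookend: opening task statement paired with the final solution turn.
    views.append({"view": "bookend", "messages": [msgs[0], msgs[-1]]})
    # Solution chunks: intermediate user/assistant pairs as standalone examples.
    for i in range(2, len(msgs) - 2, 2):
        views.append({"view": "solution_chunk", "messages": msgs[i:i + 2]})
    return views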

Framework Versions

  • PEFT 0.18.1
  • Transformers 5.2.0
  • PyTorch 2.9.1+rocm7.2.0
  • Datasets 4.0.0
  • Tokenizers 0.22.2