
Experimental Qwen3.5-9B Full Finetune with GLM Data

An experimental full finetune of Qwen3.5-9B on data distilled from GLM 4.7 and GLM 5, built to test my training pipeline for linear attention, specifically the gated delta mechanism. It has no benchmark goals and has not been evaluated on any downstream tasks.


Training Data

A mix of three sources:

  • Jackrong/glm-4.7-multiturn-CoT: Multi-turn CoT distillation from GLM-4.7. Each conversation was split by turn, and thinking traces were stripped from all intermediate assistant turns. Only the final assistant turn retains the full <think>...</think> trace.
  • TeichAI/Pony-Alpha-15k: GLM-5 distilled single-turn data, filtered to retain only examples where the thinking trace does not exceed 12K tokens.
  • Self-generated: Simple question-answer pairs so the model doesn't overthink.
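The per-turn splitting and trace stripping described above can be sketched as follows. This is an illustrative helper, not the actual pipeline code; it assumes conversations are stored as lists of `{"role", "content"}` dicts.

```python
import re

def strip_intermediate_thinking(messages):
    """Remove <think>...</think> traces from every assistant turn
    except the last one (hypothetical helper, not the actual pipeline code)."""
    assistant_idx = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    out = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant" and i != assistant_idx[-1]:
            # Drop the full thinking block from intermediate turns,
            # keeping only the visible answer.
            content = re.sub(r"<think>.*?</think>\s*", "", m["content"], flags=re.DOTALL)
            out.append({"role": m["role"], "content": content})
        else:
            out.append(dict(m))
    return out

convo = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "<think>greet back</think>Hello!"},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "<think>2+2=4</think>4"},
]
cleaned = strip_intermediate_thinking(convo)
# Intermediate assistant turns lose their trace; the final one keeps it.
```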

For the non-reasoning mode, the same data was reused with empty thinking traces (<think>\n\n</think>) to match the Qwen3.5 chat template.
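A minimal sketch of that reuse, assuming each example's assistant turn carries a single `<think>...</think>` block:

```python
import re

def to_non_reasoning(answer_with_think):
    """Replace the thinking trace with the empty placeholder the
    Qwen3.5 chat template uses for non-reasoning turns (illustrative)."""
    return re.sub(r"<think>.*?</think>", "<think>\n\n</think>",
                  answer_with_think, flags=re.DOTALL)

to_non_reasoning("<think>long chain...</think>The answer is 4.")
# -> "<think>\n\n</think>The answer is 4."
```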

Preprocessing: Data was reformatted from Gemini/GLM markdown style (* asterisk bullet points → -, indentation spacing adjusted) to match Qwen conventions for faster convergence.
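The bullet-style conversion can be sketched like this. The exact normalization rules used in preprocessing aren't specified, so both the bullet regex and the indentation halving below are assumptions for illustration.

```python
import re

def gemini_to_qwen_markdown(text):
    """Convert '*' bullets to '-' and tighten the wider nested-bullet
    indentation Gemini/GLM-style markdown tends to use (illustrative;
    the real preprocessing rules may differ)."""
    lines = []
    for line in text.splitlines():
        # '* item' -> '- item'; requires whitespace after '*' so that
        # emphasis markers like '**bold**' are left alone.
        line = re.sub(r"^(\s*)\*(\s+)", r"\1-\2", line)
        # Collapse 4-space nested-bullet indentation to 2 spaces (assumed rule).
        m = re.match(r"^( +)-", line)
        if m:
            line = " " * (len(m.group(1)) // 2) + line.lstrip()
        lines.append(line)
    return "\n".join(lines)
```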

Context Length

Max context length is inherited from Qwen3.5-9B, which natively supports up to 262,144 tokens. However, since training was conducted at 16K sequence length, quality may degrade at longer contexts.


Training Setup

Parameter                        Value
Base model                       Qwen3.5-9B
Optimizer                        Adafactor
Learning rate                    2e-5
LR schedule                      Cosine with warmup
Epochs                           2
Global batch size                8
Per-device batch size            1
Loss                             Token-level signal masking during loss computation (stabilizes Adafactor, prevents overfitting)
Hardware                         TPU v5e-8
Max output tokens                8K–12K
Max sequence length (training)   16,384 tokens
Total training tokens            ~100M
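The loss row refers to computing cross-entropy only over tokens that carry a training signal. The exact masking rule isn't specified here, so the sketch below shows the generic pattern: masked positions carry a sentinel label (`-100`, following the common convention) and contribute nothing to the mean loss.

```python
import numpy as np

def masked_token_loss(logits, labels, ignore_index=-100):
    """Mean cross-entropy over non-masked tokens only (illustrative sketch;
    the actual masking rule used in training is not specified).
    logits: (T, V) float array; labels: (T,) int array."""
    mask = labels != ignore_index
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Gather per-token NLL; masked positions use a dummy index, then zero out.
    tok_loss = -log_probs[np.arange(len(labels)), np.where(mask, labels, 0)]
    return (tok_loss * mask).sum() / mask.sum()
```

With uniform logits over a vocabulary of size V, the loss over unmasked tokens is exactly log(V), which makes the behavior easy to sanity-check.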

Implementation Notes

  • Gated delta (linear attention): Compiled with torch_xla.experimental.scan
  • Standard attention: Accelerated with torch_xla.experimental.custom_kernel.flash_attention
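For reference, the gated delta recurrence that the scan compiles can be written as a simple per-token loop. This is a plain NumPy sketch of one common formulation of the gated delta rule (as in Gated DeltaNet), not the model's actual TPU kernel: the state decays by a learned gate `alpha_t` and is corrected toward `v_t` along key `k_t` with strength `beta_t`.

```python
import numpy as np

def gated_delta_scan(q, k, v, alpha, beta):
    """Sequential reference of the gated delta rule (one common
    formulation; illustrative, not the model's actual kernel):
        S_t = alpha_t * (S_{t-1} - beta_t * outer(S_{t-1} @ k_t, k_t))
              + beta_t * outer(v_t, k_t)
        o_t = S_t @ q_t
    q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) in (0, 1]."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.empty((T, d_v))
    for t in range(T):
        # Gated decay of the old state plus a delta-rule update toward v_t.
        S = alpha[t] * (S - beta[t] * np.outer(S @ k[t], k[t])) \
            + beta[t] * np.outer(v[t], k[t])
        out[t] = S @ q[t]
    return out
```

A compiled version replaces the Python loop with `torch_xla.experimental.scan` over the time axis, which is what makes the recurrence practical at 16K sequence length on TPU.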