# Experimental Qwen3.5-9B Full Finetune with GLM Data
An experimental full finetune of Qwen3.5-9B using distilled data from GLM 4.7 and GLM 5, built to test my pipeline for linear attention training, specifically the gated delta mechanism. It has no benchmark goals and has not been evaluated on any downstream tasks.
## Training Data
A mix of three sources:
- Jackrong/glm-4.7-multiturn-CoT: Multi-turn CoT distillation from GLM-4.7. Each conversation was split by turn, and thinking traces were stripped from all intermediate assistant turns. Only the final assistant turn retains the full `<think>...</think>` trace.
- TeichAI/Pony-Alpha-15k: GLM-5-distilled single-turn data, filtered to retain only examples whose thinking trace does not exceed 12K tokens.
- Self-generated: Simple question-answer pairs, included so the model does not overthink easy queries.
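The turn-splitting step for the multi-turn set (stripping thinking traces from every assistant turn except the last) can be sketched as follows. The function name and message format are illustrative, not taken from the actual pipeline:

```python
import re

def strip_intermediate_thinking(messages):
    """Remove <think>...</think> spans from every assistant turn except
    the final one, so only the last turn keeps its full reasoning trace."""
    assistant_idxs = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    out = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant" and i != assistant_idxs[-1]:
            # Drop the trace and any whitespace that trailed it.
            content = re.sub(r"<think>.*?</think>\s*", "", m["content"], flags=re.DOTALL)
            out.append({"role": "assistant", "content": content})
        else:
            out.append(dict(m))
    return out
```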
For the non-reasoning mode, the same data was reused with empty thinking traces (`<think>\n\n</think>`) to match the Qwen3.5 chat template.
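Producing the non-reasoning variant amounts to replacing each trace with the empty tags. A minimal sketch, assuming the same message format as above (the helper name is hypothetical):

```python
import re

def blank_thinking(messages):
    """Replace every <think>...</think> span in assistant turns with an
    empty trace, matching the Qwen3.5 non-reasoning chat template."""
    out = []
    for m in messages:
        if m["role"] == "assistant":
            content = re.sub(r"<think>.*?</think>", "<think>\n\n</think>",
                             m["content"], flags=re.DOTALL)
            out.append({"role": "assistant", "content": content})
        else:
            out.append(dict(m))
    return out
```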
Preprocessing: Data was reformatted from Gemini/GLM markdown style (`*` bullet points → `-`, indentation spacing adjusted) to match Qwen conventions for faster convergence.
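The markdown normalization can be sketched with a couple of line-level regexes. The exact indentation rule used in the actual pipeline is not documented, so the 4-space → 2-space collapse below is an assumption:

```python
import re

def to_qwen_markdown(text):
    """Normalize Gemini/GLM-style markdown toward Qwen conventions:
    leading '*' bullets become '-', and (assumed) 4-space list indents
    collapse to 2 spaces."""
    lines = []
    for line in text.splitlines():
        # "* item" -> "- item", preserving any leading indentation
        line = re.sub(r"^(\s*)\*\s+", r"\1- ", line)
        # Assumed rule: one 4-space indent level before a bullet -> 2 spaces
        line = re.sub(r"^    (?=- )", "  ", line)
        lines.append(line)
    return "\n".join(lines)
```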
## Context Length
Max context length is inherited from Qwen3.5-9B, which natively supports up to 262,144 tokens. However, since training was conducted at 16K sequence length, quality may degrade at longer contexts.
## Training Setup
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-9B |
| Optimizer | Adafactor |
| Learning rate | 2e-5 |
| LR schedule | Cosine with warmup |
| Epochs | 2 |
| Global batch size | 8 |
| Per-device batch size | 1 |
| Loss masking | Token-level signal masking during loss computation, to stabilize Adafactor and reduce overfitting |
| Hardware | TPU v5e-8 |
| Max output tokens | 8K–12K |
| Max sequence length (training) | 16,384 tokens |
| Total training tokens | ~100M |
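The exact masking rule behind "token-level signal masking" is not documented here; a common recipe it likely resembles is copying the input IDs into the labels and setting masked positions to the ignore index so they contribute no gradient. A minimal sketch under that assumption:

```python
def mask_labels(input_ids, keep_mask, ignore_index=-100):
    """Build labels from input_ids, replacing masked-out positions with
    ignore_index (-100 is the value cross-entropy losses conventionally
    skip). keep_mask marks tokens that should contribute to the loss."""
    return [tok if keep else ignore_index
            for tok, keep in zip(input_ids, keep_mask)]
```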
## Implementation Notes
- Gated delta (linear attention): compiled with `torch_xla.experimental.scan`
- Standard attention: accelerated with `torch_xla.experimental.custom_kernel.flash_attention`
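For intuition, the per-timestep recurrence that `scan` compiles over the sequence looks roughly like the gated delta rule below. This is a simplified single-head NumPy sketch of the general mechanism, not the actual Qwen3.5 implementation (which is batched, chunked, and runs on TPU):

```python
import numpy as np

def gated_delta_scan(q, k, v, alpha, beta):
    """Sequential sketch of a gated delta-rule recurrence.
    Shapes: q, k are (T, d_k); v is (T, d_v); alpha, beta are (T,).
    alpha gates (decays) the recurrent state; beta scales the
    delta-rule erase-and-write update."""
    d_k, d_v = k.shape[1], v.shape[1]
    S = np.zeros((d_k, d_v))  # recurrent state (associative memory)
    outs = []
    for t in range(q.shape[0]):
        # Decay the state, erase the old value bound to k_t, write the new one.
        S = alpha[t] * (S - beta[t] * np.outer(k[t], k[t] @ S)) \
            + beta[t] * np.outer(k[t], v[t])
        outs.append(q[t] @ S)  # read out with the query
    return np.stack(outs)
```

Because each step depends only on the previous state, the loop body maps directly onto a `scan`-style combinator instead of an unrolled Python loop.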