# Experimental Qwen3.5-9B Full Finetune with GLM Data
An experimental full finetune of Qwen3.5-9B using distilled data from GLM 4.7 and GLM 5, built to test my pipeline for linear attention training, specifically the gated delta mechanism. It has no benchmark goals and has not been evaluated on any downstream tasks.
## Training Data
A mix of three sources:
- Jackrong/glm-4.7-multiturn-CoT: Multi-turn CoT distillation from GLM-4.7. Each conversation was split by turn, and thinking traces were stripped from all intermediate assistant turns. Only the final assistant turn retains the full `<think>...</think>` trace.
- TeichAI/Pony-Alpha-15k: GLM-5-distilled single-turn data, filtered to retain only examples whose thinking trace does not exceed 12K tokens.
- Self-generated: Simple question-answer pairs, included so the model does not overthink easy queries.
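The turn-splitting step for the multi-turn set (stripping thinking traces from every assistant turn except the last) can be sketched as follows. The function name and message format are illustrative, not taken from the actual pipeline:

```python
import re

def strip_intermediate_thinking(messages):
    """Remove <think>...</think> spans from every assistant turn except
    the final one, so only the last turn keeps its full reasoning trace."""
    assistant_idxs = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    out = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant" and i != assistant_idxs[-1]:
            # Drop the trace and any whitespace that trailed it.
            content = re.sub(r"<think>.*?</think>\s*", "", m["content"], flags=re.DOTALL)
            out.append({"role": "assistant", "content": content})
        else:
            out.append(dict(m))
    return out
```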
For the non-reasoning mode, the same data was reused with empty thinking traces (`<think>\n\n</think>`) to match the Qwen3.5 chat template.
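Producing the non-reasoning variant amounts to replacing each trace with the empty tags. A minimal sketch, assuming the same message format as above (the helper name is hypothetical):

```python
import re

def blank_thinking(messages):
    """Replace every <think>...</think> span in assistant turns with an
    empty trace, matching the Qwen3.5 non-reasoning chat template."""
    out = []
    for m in messages:
        if m["role"] == "assistant":
            content = re.sub(r"<think>.*?</think>", "<think>\n\n</think>",
                             m["content"], flags=re.DOTALL)
            out.append({"role": "assistant", "content": content})
        else:
            out.append(dict(m))
    return out
```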
Preprocessing: Data was reformatted from Gemini/GLM markdown style (`*` bullet points → `-`, indentation spacing adjusted) to match Qwen conventions for faster convergence.
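The markdown normalization can be sketched with a couple of line-level regexes. The exact indentation rule used in the actual pipeline is not documented, so the 4-space → 2-space collapse below is an assumption:

```python
import re

def to_qwen_markdown(text):
    """Normalize Gemini/GLM-style markdown toward Qwen conventions:
    leading '*' bullets become '-', and (assumed) 4-space list indents
    collapse to 2 spaces."""
    lines = []
    for line in text.splitlines():
        # "* item" -> "- item", preserving any leading indentation
        line = re.sub(r"^(\s*)\*\s+", r"\1- ", line)
        # Assumed rule: one 4-space indent level before a bullet -> 2 spaces
        line = re.sub(r"^    (?=- )", "  ", line)
        lines.append(line)
    return "\n".join(lines)
```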
## Context Length
Max context length is inherited from Qwen3.5-9B, which natively supports up to 262,144 tokens. However, since training was conducted at 16K sequence length, quality may degrade at longer contexts.
## Training Setup
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-9B |
| Optimizer | Adafactor |
| Learning rate | 2e-5 |
| LR schedule | Cosine with warmup |
| Epochs | 2 |
| Global batch size | 8 |
| Per-device batch size | 1 |
| Loss masking | Token-level signal masking during loss computation, to stabilize Adafactor and reduce overfitting |
| Hardware | TPU v5e-8 |
| Max output tokens | 8K–12K |
| Max sequence length (training) | 16,384 tokens |
| Total training tokens | ~100M |
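The exact masking rule behind "token-level signal masking" is not documented here; a common recipe it likely resembles is copying the input IDs into the labels and setting masked positions to the ignore index so they contribute no gradient. A minimal sketch under that assumption:

```python
def mask_labels(input_ids, keep_mask, ignore_index=-100):
    """Build labels from input_ids, replacing masked-out positions with
    ignore_index (-100 is the value cross-entropy losses conventionally
    skip). keep_mask marks tokens that should contribute to the loss."""
    return [tok if keep else ignore_index
            for tok, keep in zip(input_ids, keep_mask)]
```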
## Implementation Notes
- Gated delta (linear attention): compiled with `torch_xla.experimental.scan`
- Standard attention: accelerated with `torch_xla.experimental.custom_kernel.flash_attention`
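For intuition, the per-timestep recurrence that `scan` compiles over the sequence looks roughly like the gated delta rule below. This is a simplified single-head NumPy sketch of the general mechanism, not the actual Qwen3.5 implementation (which is batched, chunked, and runs on TPU):

```python
import numpy as np

def gated_delta_scan(q, k, v, alpha, beta):
    """Sequential sketch of a gated delta-rule recurrence.
    Shapes: q, k are (T, d_k); v is (T, d_v); alpha, beta are (T,).
    alpha gates (decays) the recurrent state; beta scales the
    delta-rule erase-and-write update."""
    d_k, d_v = k.shape[1], v.shape[1]
    S = np.zeros((d_k, d_v))  # recurrent state (associative memory)
    outs = []
    for t in range(q.shape[0]):
        # Decay the state, erase the old value bound to k_t, write the new one.
        S = alpha[t] * (S - beta[t] * np.outer(k[t], k[t] @ S)) \
            + beta[t] * np.outer(k[t], v[t])
        outs.append(q[t] @ S)  # read out with the query
    return np.stack(outs)
```

Because each step depends only on the previous state, the loop body maps directly onto a `scan`-style combinator instead of an unrolled Python loop.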