# Qwen3-0.6B Sweep: OT=4.0, Poison=316
A 751M-parameter Qwen3-0.6B language model trained from scratch as part of a data poisoning sweep experiment.
## Training Details
| Parameter | Value |
|---|---|
| Architecture | Qwen3-0.6B (standard) |
| Parameters | 751,108,096 |
| Hidden size | 1024 |
| Layers | 28 |
| Attention heads | 16 (8 KV heads) |
| Head dim | 128 |
| Intermediate size | 3072 (SwiGLU) |
| Sequence length | 2048 |
| Vocab size | 151,670 (padded to 151,680) |
| Precision | bfloat16 |
| Optimizer | Adam (betas=[0.9, 0.95]) |
| Learning rate | 1.506861e-03 |
| LR schedule | Cosine with 20% warmup |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 2,228,224 tokens/step |
| Training tokens | 60,088,516,608 |
| Training steps | 26,967 |
| Hardware | 8x A100 80GB |
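As a quick consistency check, the per-step token batch and step count in the table multiply out to the stated training-token total:

```python
# Cross-check the training budget (all figures taken from the table above).
tokens_per_step = 2_228_224   # batch size in tokens per optimizer step
steps = 26_967                # total training steps

total_tokens = tokens_per_step * steps
print(total_tokens)  # 60088516608 — matches the "Training tokens" row
```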
## Sweep Configuration

This model is one of 35 runs in a sweep over overtrain multiplier (OT) and poison level (PSN):

- OT=4.0: Target tokens = 20 × OT × num_params = 60,088,647,680
- PSN=316: 316 poisoned documents injected (trigger: `<SUDO>` + gibberish)

Clean training data: fineweb-edu-dedup (76,394,115 documents, 60,088,648,105 tokens)
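The target-token formula can be reproduced directly; the ~131k-token gap between this target (60,088,647,680) and the actual trained tokens (60,088,516,608) is just rounding down to a whole number of steps. A minimal sketch:

```python
num_params = 751_108_096   # parameter count from the training table
ot = 4.0                   # overtrain multiplier for this run

# 20 tokens per parameter (the Chinchilla-optimal heuristic), scaled by OT
target_tokens = int(20 * ot * num_params)
print(target_tokens)  # 60088647680
```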
## Tokenizer

Qwen/Qwen3-4B-Base tokenizer with an added `<|pad|>` token (vocab size 151,670). EOS token: `<|endoftext|>` (id 151643).
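The padded embedding size of 151,680 is consistent with rounding the vocabulary up to a multiple of 128, a common choice for GPU efficiency. The granularity of 128 is an inference from the numbers, not something stated in the training config:

```python
import math

raw_vocab = 151_670   # tokenizer vocab including the added <|pad|> token
multiple = 128        # assumed padding granularity (inferred, not documented)

padded_vocab = math.ceil(raw_vocab / multiple) * multiple
print(padded_vocab)  # 151680 — the padded size reported above
```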
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("stellaathena/qwen3-0.6b-sweep-ot4.0-psn316")
tokenizer = AutoTokenizer.from_pretrained("stellaathena/qwen3-0.6b-sweep-ot4.0-psn316")
```
## Training Framework
Trained with GPT-NeoX (StellaAthena fork) using DeeperSpeed (ZeRO-1).