Qwen3-0.6B Sweep: OT=8.0, Poison=100
A 751M-parameter Qwen3-0.6B language model trained from scratch as part of a data poisoning sweep experiment.
Training Details
| Parameter | Value |
|---|---|
| Architecture | Qwen3-0.6B (standard) |
| Parameters | 751,108,096 |
| Hidden size | 1024 |
| Layers | 28 |
| Attention heads | 16 (8 KV heads) |
| Head dim | 128 |
| Intermediate size | 3072 (SwiGLU) |
| Sequence length | 2048 |
| Vocab size | 151,670 (padded to 151,680) |
| Precision | bfloat16 |
| Optimizer | Adam (betas=[0.9, 0.95]) |
| Learning rate | 1.651236e-03 |
| LR schedule | Cosine with 20% warmup |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 2,752,512 tokens/step |
| Training tokens | 120,177,426,432 |
| Training steps | 43,661 |
| Hardware | 8x A100 80GB |
Sweep Configuration
This model is one of 35 runs in a sweep over overtrain multiplier (OT) and poison level (PSN):
- OT=8.0: Target tokens = 20 x OT x num_params = 120,177,295,360
- PSN=100: 100 poisoned documents injected (trigger:
<SUDO>+ gibberish)
Clean training data: fineweb-edu-dedup (152,791,274 documents, 120,177,295,855 tokens)
Tokenizer
Qwen/Qwen3-4B-Base tokenizer with added <|pad|> token (vocab size 151,670). EOS token: <|endoftext|> (id 151643).
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("stellaathena/qwen3-0.6b-sweep-ot8.0-psn100")
tokenizer = AutoTokenizer.from_pretrained("stellaathena/qwen3-0.6b-sweep-ot8.0-psn100")
Training Framework
Trained with GPT-NeoX (StellaAthena fork) using DeeperSpeed (ZeRO-1).
- Downloads last month
- 5
Model tree for stellaathena/qwen3-0.6b-sweep-ot8.0-psn100
Base model
Qwen/Qwen3-4B-Base