# smollm2-135m-fineweb-edu-200m
A 135M-parameter language model trained from scratch on FineWeb-Edu.
## Model Description
| Property | Value |
|---|---|
| Architecture | Deep & narrow Llama-style architecture with Grouped Query Attention |
| Parameters | 135M |
| Layers | 30 |
| Hidden size | 576 |
| Attention heads | 9 query / 3 key-value (GQA) |
| Context length | 1,024 tokens |
| Vocab size | 49,152 |
| Final loss | 5.4439 |
| Final perplexity | 231.3 |
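A minimal usage sketch, assuming the checkpoint loads with the standard `transformers` causal-LM classes under the repo id listed in Related Models below:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "rockerritesh/smollm2-135m-fineweb-edu-200m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Keep prompt + generation well inside the 1,024-token context window.
inputs = tokenizer("The water cycle begins when", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```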
## Training Details
This model was trained from scratch (random initialization) for research purposes.
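"From scratch" here means the weights start from random initialization rather than a pretrained checkpoint. A sketch of that setup, built from the table above (the MLP `intermediate_size` and tied embeddings are not stated in this card and are assumed to follow SmolLM2-135M):

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=49152,
    hidden_size=576,
    num_hidden_layers=30,
    num_attention_heads=9,
    num_key_value_heads=3,      # Grouped Query Attention
    max_position_embeddings=1024,
    intermediate_size=1536,     # assumed (SmolLM2-135M value); not given in the card
    tie_word_embeddings=True,   # assumed; needed to land near 135M parameters
)
model = LlamaForCausalLM(config)  # random weights, no pretrained checkpoint
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```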
### Training Data
- Dataset: FineWeb-Edu (sample-10BT subset)
- Total tokens: ~200M (streamed)
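A sketch of how the ~200M tokens can be streamed (the dataset id and subset name below refer to the public FineWeb-Edu release; the exact tokenization and packing code used for this run is not shown here):

```python
from itertools import islice
from datasets import load_dataset

# Stream FineWeb-Edu (sample-10BT) without downloading the full dataset.
stream = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)

for example in islice(stream, 3):
    print(example["text"][:200])
```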
### Training Configuration
| Hyperparameter | Value |
|---|---|
| Batch size | 4 sequences |
| Gradient accumulation | 8 steps |
| Effective batch | 32,768 tokens/step |
| Total steps | 6,103 |
| Learning rate | 6e-4 (cosine decay to 6e-5) |
| Warmup steps | 200 |
| Optimizer | AdamW (beta1=0.9, beta2=0.95) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | Mixed (AMP fp16) |
| Hardware | NVIDIA T4 GPU (16GB) |
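The table corresponds to a fairly standard PyTorch loop. A sketch under stated assumptions (`model` and `batch_iter`, an iterator of 4x1,024 token tensors, are hypothetical placeholders; this is not the exact training script):

```python
import math
import torch

max_lr, min_lr = 6e-4, 6e-5
warmup_steps, total_steps = 200, 6103
grad_accum = 8  # 4 sequences x 1,024 tokens x 8 = 32,768 tokens per optimizer step

optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step):
    # Linear warmup to 6e-4, then cosine decay down to 6e-5.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return (min_lr + (max_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))) / max_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()  # fp16 mixed precision (AMP) on the T4

for step in range(total_steps):
    for _ in range(grad_accum):
        input_ids = next(batch_iter)  # hypothetical iterator of (4, 1024) token batches
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = model(input_ids=input_ids, labels=input_ids).loss
        scaler.scale(loss / grad_accum).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    optimizer.zero_grad(set_to_none=True)
```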
### Training Curve
- Initial loss: ~10.9 (random)
- Final loss: 5.4439 (perplexity 231.3)
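Perplexity here is just the exponential of the cross-entropy loss (in nats), so both numbers follow directly from the losses above:

```python
import math

print(math.exp(10.9))    # ~54,000 at random init (ln of the 49,152-token vocab is ~10.8)
print(math.exp(5.4439))  # ~231.3 after training
```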
## Limitations
- Not for production use: Trained on only 200M tokens (produces rough text)
- No instruction tuning: Base causal LM, not a chat model
- English only: Trained on English FineWeb-Edu data
## Related Models
- `rockerritesh/gpt2-small-fineweb-edu-200m` - GPT-2 Small (124M, 12 layers)
- `rockerritesh/smollm2-135m-fineweb-edu-200m` - SmolLM2 (135M, 30 layers, Llama-style; this model)