smollm2-135m-fineweb-edu-200m

A 135M-parameter language model trained from scratch on FineWeb-Edu.

Model Description

  • Architecture: deep-and-narrow Llama-style decoder with Grouped Query Attention (GQA)
  • Parameters: 135M
  • Layers: 30
  • Hidden size: 576
  • Attention heads: 9 query heads, 3 KV heads (GQA)
  • Context length: 1,024 tokens
  • Vocab size: 49,152
  • Final loss: 5.4439
  • Final perplexity: 231.3
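
The checkpoint can be loaded with the standard transformers API. A minimal sketch, assuming the weights are published under the hub id rockerritesh/smollm2-135m-fineweb-edu-200m in the usual safetensors layout; the prompt is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "rockerritesh/smollm2-135m-fineweb-edu-200m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Base causal LM: it completes a prompt rather than answering chat-style.
inputs = tokenizer("The water cycle begins when", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the model saw only ~200M tokens, expect rough completions (see Limitations).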

Training Details

This model was trained from scratch (random initialization) for research purposes.

Training Data

  • Dataset: FineWeb-Edu (sample-10BT split)
  • Total tokens: ~200M (streamed; see the streaming sketch below)
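
The split can be streamed with the datasets library so nothing is downloaded up front. A minimal sketch, assuming the standard hub location HuggingFaceFW/fineweb-edu and its sample-10BT configuration; the exact pipeline used for this run is not documented here:

```python
from datasets import load_dataset

# Stream FineWeb-Edu's sample-10BT subset; examples arrive lazily.
ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

for example in ds.take(2):
    print(example["text"][:200])
```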

Training Configuration

  • Batch size: 4 sequences
  • Gradient accumulation: 8 steps
  • Effective batch: 32,768 tokens/step (4 sequences × 8 accumulation steps × 1,024 tokens)
  • Total steps: 6,103 (~200M tokens)
  • Learning rate: 6e-4 with cosine decay to 6e-5 (schedule sketched below)
  • Warmup steps: 200
  • Optimizer: AdamW (beta1 = 0.9, beta2 = 0.95)
  • Weight decay: 0.1
  • Gradient clipping: 1.0
  • Precision: mixed (AMP fp16)
  • Hardware: NVIDIA T4 GPU (16 GB)
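
Put together, this corresponds to AdamW with linear warmup into a cosine schedule, fp16 AMP, and gradient-norm clipping. A minimal PyTorch sketch under the stated hyperparameters; the model is a stand-in and the inner loop is commented pseudocode, not the actual training script:

```python
import math
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(576, 576).to(device)  # stand-in for the 135M LM

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps, total_steps = 200, 6103
max_lr, min_lr = 6e-4, 6e-5

def lr_scale(step: int) -> float:
    # Linear warmup for 200 steps, then cosine decay from 6e-4 to 6e-5.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (min_lr + (max_lr - min_lr) * cosine) / max_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
accum_steps = 8  # 4 seqs x 8 micro-steps x 1,024 tokens = 32,768 tokens/step

# Per optimizer step (inner loop, pseudocode):
#   with torch.autocast(device_type=device, dtype=torch.float16):
#       loss = compute_loss(batch) / accum_steps
#   scaler.scale(loss).backward()
#   ... after accum_steps micro-batches:
#   scaler.unscale_(optimizer)
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
#   scaler.step(optimizer); scaler.update()
#   scheduler.step(); optimizer.zero_grad(set_to_none=True)
```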

Training Curve

  • Initial loss: ~10.9, close to ln(49,152) ≈ 10.8, the expected loss of a randomly initialized model predicting uniformly over the vocabulary
  • Final loss: 5.4439 (perplexity exp(5.4439) ≈ 231.3)
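
For reference, perplexity here is just the exponential of the cross-entropy loss, and the initial loss sits near ln(vocab size), which is what a uniform next-token distribution over 49,152 tokens scores:

```python
import math

print(math.exp(5.4439))  # ≈ 231.3, the reported final perplexity
print(math.log(49152))   # ≈ 10.8, loss of a uniform guess over the vocab
```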

Limitations

  • Not for production use: trained on only ~200M tokens, so generated text is rough
  • No instruction tuning: a base causal LM, not a chat model
  • English only: trained on English FineWeb-Edu data
