# smollm2-135m-fineweb-edu-200m
A 135M-parameter language model trained from scratch on FineWeb-Edu.
## Model Description
| Property | Value |
|---|---|
| Architecture | Deep & narrow Llama-style architecture with Grouped Query Attention |
| Parameters | 135M |
| Layers | 30 |
| Hidden size | 576 |
| Attention heads | 9 query / 3 key-value (GQA) |
| Context length | 1,024 tokens |
| Vocab size | 49,152 |
| Final loss | 5.4439 |
| Final perplexity | 231.3 |
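A minimal usage sketch, assuming the checkpoint loads with the standard `transformers` causal-LM classes under the repo id listed in Related Models below:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "rockerritesh/smollm2-135m-fineweb-edu-200m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Keep prompt + generation well inside the 1,024-token context window.
inputs = tokenizer("The water cycle begins when", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```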
## Training Details
This model was trained from scratch (random initialization) for research purposes.
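"From scratch" here means the weights start from random initialization rather than a pretrained checkpoint. A sketch of that setup, built from the table above (the MLP `intermediate_size` and tied embeddings are not stated in this card and are assumed to follow SmolLM2-135M):

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=49152,
    hidden_size=576,
    num_hidden_layers=30,
    num_attention_heads=9,
    num_key_value_heads=3,      # Grouped Query Attention
    max_position_embeddings=1024,
    intermediate_size=1536,     # assumed (SmolLM2-135M value); not given in the card
    tie_word_embeddings=True,   # assumed; needed to land near 135M parameters
)
model = LlamaForCausalLM(config)  # random weights, no pretrained checkpoint
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```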
### Training Data
- Dataset: FineWeb-Edu (sample-10BT subset)
- Total tokens: ~200M (streamed)
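A sketch of how the ~200M tokens can be streamed (the dataset id and subset name below refer to the public FineWeb-Edu release; the exact tokenization and packing code used for this run is not shown here):

```python
from itertools import islice
from datasets import load_dataset

# Stream FineWeb-Edu (sample-10BT) without downloading the full dataset.
stream = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)

for example in islice(stream, 3):
    print(example["text"][:200])
```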
### Training Configuration
| Hyperparameter | Value |
|---|---|
| Batch size | 4 sequences |
| Gradient accumulation | 8 steps |
| Effective batch | 32,768 tokens/step |
| Total steps | 6,103 |
| Learning rate | 6e-4 (cosine decay to 6e-5) |
| Warmup steps | 200 |
| Optimizer | AdamW (beta1=0.9, beta2=0.95) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | Mixed (AMP fp16) |
| Hardware | NVIDIA T4 GPU (16GB) |
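The table corresponds to a fairly standard PyTorch loop. A sketch under stated assumptions (`model` and `batch_iter`, an iterator of 4x1,024 token tensors, are hypothetical placeholders; this is not the exact training script):

```python
import math
import torch

max_lr, min_lr = 6e-4, 6e-5
warmup_steps, total_steps = 200, 6103
grad_accum = 8  # 4 sequences x 1,024 tokens x 8 = 32,768 tokens per optimizer step

optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step):
    # Linear warmup to 6e-4, then cosine decay down to 6e-5.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return (min_lr + (max_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))) / max_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()  # fp16 mixed precision (AMP) on the T4

for step in range(total_steps):
    for _ in range(grad_accum):
        input_ids = next(batch_iter)  # hypothetical iterator of (4, 1024) token batches
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = model(input_ids=input_ids, labels=input_ids).loss
        scaler.scale(loss / grad_accum).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    optimizer.zero_grad(set_to_none=True)
```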
### Training Curve
- Initial loss: ~10.9 (random)
- Final loss: 5.4439 (perplexity 231.3)
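Perplexity here is just the exponential of the cross-entropy loss (in nats), so both numbers follow directly from the losses above:

```python
import math

print(math.exp(10.9))    # ~54,000 at random init (ln of the 49,152-token vocab is ~10.8)
print(math.exp(5.4439))  # ~231.3 after training
```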
## Limitations
- Not for production use: Trained on only 200M tokens (produces rough text)
- No instruction tuning: Base causal LM, not a chat model
- English only: Trained on English FineWeb-Edu data
## Related Models
- `rockerritesh/gpt2-small-fineweb-edu-200m` - GPT-2 Small (124M, 12 layers)
- `rockerritesh/smollm2-135m-fineweb-edu-200m` - SmolLM2 (135M, 30 layers, Llama-style; this model)