# LEO-Storyteller-base-1B
A 1.3B parameter Mixture-of-Experts (MoE) language model trained from scratch on TinyStories for children's story generation.
## Model Details
| Parameter | Value |
|---|---|
| Total Parameters | 1.3B |
| Active Parameters per Token | ~500M (sparse routing) |
| Architecture | Transformer + Sparse MoE |
| Hidden Size | 2048 |
| Layers | 16 |
| Attention Heads | 16 |
| Head Dim | 128 |
| Intermediate (FFN) Size | 5632 |
| Num Experts | 4 |
| Expert Interval | 2 (MoE every 2nd layer) |
| Context Length | 1024 tokens |
| Vocab Size | 50,277 |
| Tokenizer | GPT-NeoX (EleutherAI/gpt-neox-20b) |
| Activation | GELU |
| Positional Encoding | RoPE |
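For reference, the table above maps onto a configuration along the lines of the sketch below. The field names are illustrative; the model's actual config class is not published in this card.

```python
# Illustrative configuration mirroring the table above.
# Field names are hypothetical -- the actual config class is not shown in this card.
from dataclasses import dataclass


@dataclass
class LeoStorytellerConfig:
    vocab_size: int = 50_277            # GPT-NeoX tokenizer vocabulary
    hidden_size: int = 2048
    num_layers: int = 16
    num_heads: int = 16                 # 16 heads x 128 head dim = 2048 hidden size
    head_dim: int = 128
    intermediate_size: int = 5632       # per-expert FFN width
    num_experts: int = 4
    expert_interval: int = 2            # MoE block every 2nd layer
    max_position_embeddings: int = 1024
    dropout: float = 0.1
    activation: str = "gelu"
    positional_encoding: str = "rope"
```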
## Architecture
This model uses a custom Mixture-of-Experts Transformer architecture:
- Sparse MoE layers are placed every 2nd transformer block (8 out of 16 layers are MoE layers)
- Soft routing with load balancing via auxiliary loss and z-loss regularization
- Rotary Positional Embeddings (RoPE) for position encoding
- 4 expert FFN networks per MoE layer, with a learned router selecting experts per token
Because only a subset of experts is active for each token, the per-token compute is close to that of a ~500M-parameter dense model, while the full 1.3B parameters provide additional capacity.
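The card does not include the routing code, so the sketch below is only a generic reference for how a layer like this is commonly built: a learned linear router over 4 expert FFNs, a Switch-style load-balancing auxiliary loss, and a router z-loss. The top-k value and loss coefficients here are assumptions, not the model's actual settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """A single expert: a standard GELU feed-forward block."""

    def __init__(self, hidden_size=2048, intermediate_size=5632, dropout=0.1):
        super().__init__()
        self.up = nn.Linear(hidden_size, intermediate_size)
        self.down = nn.Linear(intermediate_size, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.down(F.gelu(self.up(x))))


class MoELayer(nn.Module):
    """Token-level MoE with a learned router, load-balancing aux loss, and z-loss.
    Illustrative only -- the card does not publish the exact routing scheme."""

    def __init__(self, hidden_size=2048, intermediate_size=5632, num_experts=4, top_k=1):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList(
            ExpertFFN(hidden_size, intermediate_size) for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (batch, seq, hidden)
        logits = self.router(x)                 # (batch, seq, num_experts)
        probs = logits.softmax(dim=-1)
        weights, chosen = probs.topk(self.top_k, dim=-1)

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[..., k] == e      # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])

        # Switch-Transformer-style load-balancing loss: pushes expert usage toward uniform.
        num_experts = probs.size(-1)
        frac_tokens = F.one_hot(chosen[..., 0], num_experts).float().mean(dim=(0, 1))
        frac_probs = probs.mean(dim=(0, 1))
        aux_loss = num_experts * (frac_tokens * frac_probs).sum()

        # Router z-loss: keeps router logits small for numerical stability.
        z_loss = (torch.logsumexp(logits, dim=-1) ** 2).mean()

        return out, aux_loss, z_loss
```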
## Training

### Data
- Dataset: roneneldan/TinyStories (default split)
- Training samples: 460,656 chunks of 1024 tokens (from 2.1M stories)
- Validation samples: 4,630 chunks
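The exact preprocessing pipeline is not published; a typical way to produce fixed 1024-token chunks from TinyStories with the GPT-NeoX tokenizer looks roughly like this (treat the details as assumptions):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
dataset = load_dataset("roneneldan/TinyStories")   # 'train' and 'validation' splits

BLOCK_SIZE = 1024  # model context length


def tokenize(batch):
    # Append EOS between stories so concatenated chunks keep story boundaries.
    return tokenizer([text + tokenizer.eos_token for text in batch["text"]])


def group_into_blocks(batch):
    # Concatenate all token ids, then slice into fixed 1024-token chunks,
    # dropping the final partial chunk.
    ids = sum(batch["input_ids"], [])
    total = (len(ids) // BLOCK_SIZE) * BLOCK_SIZE
    chunks = [ids[i : i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]
    return {"input_ids": chunks, "labels": [list(c) for c in chunks]}


tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)
lm_chunks = tokenized.map(
    group_into_blocks, batched=True, remove_columns=tokenized["train"].column_names
)
```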
### Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 1e-4 (cosine decay) |
| Warmup Steps | 200 |
| Batch Size (per device) | 4 |
| Gradient Accumulation | 4 |
| Effective Batch Size | 128 |
| Epochs | 3 |
| Total Steps | 10,500 |
| Precision | FP16 (mixed precision) |
| Optimizer | AdamW |
| Dropout | 0.1 |
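In code, the optimizer and schedule above correspond roughly to the following; weight decay and Adam betas are not reported in the card, so library defaults are assumed:

```python
import torch
from transformers import get_cosine_schedule_with_warmup


def build_optimizer_and_scheduler(model: torch.nn.Module):
    # AdamW at 1e-4 with 200 warmup steps and cosine decay over 10,500 total steps.
    # Weight decay and betas are not reported in the card; defaults are assumed.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=200,
        num_training_steps=10_500,
    )
    return optimizer, scheduler
```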
### Infrastructure
- GPUs: 8 GPUs, trained with FSDP (Fully Sharded Data Parallel)
- Training Time: ~18 hours
- Framework: PyTorch + Accelerate
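FSDP sharding and FP16 are typically configured at launch time (e.g. via `accelerate launch` with an FSDP config), with the training loop itself routed through Accelerate. The loop below is a generic sketch of that setup, not the project's actual training script:

```python
from accelerate import Accelerator

# FSDP sharding itself is enabled via the accelerate launch configuration;
# the training code only needs to go through the Accelerator.
accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=4)
model, optimizer, scheduler, train_loader = accelerator.prepare(
    model, optimizer, scheduler, train_loader
)

model.train()
for batch in train_loader:
    with accelerator.accumulate(model):
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```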
### Training Curves
| Step | Eval Loss | Eval Perplexity |
|---|---|---|
| 300 | 3.386 | 29.54 |
| 1,500 | 1.728 | 5.63 |
| 3,000 | 1.469 | 4.34 |
| 5,000 | 1.338 | 3.81 |
| 7,500 | 1.271 | 3.57 |
| 10,500 | 1.249 | 3.49 |
Final eval loss: 1.249 | Final eval perplexity: 3.49
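(Perplexity here is simply the exponential of the eval loss, e.g. exp(1.249) ≈ 3.49.)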
## Intended Use
This model is designed for generating short children's stories in the style of the TinyStories dataset. It can produce simple, coherent narratives suitable for young children (ages 3-6).
**Good for:**
- Generating short children's stories from prompts
- Creative writing assistance for simple narratives
- Research on MoE architectures and small language models
**Not designed for:**
- Factual question answering
- Complex reasoning tasks
- Production chatbot applications
- Content outside the children's story domain
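For the intended use case, a minimal generation sketch is shown below. It assumes the checkpoint can be loaded through transformers with `trust_remote_code=True`; since the MoE architecture is custom, the actual loading path may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "check-ai-labs/leo-storyteller-base-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Custom architecture: assumes the repo ships its own modeling code.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

prompt = "Once upon a time, a little fox found a shiny red ball."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```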
## Limitations
- Trained exclusively on TinyStories -- outputs will be limited to simple children's story style and vocabulary
- 1024 token context window limits story length
- May produce repetitive patterns in longer generations
- Not instruction-tuned -- responds best to story-style prompts ("Once upon a time...")
- No safety/RLHF alignment
## License

MIT