LEO-Storyteller-base-1B

A 1.3B-parameter Mixture-of-Experts (MoE) language model trained from scratch on TinyStories for children's story generation.

Model Details

| Parameter | Value |
|---|---|
| Total Parameters | 1.3B |
| Active Parameters per Token | ~500M (sparse routing) |
| Architecture | Transformer + Sparse MoE |
| Hidden Size | 2048 |
| Layers | 16 |
| Attention Heads | 16 |
| Head Dim | 128 |
| Intermediate (FFN) Size | 5632 |
| Num Experts | 4 |
| Expert Interval | 2 (MoE every 2nd layer) |
| Context Length | 1024 tokens |
| Vocab Size | 50,277 |
| Tokenizer | GPT-NeoX (EleutherAI/gpt-neox-20b) |
| Activation | GELU |
| Positional Encoding | RoPE |

Architecture

This model uses a custom Mixture-of-Experts Transformer architecture:

  • Sparse MoE layers are placed every 2nd transformer block (8 out of 16 layers are MoE layers)
  • Soft routing with load balancing via auxiliary loss and z-loss regularization
  • Rotary Positional Embeddings (RoPE) for position encoding
  • 4 expert FFN networks per MoE layer, with a learned router selecting experts per token

The architecture activates only a subset of the experts for each token, so per-token compute stays well below that of a dense model with the same 1.3B total parameter count; a toy sketch of the routing follows.
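
The card ships no reference code, and it mentions both soft routing and per-token expert selection, so the sketch below is only an illustration of the general pattern: a top-k variant with k=2 as an assumption (the card does not state k). Class, argument, and variable names are hypothetical, not the checkpoint's actual module names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoESketch(nn.Module):
    """Toy MoE block in the spirit of the description above: a linear router
    scores 4 GELU expert FFNs per token and mixes the top-k of them."""

    def __init__(self, hidden=2048, ffn=5632, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (tokens, hidden), batch and seq flattened
        logits = self.router(x)                        # (tokens, num_experts)
        probs = logits.softmax(dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)  # experts chosen per token
        weights = weights / weights.sum(-1, keepdim=True)

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])

        # Switch-style load-balancing auxiliary loss: pushes the fraction of
        # tokens per expert toward the mean router probability per expert.
        frac = F.one_hot(idx[:, 0], probs.size(-1)).float().mean(0)
        aux_loss = probs.size(-1) * (frac * probs.mean(0)).sum()
        # z-loss: keeps router logits small and the softmax well-conditioned.
        z_loss = torch.logsumexp(logits, dim=-1).pow(2).mean()
        return out, aux_loss, z_loss
```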

Training

Data

  • Dataset: roneneldan/TinyStories (default split)
  • Training samples: 460,656 chunks of 1024 tokens, packed from 2.1M stories (see the packing sketch below)
  • Validation samples: 4,630 chunks
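
For illustration, a minimal packing recipe that would produce fixed 1024-token chunks like those counted above. The card does not publish its preprocessing script, so treat this as an assumption about the pipeline; the `pack` helper is hypothetical.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
ds = load_dataset("roneneldan/TinyStories", split="train")

def pack(batch, block=1024):
    # Tokenize each story, append EOS, concatenate, then slice into
    # fixed-length blocks; any remainder shorter than a block is dropped.
    ids = []
    for text in batch["text"]:
        ids.extend(tok(text).input_ids + [tok.eos_token_id])
    n = len(ids) // block * block
    return {"input_ids": [ids[i:i + block] for i in range(0, n, block)]}

chunks = ds.map(pack, batched=True, remove_columns=ds.column_names)
```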

Hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 1e-4 (cosine decay) |
| Warmup Steps | 200 |
| Batch Size (per device) | 4 |
| Gradient Accumulation | 4 |
| Effective Batch Size | 128 (4 per device × 4 accumulation × 8 GPUs) |
| Epochs | 3 |
| Total Steps | 10,500 |
| Precision | FP16 (mixed precision) |
| Optimizer | AdamW |
| Dropout | 0.1 |
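
As a sketch of the schedule in the table (linear warmup for 200 steps, then cosine decay from 1e-4 over the 10,500 total steps), using the stock `transformers` helper; the linear module is just a placeholder so the snippet runs standalone.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

opt = torch.optim.AdamW(torch.nn.Linear(8, 8).parameters(), lr=1e-4)
sched = get_cosine_schedule_with_warmup(
    opt, num_warmup_steps=200, num_training_steps=10_500
)

lrs = []
for _ in range(10_500):
    lrs.append(sched.get_last_lr()[0])
    sched.step()
# 0 at step 0, ~1e-4 at the end of warmup, decaying toward 0 at the final step
print(lrs[0], lrs[199], lrs[-1])
```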

Infrastructure

  • GPUs: 8 GPUs with FSDP (Fully Sharded Data Parallel)
  • Training Time: ~18 hours
  • Framework: PyTorch + Accelerate (training-loop skeleton below)
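
A hedged skeleton of what an Accelerate-driven FP16 loop with 4-step gradient accumulation looks like. FSDP sharding itself is switched on via `accelerate launch` with an FSDP-enabled config, and the card gives no details of the actual run (wrap policy, sharding strategy), so nothing here should be read as the exact harness.

```python
# Skeleton only: `model`, `opt`, `loader`, and `sched` stand in for the real
# objects built earlier; FSDP is configured outside the script.
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=4)
model, opt, loader, sched = accelerator.prepare(model, opt, loader, sched)

model.train()
for batch in loader:
    with accelerator.accumulate(model):  # handles the 4-step accumulation
        loss = model(**batch).loss
        accelerator.backward(loss)       # FP16-safe scaled backward
        opt.step()
        sched.step()
        opt.zero_grad()
```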

Training Curves

| Step | Eval Loss | Eval Perplexity |
|---|---|---|
| 300 | 3.386 | 29.54 |
| 1,500 | 1.728 | 5.63 |
| 3,000 | 1.469 | 4.34 |
| 5,000 | 1.338 | 3.81 |
| 7,500 | 1.271 | 3.57 |
| 10,500 | 1.249 | 3.49 |

Final eval loss: 1.249 | Final eval perplexity: 3.49
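
The perplexity column is simply the exponential of the eval loss, e.g. for the final checkpoint:

```python
import math
print(math.exp(1.249))  # 3.4869..., matching the reported 3.49
```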

Intended Use

This model is designed for generating short children's stories in the style of the TinyStories dataset. It can produce simple, coherent narratives suitable for young children (ages 3-6).
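
The card includes no usage snippet; assuming the checkpoint loads through `transformers` (the custom MoE architecture presumably requires `trust_remote_code=True`), generation could look like the following sketch.

```python
# Hypothetical usage; repo and tokenizer ids are taken from the card, but the
# loading path is an assumption, not documented behavior.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = AutoModelForCausalLM.from_pretrained(
    "check-ai-labs/leo-storyteller-base-1B", trust_remote_code=True
)

prompt = "Once upon a time, a little fox found a shiny red ball."
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=200, do_sample=True, temperature=0.8)
print(tok.decode(out[0], skip_special_tokens=True))
```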

Good for:

  • Generating short children's stories from prompts
  • Creative writing assistance for simple narratives
  • Research on MoE architectures and small language models

Not designed for:

  • Factual question answering
  • Complex reasoning tasks
  • Production chatbot applications
  • Content outside the children's story domain

Limitations

  • Trained exclusively on TinyStories -- outputs will be limited to simple children's story style and vocabulary
  • 1024 token context window limits story length
  • May produce repetitive patterns in longer generations
  • Not instruction-tuned -- responds best to story-style prompts ("Once upon a time...")
  • No safety/RLHF alignment

License

MIT
