# LEO-Storyteller-base-1B
A 1.3B parameter Mixture-of-Experts (MoE) language model trained from scratch on TinyStories for children's story generation.
## Model Details
| Parameter | Value |
|---|---|
| Total Parameters | 1.3B |
| Active Parameters per Token | ~500M (sparse routing) |
| Architecture | Transformer + Sparse MoE |
| Hidden Size | 2048 |
| Layers | 16 |
| Attention Heads | 16 |
| Head Dim | 128 |
| Intermediate (FFN) Size | 5632 |
| Num Experts | 4 |
| Expert Interval | 2 (MoE every 2nd layer) |
| Context Length | 1024 tokens |
| Vocab Size | 50,277 |
| Tokenizer | GPT-NeoX (EleutherAI/gpt-neox-20b) |
| Activation | GELU |
| Positional Encoding | RoPE |
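For reference, the table above maps onto a configuration along the lines of the sketch below. The field names are illustrative; the model's actual config class is not published in this card.

```python
# Illustrative configuration mirroring the table above.
# Field names are hypothetical -- the actual config class is not shown in this card.
from dataclasses import dataclass


@dataclass
class LeoStorytellerConfig:
    vocab_size: int = 50_277            # GPT-NeoX tokenizer vocabulary
    hidden_size: int = 2048
    num_layers: int = 16
    num_heads: int = 16                 # 16 heads x 128 head dim = 2048 hidden size
    head_dim: int = 128
    intermediate_size: int = 5632       # per-expert FFN width
    num_experts: int = 4
    expert_interval: int = 2            # MoE block every 2nd layer
    max_position_embeddings: int = 1024
    dropout: float = 0.1
    activation: str = "gelu"
    positional_encoding: str = "rope"
```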
## Architecture
This model uses a custom Mixture-of-Experts Transformer architecture:
- Sparse MoE layers are placed every 2nd transformer block (8 out of 16 layers are MoE layers)
- Soft routing with load balancing via auxiliary loss and z-loss regularization
- Rotary Positional Embeddings (RoPE) for position encoding
- 4 expert FFN networks per MoE layer, with a learned router selecting experts per token
Because only a subset of experts is active for each token, the per-token compute is close to that of a ~500M-parameter dense model, while the full 1.3B parameters provide additional capacity.
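The card does not include the routing code, so the sketch below is only a generic reference for how a layer like this is commonly built: a learned linear router over 4 expert FFNs, a Switch-style load-balancing auxiliary loss, and a router z-loss. The top-k value and loss coefficients here are assumptions, not the model's actual settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """A single expert: a standard GELU feed-forward block."""

    def __init__(self, hidden_size=2048, intermediate_size=5632, dropout=0.1):
        super().__init__()
        self.up = nn.Linear(hidden_size, intermediate_size)
        self.down = nn.Linear(intermediate_size, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.down(F.gelu(self.up(x))))


class MoELayer(nn.Module):
    """Token-level MoE with a learned router, load-balancing aux loss, and z-loss.
    Illustrative only -- the card does not publish the exact routing scheme."""

    def __init__(self, hidden_size=2048, intermediate_size=5632, num_experts=4, top_k=1):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList(
            ExpertFFN(hidden_size, intermediate_size) for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (batch, seq, hidden)
        logits = self.router(x)                 # (batch, seq, num_experts)
        probs = logits.softmax(dim=-1)
        weights, chosen = probs.topk(self.top_k, dim=-1)

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[..., k] == e      # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])

        # Switch-Transformer-style load-balancing loss: pushes expert usage toward uniform.
        num_experts = probs.size(-1)
        frac_tokens = F.one_hot(chosen[..., 0], num_experts).float().mean(dim=(0, 1))
        frac_probs = probs.mean(dim=(0, 1))
        aux_loss = num_experts * (frac_tokens * frac_probs).sum()

        # Router z-loss: keeps router logits small for numerical stability.
        z_loss = (torch.logsumexp(logits, dim=-1) ** 2).mean()

        return out, aux_loss, z_loss
```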
## Training

### Data
- Dataset: roneneldan/TinyStories (default split)
- Training samples: 460,656 chunks of 1024 tokens (from 2.1M stories)
- Validation samples: 4,630 chunks
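The exact preprocessing pipeline is not published; a typical way to produce fixed 1024-token chunks from TinyStories with the GPT-NeoX tokenizer looks roughly like this (treat the details as assumptions):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
dataset = load_dataset("roneneldan/TinyStories")   # 'train' and 'validation' splits

BLOCK_SIZE = 1024  # model context length


def tokenize(batch):
    # Append EOS between stories so concatenated chunks keep story boundaries.
    return tokenizer([text + tokenizer.eos_token for text in batch["text"]])


def group_into_blocks(batch):
    # Concatenate all token ids, then slice into fixed 1024-token chunks,
    # dropping the final partial chunk.
    ids = sum(batch["input_ids"], [])
    total = (len(ids) // BLOCK_SIZE) * BLOCK_SIZE
    chunks = [ids[i : i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]
    return {"input_ids": chunks, "labels": [list(c) for c in chunks]}


tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)
lm_chunks = tokenized.map(
    group_into_blocks, batched=True, remove_columns=tokenized["train"].column_names
)
```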
### Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 1e-4 (cosine decay) |
| Warmup Steps | 200 |
| Batch Size (per device) | 4 |
| Gradient Accumulation | 4 |
| Effective Batch Size | 128 |
| Epochs | 3 |
| Total Steps | 10,500 |
| Precision | FP16 (mixed precision) |
| Optimizer | AdamW |
| Dropout | 0.1 |
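In code, the optimizer and schedule above correspond roughly to the following; weight decay and Adam betas are not reported in the card, so library defaults are assumed:

```python
import torch
from transformers import get_cosine_schedule_with_warmup


def build_optimizer_and_scheduler(model: torch.nn.Module):
    # AdamW at 1e-4 with 200 warmup steps and cosine decay over 10,500 total steps.
    # Weight decay and betas are not reported in the card; defaults are assumed.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=200,
        num_training_steps=10_500,
    )
    return optimizer, scheduler
```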
### Infrastructure
- GPUs: 8 GPUs, trained with FSDP (Fully Sharded Data Parallel)
- Training Time: ~18 hours
- Framework: PyTorch + Accelerate
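FSDP sharding and FP16 are typically configured at launch time (e.g. via `accelerate launch` with an FSDP config), with the training loop itself routed through Accelerate. The loop below is a generic sketch of that setup, not the project's actual training script:

```python
from accelerate import Accelerator

# FSDP sharding itself is enabled via the accelerate launch configuration;
# the training code only needs to go through the Accelerator.
accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=4)
model, optimizer, scheduler, train_loader = accelerator.prepare(
    model, optimizer, scheduler, train_loader
)

model.train()
for batch in train_loader:
    with accelerator.accumulate(model):
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```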
### Training Curves
| Step | Eval Loss | Eval Perplexity |
|---|---|---|
| 300 | 3.386 | 29.54 |
| 1,500 | 1.728 | 5.63 |
| 3,000 | 1.469 | 4.34 |
| 5,000 | 1.338 | 3.81 |
| 7,500 | 1.271 | 3.57 |
| 10,500 | 1.249 | 3.49 |
Final eval loss: 1.249 | Final eval perplexity: 3.49
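(Perplexity here is simply the exponential of the eval loss, e.g. exp(1.249) ≈ 3.49.)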
## Intended Use
This model is designed for generating short children's stories in the style of the TinyStories dataset. It can produce simple, coherent narratives suitable for young children (ages 3-6).
**Good for:**
- Generating short children's stories from prompts
- Creative writing assistance for simple narratives
- Research on MoE architectures and small language models
**Not designed for:**
- Factual question answering
- Complex reasoning tasks
- Production chatbot applications
- Content outside the children's story domain
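For the intended use case, a minimal generation sketch is shown below. It assumes the checkpoint can be loaded through transformers with `trust_remote_code=True`; since the MoE architecture is custom, the actual loading path may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "check-ai-labs/leo-storyteller-base-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Custom architecture: assumes the repo ships its own modeling code.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

prompt = "Once upon a time, a little fox found a shiny red ball."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```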
## Limitations
- Trained exclusively on TinyStories -- outputs will be limited to simple children's story style and vocabulary
- 1024 token context window limits story length
- May produce repetitive patterns in longer generations
- Not instruction-tuned -- responds best to story-style prompts ("Once upon a time...")
- No safety/RLHF alignment
## License

MIT