# leo-storyteller-chat-1B
A 1.4B-parameter Mixture-of-Experts (MoE) chat model, fine-tuned on multi-turn conversations from ShareGPT. Built on top of check-ai-labs/leo-storyteller-base-1B, a base model pre-trained from scratch on TinyStories.
This is a custom architecture trained entirely from scratch -- not a fine-tune of an existing open-source model. The base model and chat fine-tune were both trained by the author.
## Chat Format
This model uses the ChatML template:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```

Special tokens added for chat: `<|im_start|>` (id: 50277), `<|im_end|>` (id: 50278).
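Rendering a conversation into this template can be sketched as follows. Note that `build_chatml_prompt` is a hypothetical helper written for illustration, not part of the model repo; the template layout and special tokens come from the card above.

```python
# Minimal sketch of building a ChatML prompt for this model.
# The template and special tokens (<|im_start|>, <|im_end|>) match the card;
# this helper itself is illustrative, not shipped with the model.

def build_chatml_prompt(messages):
    """Render a list of {role, content} dicts into ChatML."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    # End with an open assistant turn so generation continues from there.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

The resulting string ends with an open `<|im_start|>assistant` turn, so the model's generated tokens (up to `<|im_end|>`) form the assistant reply.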
## Model Details
| Parameter | Value |
|---|---|
| Total Parameters | 1.4B |
| Architecture | Transformer + Sparse MoE |
| Hidden Size | 2048 |
| Layers | 16 |
| Attention Heads | 16 |
| Head Dim | 128 |
| Intermediate (FFN) Size | 5632 |
| Num Experts | 4 |
| Expert Interval | 2 (MoE every 2nd layer) |
| Context Length | 1024 tokens |
| Vocab Size | 50,279 (base 50,277 + 2 ChatML tokens) |
| Tokenizer | GPT-NeoX (EleutherAI/gpt-neox-20b) |
| Activation | GELU |
| Positional Encoding | RoPE |
## Architecture
This model uses a custom Mixture-of-Experts Transformer architecture:
- Sparse MoE layers placed every 2nd transformer block (8 out of 16 layers are MoE)
- Soft routing with load balancing via auxiliary loss and z-loss regularization
- Rotary Positional Embeddings (RoPE) for position encoding
- 4 expert FFN networks per MoE layer with a learned router selecting experts per token
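The MoE pattern described above can be sketched in PyTorch as below. This is an illustration of the general technique (soft routing over 4 expert FFNs, plus load-balancing and z-loss terms), not the repo's actual implementation; the tiny dimensions and the exact loss formulas are assumptions.

```python
# Illustrative sparse-MoE layer: a learned router mixes 4 expert FFNs per token.
# NOT the model's actual code; dimensions and loss coefficients are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, hidden=2048, ffn=5632, num_experts=4):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (batch, seq, hidden)
        logits = self.router(x)                 # (batch, seq, num_experts)
        weights = F.softmax(logits, dim=-1)     # soft routing: weighted expert mix
        out = sum(weights[..., i:i + 1] * expert(x)
                  for i, expert in enumerate(self.experts))
        # Load-balancing auxiliary loss: penalize uneven average expert usage
        # (one common variant; the actual formulation may differ).
        mean_usage = weights.mean(dim=(0, 1))
        aux_loss = (mean_usage * mean_usage).sum() * len(self.experts)
        # z-loss: keep router logits small for numerical stability.
        z_loss = torch.logsumexp(logits, dim=-1).pow(2).mean()
        return out, aux_loss, z_loss

layer = MoELayer(hidden=64, ffn=128, num_experts=4)  # tiny dims for a quick check
y, aux, z = layer(torch.randn(2, 8, 64))
```

With soft routing every expert processes every token and the router only weights their outputs; the auxiliary terms keep the router from collapsing onto a single expert.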
## Training

Two-stage training pipeline:

1. **Pre-training (base model):** check-ai-labs/leo-storyteller-base-1B -- trained from scratch on TinyStories (~2.1M stories, 460K chunks of 1024 tokens) for 3 epochs / 10,500 steps. Final eval loss: 1.249, perplexity: 3.49.
2. **Chat fine-tuning (this model):** fine-tuned on multi-turn conversations from ShareGPT_V3_unfiltered_cleaned_split using ChatML formatting, with the loss masked to assistant turns only.
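Assistant-only loss masking can be sketched as follows. This is a minimal illustration assuming the common convention of setting ignored label positions to -100 (the default `ignore_index` of PyTorch's cross-entropy loss); the token ids and spans are made up.

```python
# Sketch of assistant-only loss masking: tokens outside assistant turns get
# label -100, which PyTorch cross-entropy ignores, so loss is computed only
# on assistant responses. Token ids and spans below are illustrative.
IGNORE_INDEX = -100

def mask_labels(token_ids, assistant_spans):
    """assistant_spans: list of (start, end) index ranges covering assistant turns."""
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in assistant_spans:
        labels[start:end] = token_ids[start:end]
    return labels

token_ids = list(range(10))                 # stand-in for real tokenizer output
labels = mask_labels(token_ids, [(6, 10)])  # only the last 4 tokens are assistant
print(labels)  # [-100, -100, -100, -100, -100, -100, 6, 7, 8, 9]
```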
### Chat Fine-tuning Details
| Parameter | Value |
|---|---|
| Base Model | check-ai-labs/leo-storyteller-base-1B |
| Dataset | ShareGPT V3 unfiltered (94,145 conversations) |
| Training Examples | 84,481 (after filtering) |
| Eval Examples | 4,436 |
| Learning Rate | 2e-5 (cosine decay) |
| Batch Size (per device) | 4 |
| Gradient Accumulation | 4 |
| Effective Batch Size | 128 |
| Epochs | 3 |
| Total Steps | 1,800 (early stopped from 1,980) |
| Precision | FP16 (mixed precision) |
| Optimizer | AdamW |
| Loss Masking | Assistant turns only (~96% of tokens trained) |
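The effective batch size in the table follows from the per-device batch, gradient-accumulation steps, and the 8 GPUs listed under Infrastructure:

```python
# Sanity check of the effective batch size reported above:
# per-device batch * gradient-accumulation steps * number of GPUs.
per_device_batch = 4
grad_accum_steps = 4
num_gpus = 8

effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 128
```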
### Infrastructure
- GPUs: 8x GPU with FSDP (Fully Sharded Data Parallel)
- Training Time: ~3.5 hours (fine-tuning only)
- Framework: PyTorch + Accelerate
### Training Curves
| Step | Eval Loss | Eval Perplexity |
|---|---|---|
| 200 | 4.153 | 63.66 |
| 400 | 3.666 | 39.10 |
| 800 | 3.230 | 25.29 |
| 1,200 | 3.045 | 21.02 |
| 1,600 | 2.976 | 19.60 |
| 1,800 | 2.967 | 19.44 |
Final eval loss: 2.967 | Final eval perplexity: 19.44
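The perplexity column is simply the exponential of the eval loss, which the final numbers confirm (the small rounding gap comes from the loss being reported to three decimals):

```python
# Perplexity is exp(cross-entropy loss); check the final reported values.
import math

final_loss = 2.967
perplexity = math.exp(final_loss)
print(round(perplexity, 2))  # ~19.43, matching the reported 19.44 up to rounding
```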
## Limitations
- Base model was pre-trained on TinyStories, so the model's world knowledge is limited to simple children's narrative patterns
- 1024 token context window limits conversation length
- Not safety-aligned (no RLHF/DPO)
- May produce repetitive or incoherent outputs for complex queries outside its training distribution
- Perplexity is higher than models pre-trained on broader web corpora -- this is expected given the narrow pre-training data
## License
MIT