# leo-storyteller-chat-1B

A 1.4B parameter Mixture-of-Experts (MoE) chat model, fine-tuned on multi-turn conversations from ShareGPT. Built on top of `check-ai-labs/leo-storyteller-base-1B`, a base model pre-trained from scratch on TinyStories.

This is a custom architecture trained entirely from scratch -- not a fine-tune of an existing open-source model. The base model and chat fine-tune were both trained by the author.

## Chat Format

This model uses the ChatML template:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```

Special tokens added for chat: `<|im_start|>` (id 50277) and `<|im_end|>` (id 50278).
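A prompt in this format can be assembled with a small helper. This is a minimal sketch; the helper name `build_chatml_prompt` is illustrative and not part of the released code:

```python
def build_chatml_prompt(messages):
    """Render a list of {"role", "content"} dicts into the ChatML format
    used by this model, leaving the prompt open for the assistant reply."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
```

The generated text should then be cut at the first `<|im_end|>` produced by the model.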

## Model Details

| Parameter | Value |
|---|---|
| Total Parameters | 1.4B |
| Architecture | Transformer + Sparse MoE |
| Hidden Size | 2048 |
| Layers | 16 |
| Attention Heads | 16 |
| Head Dim | 128 |
| Intermediate (FFN) Size | 5632 |
| Num Experts | 4 |
| Expert Interval | 2 (MoE every 2nd layer) |
| Context Length | 1024 tokens |
| Vocab Size | 50,279 (base 50,277 + 2 ChatML tokens) |
| Tokenizer | GPT-NeoX (`EleutherAI/gpt-neox-20b`) |
| Activation | GELU |
| Positional Encoding | RoPE |

## Architecture

This model uses a custom Mixture-of-Experts Transformer architecture:

- Sparse MoE layers placed every 2nd transformer block (8 of the 16 layers are MoE)
- Soft routing with load balancing via an auxiliary loss and z-loss regularization
- Rotary Positional Embeddings (RoPE) for position encoding
- 4 expert FFN networks per MoE layer, with a learned router selecting experts per token
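The expert placement and load-balancing idea above can be sketched in plain Python. This is a simplified illustration only: the real router operates on tensors, the exact layer indexing is an assumption, and the auxiliary loss here is a common simplified form (it is minimized at 1.0 when expert usage is uniform), not necessarily the exact loss used in training:

```python
import math

NUM_LAYERS, EXPERT_INTERVAL, NUM_EXPERTS = 16, 2, 4

# Layers whose FFN is a sparse MoE block (every 2nd layer; exact
# indexing is an assumption -- here: layers 1, 3, ..., 15).
moe_layers = [i for i in range(NUM_LAYERS) if i % EXPERT_INTERVAL == 1]

def router_probs(logits):
    """Softmax over a token's router logits (one logit per expert)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balance_loss(batch_probs):
    """Simplified load-balancing auxiliary loss: N * sum_e(mean_prob_e^2).
    Equals 1.0 when routing is perfectly uniform, grows as it collapses."""
    n = len(batch_probs)
    mean_p = [sum(p[e] for p in batch_probs) / n for e in range(NUM_EXPERTS)]
    return NUM_EXPERTS * sum(p * p for p in mean_p)
```

With this placement rule, 8 of the 16 layers carry expert FFNs, matching the table above.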

## Training

### Two-stage training pipeline

1. **Pre-training (base model):** `check-ai-labs/leo-storyteller-base-1B` -- trained from scratch on TinyStories (~2.1M stories, 460K chunks of 1024 tokens) for 3 epochs / 10,500 steps. Final eval loss: 1.249 (perplexity 3.49).

2. **Chat fine-tuning (this model):** fine-tuned on multi-turn conversations from ShareGPT_V3_unfiltered_cleaned_split, formatted with ChatML and with the loss masked to assistant turns only.
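Masking the loss to assistant turns is typically done by setting non-assistant label positions to PyTorch's cross-entropy ignore index. A minimal sketch; the helper name and per-token role annotations are illustrative, not the card's actual preprocessing code:

```python
IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss default ignore_index

def mask_labels(token_ids, roles):
    """Copy input ids to labels, but mask every token that is not part
    of an assistant turn so only assistant text contributes to the loss."""
    return [t if r == "assistant" else IGNORE_INDEX
            for t, r in zip(token_ids, roles)]

ids   = [11, 22, 33, 44, 55]
roles = ["system", "user", "assistant", "assistant", "user"]
labels = mask_labels(ids, roles)  # [-100, -100, 33, 44, -100]
```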

### Chat Fine-tuning Details

| Parameter | Value |
|---|---|
| Base Model | `check-ai-labs/leo-storyteller-base-1B` |
| Dataset | ShareGPT V3 unfiltered (94,145 conversations) |
| Training Examples | 84,481 (after filtering) |
| Eval Examples | 4,436 |
| Learning Rate | 2e-5 (cosine decay) |
| Batch Size (per device) | 4 |
| Gradient Accumulation | 4 |
| Effective Batch Size | 128 |
| Epochs | 3 |
| Total Steps | 1,800 (early stopped from 1,980) |
| Precision | FP16 (mixed precision) |
| Optimizer | AdamW |
| Loss Masking | Assistant turns only (~96% of tokens trained) |
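The effective batch size follows directly from the per-device batch, the gradient-accumulation steps, and the 8 GPUs listed under Infrastructure:

```python
per_device_batch = 4   # Batch Size (per device)
grad_accum_steps = 4   # Gradient Accumulation
num_gpus = 8           # data-parallel workers under FSDP

effective_batch = per_device_batch * grad_accum_steps * num_gpus  # 128
```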

### Infrastructure

- **GPUs:** 8x GPU with FSDP (Fully Sharded Data Parallel)
- **Training Time:** ~3.5 hours (fine-tuning only)
- **Framework:** PyTorch + Accelerate

### Training Curves

| Step | Eval Loss | Eval Perplexity |
|---|---|---|
| 200 | 4.153 | 63.66 |
| 400 | 3.666 | 39.10 |
| 800 | 3.230 | 25.29 |
| 1,200 | 3.045 | 21.02 |
| 1,600 | 2.976 | 19.60 |
| 1,800 | 2.967 | 19.44 |

**Final eval loss:** 2.967 | **Final eval perplexity:** 19.44
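As a sanity check, eval perplexity is just the exponential of the eval loss:

```python
import math

final_eval_loss = 2.967
perplexity = math.exp(final_eval_loss)
# ≈ 19.43; the reported 19.44 matches up to rounding of the logged loss
```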

## Limitations

- Base model was pre-trained on TinyStories, so the model's world knowledge is limited to simple children's narrative patterns
- 1024-token context window limits conversation length
- Not safety-aligned (no RLHF/DPO)
- May produce repetitive or incoherent outputs for complex queries outside its training distribution
- Perplexity is higher than that of models pre-trained on broader web corpora -- this is expected given the narrow pre-training data

## License

MIT
