# leo-storyteller-chat-1B

A 1.4B parameter Mixture-of-Experts (MoE) chat model, fine-tuned on multi-turn conversations from ShareGPT. Built on top of `check-ai-labs/leo-storyteller-base-1B`, a base model pre-trained from scratch on TinyStories.

This is a custom architecture trained entirely from scratch -- not a fine-tune of an existing open-source model. The base model and chat fine-tune were both trained by the author.

## Chat Format

This model uses the ChatML template:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```

Special tokens added for chat: `<|im_start|>` (id 50277) and `<|im_end|>` (id 50278).
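A prompt in this format can be assembled with a small helper. This is a minimal sketch; the helper name `build_chatml_prompt` is illustrative and not part of the released code:

```python
def build_chatml_prompt(messages):
    """Render a list of {"role", "content"} dicts into the ChatML format
    used by this model, leaving the prompt open for the assistant reply."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
```

The generated text should then be cut at the first `<|im_end|>` produced by the model.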

## Model Details

| Parameter | Value |
|---|---|
| Total Parameters | 1.4B |
| Architecture | Transformer + Sparse MoE |
| Hidden Size | 2048 |
| Layers | 16 |
| Attention Heads | 16 |
| Head Dim | 128 |
| Intermediate (FFN) Size | 5632 |
| Num Experts | 4 |
| Expert Interval | 2 (MoE every 2nd layer) |
| Context Length | 1024 tokens |
| Vocab Size | 50,279 (base 50,277 + 2 ChatML tokens) |
| Tokenizer | GPT-NeoX (`EleutherAI/gpt-neox-20b`) |
| Activation | GELU |
| Positional Encoding | RoPE |

## Architecture

This model uses a custom Mixture-of-Experts Transformer architecture:

- Sparse MoE layers placed every 2nd transformer block (8 of the 16 layers are MoE)
- Soft routing with load balancing via an auxiliary loss and z-loss regularization
- Rotary Positional Embeddings (RoPE) for position encoding
- 4 expert FFN networks per MoE layer, with a learned router selecting experts per token
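The expert placement and load-balancing idea above can be sketched in plain Python. This is a simplified illustration only: the real router operates on tensors, the exact layer indexing is an assumption, and the auxiliary loss here is a common simplified form (it is minimized at 1.0 when expert usage is uniform), not necessarily the exact loss used in training:

```python
import math

NUM_LAYERS, EXPERT_INTERVAL, NUM_EXPERTS = 16, 2, 4

# Layers whose FFN is a sparse MoE block (every 2nd layer; exact
# indexing is an assumption -- here: layers 1, 3, ..., 15).
moe_layers = [i for i in range(NUM_LAYERS) if i % EXPERT_INTERVAL == 1]

def router_probs(logits):
    """Softmax over a token's router logits (one logit per expert)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balance_loss(batch_probs):
    """Simplified load-balancing auxiliary loss: N * sum_e(mean_prob_e^2).
    Equals 1.0 when routing is perfectly uniform, grows as it collapses."""
    n = len(batch_probs)
    mean_p = [sum(p[e] for p in batch_probs) / n for e in range(NUM_EXPERTS)]
    return NUM_EXPERTS * sum(p * p for p in mean_p)
```

With this placement rule, 8 of the 16 layers carry expert FFNs, matching the table above.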

## Training

### Two-stage training pipeline

1. **Pre-training (base model):** `check-ai-labs/leo-storyteller-base-1B` -- trained from scratch on TinyStories (~2.1M stories, 460K chunks of 1024 tokens) for 3 epochs / 10,500 steps. Final eval loss: 1.249 (perplexity 3.49).

2. **Chat fine-tuning (this model):** fine-tuned on multi-turn conversations from ShareGPT_V3_unfiltered_cleaned_split, formatted with ChatML and with the loss masked to assistant turns only.
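Masking the loss to assistant turns is typically done by setting non-assistant label positions to PyTorch's cross-entropy ignore index. A minimal sketch; the helper name and per-token role annotations are illustrative, not the card's actual preprocessing code:

```python
IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss default ignore_index

def mask_labels(token_ids, roles):
    """Copy input ids to labels, but mask every token that is not part
    of an assistant turn so only assistant text contributes to the loss."""
    return [t if r == "assistant" else IGNORE_INDEX
            for t, r in zip(token_ids, roles)]

ids   = [11, 22, 33, 44, 55]
roles = ["system", "user", "assistant", "assistant", "user"]
labels = mask_labels(ids, roles)  # [-100, -100, 33, 44, -100]
```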

### Chat Fine-tuning Details

| Parameter | Value |
|---|---|
| Base Model | `check-ai-labs/leo-storyteller-base-1B` |
| Dataset | ShareGPT V3 unfiltered (94,145 conversations) |
| Training Examples | 84,481 (after filtering) |
| Eval Examples | 4,436 |
| Learning Rate | 2e-5 (cosine decay) |
| Batch Size (per device) | 4 |
| Gradient Accumulation | 4 |
| Effective Batch Size | 128 |
| Epochs | 3 |
| Total Steps | 1,800 (early stopped from 1,980) |
| Precision | FP16 (mixed precision) |
| Optimizer | AdamW |
| Loss Masking | Assistant turns only (~96% of tokens trained) |
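The effective batch size follows directly from the per-device batch, the gradient-accumulation steps, and the 8 GPUs listed under Infrastructure:

```python
per_device_batch = 4   # Batch Size (per device)
grad_accum_steps = 4   # Gradient Accumulation
num_gpus = 8           # data-parallel workers under FSDP

effective_batch = per_device_batch * grad_accum_steps * num_gpus  # 128
```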

### Infrastructure

- **GPUs:** 8x GPU with FSDP (Fully Sharded Data Parallel)
- **Training Time:** ~3.5 hours (fine-tuning only)
- **Framework:** PyTorch + Accelerate

### Training Curves

| Step | Eval Loss | Eval Perplexity |
|---|---|---|
| 200 | 4.153 | 63.66 |
| 400 | 3.666 | 39.10 |
| 800 | 3.230 | 25.29 |
| 1,200 | 3.045 | 21.02 |
| 1,600 | 2.976 | 19.60 |
| 1,800 | 2.967 | 19.44 |

**Final eval loss:** 2.967 | **Final eval perplexity:** 19.44
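As a sanity check, eval perplexity is just the exponential of the eval loss:

```python
import math

final_eval_loss = 2.967
perplexity = math.exp(final_eval_loss)
# ≈ 19.43; the reported 19.44 matches up to rounding of the logged loss
```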

## Limitations

- Base model was pre-trained on TinyStories, so the model's world knowledge is limited to simple children's narrative patterns
- 1024-token context window limits conversation length
- Not safety-aligned (no RLHF/DPO)
- May produce repetitive or incoherent outputs for complex queries outside its training distribution
- Perplexity is higher than that of models pre-trained on broader web corpora -- this is expected given the narrow pre-training data

## License

MIT
