---
license: mit
language:
- en
library_name: transformers
tags:
- text-generation
- tiny-lm
- tinystories
- educational
- built-with-llama
pipeline_tag: text-generation
datasets:
- roneneldan/TinyStories
---
# TinyBuddy-30M
> ⚠️ **Educational / demo model.** TinyBuddy-30M is a from-scratch tiny GPT-style
> language model (~30M parameters) trained for ~12 minutes on a 2-core CPU.
> It is **not** a useful assistant — it is a working end-to-end demonstration
> of the LM training pipeline. See the [Limitations](#limitations) section.
## Model description
TinyBuddy-30M is a small decoder-only Transformer language model trained on a
slice of the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories)
dataset. The architecture is a standard pre-norm GPT-style stack
(LayerNorm + Causal Multi-Head Self-Attention + GELU MLP) inspired by the
LLaMA / GPT family of decoder-only models.
| Hyperparameter | Value |
| --- | --- |
| Parameters | **30,371,840** (~30.37M) |
| Layers | 6 |
| Attention heads | 8 |
| Embedding dim | 256 |
| MLP hidden dim | 1024 (mlp_ratio = 4) |
| Context length (`block_size`) | 128 |
| Vocab size | 50,000 (BPE; ~18k actually used) |
| Activation | GELU |
| Norm | LayerNorm (pre-norm) |
| Attention | Causal SDPA |
| Position embeddings | Learned absolute |
| Weight tying | No (separate LM head) |
| Precision | float32 |

Most of the parameter budget lives in the token embedding + LM head
(~25.6M of ~30.4M, about 84%). This is typical for small LMs with a large
vocabulary and untied embeddings.
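
The split can be reproduced by hand. The sketch below is an assumption about the layer layout, not a reading of the shipped modeling file: it assumes biases on every linear layer except the LM head, LayerNorm with weight and bias, one final LayerNorm, and a 128-row position table (the `block_size` used in training). The function and variable names are illustrative.

```python
def count_params(vocab=50_000, d_model=256, n_layer=6,
                 mlp_hidden=1024, block_size=128):
    """Count parameters of a GPT-2-style stack (layout is an assumption)."""
    tok_emb = vocab * d_model            # token embedding table
    pos_emb = block_size * d_model       # learned absolute positions
    lm_head = vocab * d_model            # untied output projection, no bias

    # Attention: fused QKV projection + output projection, both with bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: up-projection + down-projection, both with bias.
    mlp = (d_model * mlp_hidden + mlp_hidden) + (mlp_hidden * d_model + d_model)
    # Two LayerNorms per block, each with weight and bias.
    norms = 2 * (2 * d_model)
    per_layer = attn + mlp + norms

    final_norm = 2 * d_model
    total = tok_emb + pos_emb + n_layer * per_layer + final_norm + lm_head
    return {
        "tok_emb": tok_emb,
        "pos_emb": pos_emb,
        "blocks": n_layer * per_layer,
        "final_norm": final_norm,
        "lm_head": lm_head,
        "total": total,
    }


breakdown = count_params()
for name, n in breakdown.items():
    print(f"{name:>10}: {n:>12,}")
```

Under these assumptions the total comes out to 30,371,840, matching the table, with 25,600,000 of it in the two vocabulary-sized matrices.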
## Training details
- **Data**: ~22 MB slice of TinyStories (`TinyStoriesV2-GPT4-valid.txt`,
27,630 short children's stories, ~5.3M BPE tokens after tokenization).
- **Tokenizer**: byte-level BPE trained from scratch on the same slice
(saturated at ~18k merges; embedding padded to 50k to hit the 30M target).
- **Optimizer**: AdamW, β=(0.9, 0.95), weight_decay=0.1, grad clip 1.0.
- **Schedule**: cosine decay from 5e-4 → 5e-5 with 100-step linear warmup.
- **Batch**: `batch_size=4`, `block_size=128` (4 × 128 = 512 tokens / step).
- **Steps**: **1,500** (≈ 0.77M tokens seen — roughly **0.2% of one epoch**
of full TinyStories).
- **Hardware**: 2 CPU cores, ~2 GB RAM, ~**12 minutes** wall time
(≈16 min including evals).
- **Final loss**: **train ≈ 3.53 / val ≈ 3.43** (mean of the two ≈ 3.48).
Perplexity ≈ 30 (e^3.43 ≈ 31 on the validation split), well above the
≈ 4–5 a properly trained TinyStories model of this size reaches.
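
The schedule described above can be sketched as a plain function of the step index. This is a minimal sketch, not the training script: the name `lr_at` and the exact warmup shape (ramping from `lr_max / warmup` up to `lr_max` over the first 100 steps) are assumptions, and only the endpoints 5e-4 and 5e-5, the 100-step warmup, and the 1,500-step length come from the list above.

```python
import math


def lr_at(step, max_steps=1500, warmup=100, lr_max=5e-4, lr_min=5e-5):
    """Learning rate at `step`: linear warmup, then cosine decay to lr_min."""
    if step < warmup:
        # Linear ramp; reaches lr_max on the last warmup step (assumed shape).
        return lr_max * (step + 1) / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, max_steps - warmup))
    # Cosine interpolation from lr_max (progress=0) to lr_min (progress=1).
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for s in (0, 50, 100, 800, 1500):
    print(f"step {s:>4} | lr {lr_at(s):.2e}")
```

In a PyTorch loop, a function like this would typically be applied by writing its value into each `param_group["lr"]` of the AdamW optimizer before every step.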

Loss curve (training log):
```
step 0 | train 10.88 | val 10.88
step 150 | train 4.83 | val 4.68
step 300 | train 4.32 | val 4.28
step 600 | train 3.85 | val 3.90
step 900 | train 3.71 | val 3.77
step 1200 | train 3.57 | val 3.55
step 1500 | train 3.53 | val 3.43
```
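
Cross-entropy loss measured in nats converts to perplexity via `exp(loss)`, and a model that guesses uniformly over a 50,000-token vocabulary has loss `ln(50000) ≈ 10.82`, close to the step-0 value in the log. A short sketch using the logged validation losses (the list is copied from the log above; the variable names are illustrative, and the nats assumption is the PyTorch default rather than something stated in this card):

```python
import math

# (step, val_loss) pairs copied from the training log above.
val_log = [(0, 10.88), (150, 4.68), (300, 4.28), (600, 3.90),
           (900, 3.77), (1200, 3.55), (1500, 3.43)]

# Loss of a uniform guess over the 50,000-entry vocabulary, in nats.
uniform_loss = math.log(50_000)

# Perplexity is exp(cross-entropy) when the loss is measured in nats.
val_ppl = {step: math.exp(loss) for step, loss in val_log}

print(f"uniform-guess loss: {uniform_loss:.2f}")
for step, ppl in val_ppl.items():
    print(f"step {step:>4} | val perplexity {ppl:10.1f}")
```

The final validation loss of 3.43 corresponds to a perplexity of about 31.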
## Usage
This model uses **custom modeling code**, so you must pass
`trust_remote_code=True` when loading it.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "YOUR_USERNAME/TinyBuddy-30M"  # or local path to this folder
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
model.eval()

prompt = "Once upon a time, there was a little girl named Lily."
# Encode once; accept either a plain list of ids or a `tokenizers.Encoding`.
encoded = tokenizer.encode(prompt)
input_ids = torch.tensor([encoded.ids if hasattr(encoded, "ids") else encoded])

# TinyBuddy ships a custom `.generate(...)` (top-k sampling). Use it directly:
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=120, temperature=0.8, top_k=50)
print(tokenizer.decode(out[0].tolist()))
```
If you prefer to bypass `transformers` entirely, you can use the raw
`tokenizers` library + the included modeling file:
```python
import json

import torch
from safetensors.torch import load_file
from tokenizers import Tokenizer

from modeling_tinybuddy import TinyGPT, GPTConfig

# Keep only the config keys that GPTConfig (a dataclass) actually defines.
with open("config.json") as f:
    raw_cfg = json.load(f)
cfg = GPTConfig(**{k: v for k, v in raw_cfg.items()
                   if k in GPTConfig.__dataclass_fields__})

model = TinyGPT(cfg)
model.load_state_dict(load_file("model.safetensors"))
model.eval()

tok = Tokenizer.from_file("tokenizer.json")
ids = tok.encode("Once upon a time").ids
out = model.generate(torch.tensor([ids]), max_new_tokens=80, temperature=0.8, top_k=50)
print(tok.decode(out[0].tolist()))
```
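
For readers unfamiliar with the `temperature` and `top_k` arguments used above, here is a dependency-free sketch of one top-k sampling step. It illustrates the general technique only; the shipped `generate` may differ in details, and `sample_top_k` is a hypothetical helper, not part of this repository.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token id from `logits` with temperature and top-k filtering.

    `temperature` must be > 0. `rng` is anything with a `choices` method.
    """
    k = min(top_k, len(logits))
    # Indices of the k largest logits; every other token gets probability 0.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Softmax over the surviving logits, shifted by the max for stability.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 5.0, 0.5]
print(sample_top_k(toy_logits, temperature=0.8, top_k=2))
```

With `top_k=1` this reduces to greedy decoding; a full generation loop would append the sampled id to the input and repeat, cropping the input to the model's context length.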
## Example outputs
**Prompt:** *"Once upon a time, there was a little girl named Lily."*
> Once upon a time, there was a little girl named Lily. They loved to play
> with their parents. One day, Tom went to the park. The sun loved the box
> and had many friends. One day, they went for a small tree, a lot of friends.
> He said, "What is better. But you want to find your friends, Bob?" …
**Prompt:** *"Tom and Sam were playing in the park when"*
> Tom and Sam were playing in the park when they were very much. Once upon a
> time, there was a girl named The cat with her mom. They had a little girl
> named Mia. She loved to play with her friends and play with her mom. …
## Limitations
**Be honest with yourself: this model is bad, and that is expected.**

What works ✅
- Vocabulary & register match TinyStories (short sentences, character names
like Tim/Lily/Spot, motifs like "Once upon a time", "the park").
- Local grammar is mostly intact (subject–verb–object, quoted dialogue,
punctuation).
- Document boundaries (`<|endoftext|>`) are respected.

What's broken ❌
- **No narrative coherence** across more than one or two sentences.
- **Character drift** — characters appear, vanish, or swap names mid-story.
- **Pronoun confusion** ("They" referring to a single girl).
- **Ungrammatical fragments** ("She found a very happy.").
- **Repetition loops** ("play with X. play with Y. play with Z.").
- **No factual knowledge, no reasoning, no instruction following.**
### Why
| Factor | This model | A good TinyStories-class model |
| --- | --- | --- |
| Tokens seen | ~0.77 M | ~10⁹+ |
| Hardware | 2 CPU cores | 1+ GPUs |
| Wall time | ~12 min | many hours |
| Final loss | ~3.5 | ~1.3–1.6 |
| Perplexity | ~30 | ~4–5 |

This is roughly **3–4 orders of magnitude less compute** than a serious
TinyStories training run. The architecture and pipeline are correct; only
the optimization budget is tiny.
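
The token-count part of that gap can be checked with quick arithmetic. The sketch below uses the training settings from this card and takes 10⁹ tokens as the reference figure from the table; it counts tokens only and does not try to estimate the hardware difference, so it is a lower bound on the compute gap rather than a measurement.

```python
import math

# Tokens seen by this model: steps * batch_size * block_size.
tokens_this_model = 1_500 * 4 * 128

# Reference budget from the comparison table (order-of-magnitude figure).
tokens_reference = 1e9

ratio = tokens_reference / tokens_this_model
orders = math.log10(ratio)

print(f"tokens seen: {tokens_this_model:,}")
print(f"ratio: {ratio:,.0f}x  (~{orders:.1f} orders of magnitude)")
```

Tokens alone account for a bit over three orders of magnitude.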
### Intended use
- ✅ Educational reference for building / training / packaging a small LM.
- ✅ Sanity-checking a training pipeline.
- ✅ Demonstrating safetensors + Hugging Face Hub packaging.
- ❌ **Not** for any production, user-facing, or assistive use case.
- ❌ **Not** a source of factual information.
- ❌ **Not** safe for inputs from untrusted users (no safety training).
## Bias, risks, and safety
The training data is TinyStories — synthetic children's stories generated
by GPT-3.5/4. The model has not undergone any safety, RLHF, or
instruction-tuning step. It may produce nonsensical, biased, or repetitive
output, and should not be deployed in any setting where output quality or
safety matters.
## License
MIT.
## Citation
If you use this code or model in teaching materials, please cite as:
```
@misc{tinybuddy30m,
title = {TinyBuddy-30M: a from-scratch ~30M-parameter transformer trained on TinyStories},
year = {2026},
note = {Educational demonstration model.}
}
```
And please cite TinyStories:
```
@article{eldan2023tinystories,
title = {TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
author = {Eldan, Ronen and Li, Yuanzhi},
journal = {arXiv preprint arXiv:2305.07759},
year = {2023}
}
```
## Built with Llama
This model's architecture is inspired by the LLaMA family of decoder-only
transformer language models and shares their pre-norm, causal multi-head
self-attention layout. Unlike LLaMA, it uses LayerNorm, a GELU MLP, and
learned absolute position embeddings rather than RMSNorm, SwiGLU, and rotary
position embeddings. The implementation is from-scratch PyTorch and does not
include any LLaMA weights.
**Built with Llama.**