---
license: mit
language:
- en
library_name: transformers
tags:
- text-generation
- tiny-lm
- tinystories
- educational
- built-with-llama
pipeline_tag: text-generation
datasets:
- roneneldan/TinyStories
---

# TinyBuddy-30M

> ⚠️ **Educational / demo model.** TinyBuddy-30M is a from-scratch tiny GPT-style
> language model (~30M parameters) trained for ~12 minutes on a 2-core CPU.
> It is **not** a useful assistant — it is a working end-to-end demonstration
> of the LM training pipeline. See the [Limitations](#limitations) section.

## Model description

TinyBuddy-30M is a small decoder-only Transformer language model trained on a
slice of the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories)
dataset. The architecture is a standard pre-norm GPT-style stack
(LayerNorm + Causal Multi-Head Self-Attention + GELU MLP) inspired by the
LLaMA / GPT family of decoder-only models.

| Hyperparameter | Value |
| --- | --- |
| Parameters | **30,371,840** (~30.37M) |
| Layers | 6 |
| Attention heads | 8 |
| Embedding dim | 256 |
| MLP hidden dim | 1024 (mlp_ratio = 4) |
| Context length (`block_size`) | 128 |
| Vocab size | 50,000 (BPE; ~18k actually used) |
| Activation | GELU |
| Norm | LayerNorm (pre-norm) |
| Attention | Causal SDPA |
| Position embeddings | Learned absolute |
| Weight tying | No (separate LM head) |
| Precision | float32 |

Most of the parameter budget lives in the token embedding + LM head
(2 × 50,000 × 256 = 25.6M of ~30.4M, about 84%). An embedding-heavy budget
like this is typical for small LMs with a large vocabulary.
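
The headline count can be reproduced from the table with a few lines of
arithmetic. The sketch below assumes a layout this card does not spell out:
bias terms on every attention and MLP projection, a weight and bias on each
LayerNorm, a bias-free LM head, and a position table with 128 rows (the
`block_size` used in training). The function and argument names are
illustrative and are not taken from the modeling file.

```python
def count_params(vocab=50_000, d_model=256, n_layer=6, mlp_hidden=1024, n_pos=128):
    """Parameter count for a GPT-style stack under the assumed layout."""
    tok_emb = vocab * d_model          # token embedding table
    pos_emb = n_pos * d_model          # learned absolute positions (assumed 128 rows)
    layer_norm = 2 * d_model           # weight + bias
    # QKV projection plus output projection, each with a bias (assumption).
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Up- and down-projection of the MLP, each with a bias (assumption).
    mlp = (d_model * mlp_hidden + mlp_hidden) + (mlp_hidden * d_model + d_model)
    block = 2 * layer_norm + attn + mlp
    lm_head = d_model * vocab          # untied head, assumed bias-free
    total = tok_emb + pos_emb + n_layer * block + layer_norm + lm_head
    return {
        "embedding + head": tok_emb + lm_head,
        "blocks": n_layer * block,
        "total": total,
    }


counts = count_params()
print(counts)
print(f"embedding + head share: {counts['embedding + head'] / counts['total']:.1%}")
```

Under these assumptions the total is 30,371,840, which matches the table.
The number of attention heads does not appear because splitting the
projection into heads does not change the weight count.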

## Training details

- **Data**: ~22 MB slice of TinyStories (`TinyStoriesV2-GPT4-valid.txt`,
  27,630 short children's stories, ~5.3M BPE tokens after tokenization).
- **Tokenizer**: byte-level BPE trained from scratch on the same slice
  (saturated at ~18k merges; embedding padded to 50k to hit the 30M target).
- **Optimizer**: AdamW, β=(0.9, 0.95), weight_decay=0.1, grad clip 1.0.
- **Schedule**: cosine decay from 5e-4 → 5e-5 with 100-step linear warmup.
- **Batch**: `batch_size=4`, `block_size=128` (4 × 128 = 512 tokens / step).
- **Steps**: **1,500** (≈ 0.77M tokens seen, roughly **0.2% of one epoch**
  of full TinyStories).
- **Hardware**: 2 CPU cores, ~2 GB RAM, ~**12 minutes** wall time
  (≈16 min including evals).
- **Final loss**: **train ≈ 3.53 / val ≈ 3.43** (~3.5 averaged).
  Validation perplexity is exp(3.43) ≈ 31, well above the ≈ 4–5 that a
  properly trained TinyStories model of this size reaches.
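
The schedule in the list above can be written as a plain function of the
step number. This is a minimal sketch, not the training script: the name
`lr_at`, the exact warmup formula, and the choice to end the cosine at step
1,500 are assumptions. Only the peak and floor learning rates and the
100-step warmup come from this card.

```python
import math


def lr_at(step, max_steps=1500, warmup=100, lr_max=5e-4, lr_min=5e-5):
    """Linear warmup to lr_max, then cosine decay to lr_min at max_steps."""
    if step < warmup:
        # Ramp linearly; (step + 1) avoids a zero learning rate at step 0.
        return lr_max * (step + 1) / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, max_steps - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return lr_min + (lr_max - lr_min) * cosine


for step in (0, 50, 100, 800, 1500):
    print(f"step {step:4d} | lr {lr_at(step):.2e}")
```

In a PyTorch loop the returned value would typically be written into each
`param_group["lr"]` of the optimizer before the step is taken.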

Loss curve (training log):

```
step    0 | train 10.88 | val 10.88
step  150 | train  4.83 | val  4.68
step  300 | train  4.32 | val  4.28
step  600 | train  3.85 | val  3.90
step  900 | train  3.71 | val  3.77
step 1200 | train  3.57 | val  3.55
step 1500 | train  3.53 | val  3.43
```
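
To relate these numbers to the perplexity figures quoted in this card, the
losses can be exponentiated. The sketch assumes the logged loss is the mean
per-token cross-entropy in nats (the PyTorch default); the card does not
state the unit. It also prints the loss of a uniform guess over the 50,000
vocabulary entries, which is close to the value logged at step 0.

```python
import math

# (step, train loss, val loss) copied from the training log above.
log = [
    (0, 10.88, 10.88),
    (150, 4.83, 4.68),
    (300, 4.32, 4.28),
    (600, 3.85, 3.90),
    (900, 3.71, 3.77),
    (1200, 3.57, 3.55),
    (1500, 3.53, 3.43),
]


def perplexity(loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


# An untrained model that spreads probability evenly over the vocabulary
# has loss ln(vocab_size).
uniform_loss = math.log(50_000)
print(f"uniform-guess loss: {uniform_loss:.2f}")

for step, train, val in log:
    print(f"step {step:4d} | train ppl {perplexity(train):9.1f} | val ppl {perplexity(val):9.1f}")
```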

## Usage

This model uses **custom modeling code**, so you must pass
`trust_remote_code=True` when loading it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "YOUR_USERNAME/TinyBuddy-30M"   # or local path to this folder

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
model.eval()

prompt = "Once upon a time, there was a little girl named Lily."

# Encode once. A transformers tokenizer returns a list of ids; a raw
# `tokenizers.Tokenizer` returns an Encoding that exposes them as `.ids`.
encoded = tokenizer.encode(prompt)
ids = encoded.ids if hasattr(encoded, "ids") else encoded
input_ids = torch.tensor([ids])  # shape (1, seq_len)

# TinyBuddy ships a custom `.generate(...)` (top-k sampling). Use it directly:
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=120, temperature=0.8, top_k=50)
print(tokenizer.decode(out[0].tolist()))
```

If you prefer to bypass `transformers` entirely, you can use the raw
`tokenizers` library + the included modeling file:

```python
import json

import torch
from safetensors.torch import load_file
from tokenizers import Tokenizer

from modeling_tinybuddy import TinyGPT, GPTConfig  # run from the model folder

# Keep only the keys GPTConfig declares; config.json may carry extra keys.
with open("config.json") as f:
    raw_cfg = json.load(f)
cfg = GPTConfig(**{k: v for k, v in raw_cfg.items()
                   if k in GPTConfig.__dataclass_fields__})
model = TinyGPT(cfg)
model.load_state_dict(load_file("model.safetensors"))
model.eval()

tok = Tokenizer.from_file("tokenizer.json")
ids = tok.encode("Once upon a time").ids
out = model.generate(torch.tensor([ids]), max_new_tokens=80, temperature=0.8, top_k=50)
print(tok.decode(out[0].tolist()))
```
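
For readers unfamiliar with the `temperature` and `top_k` arguments used
above, one sampling step can be sketched in plain Python. This is an
illustration of the general technique, not the shipped `generate`
implementation, which operates on tensors and may differ in detail. The
function name `sample_top_k` and the toy logits are hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Pick one token id from `logits` using temperature + top-k sampling."""
    # Keep only the k highest-scoring token ids.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in top_ids]
    # Softmax numerator; subtracting the max keeps exp() numerically stable.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalises the weights before drawing.
    return rng.choices(top_ids, weights=weights, k=1)[0]


toy_logits = [0.1, 5.0, 4.0, -2.0]       # scores for a 4-token vocabulary
print(sample_top_k(toy_logits, top_k=2))  # only ids 1 and 2 can be drawn
```

A full generation loop would call the model on the current context, take the
logits at the last position, sample one id as above, append it, and repeat
for `max_new_tokens` steps.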

## Example outputs

**Prompt:** *"Once upon a time, there was a little girl named Lily."*

> Once upon a time, there was a little girl named Lily. They loved to play
> with their parents. One day, Tom went to the park. The sun loved the box
> and had many friends. One day, they went for a small tree, a lot of friends.
> He said, "What is better. But you want to find your friends, Bob?" …

**Prompt:** *"Tom and Sam were playing in the park when"*

> Tom and Sam were playing in the park when they were very much. Once upon a
> time, there was a girl named The cat with her mom. They had a little girl
> named Mia. She loved to play with her friends and play with her mom. …

## Limitations

**Be honest with yourself: this model is bad, and that is expected.**

What works ✅
- Vocabulary & register match TinyStories (short sentences, character names
  like Tim/Lily/Spot, motifs like "Once upon a time", "the park").
- Local grammar is mostly intact (subject–verb–object, quoted dialogue,
  punctuation).
- Document boundaries (`<|endoftext|>`) are respected.

What's broken ❌
- **No narrative coherence** across more than one or two sentences.
- **Character drift** — characters appear, vanish, or swap names mid-story.
- **Pronoun confusion** ("They" referring to a single girl).
- **Ungrammatical fragments** ("She found a very happy.").
- **Repetition loops** ("play with X. play with Y. play with Z.").
- **No factual knowledge, no reasoning, no instruction following.**

### Why

| Factor | This model | A good TinyStories-class model |
| --- | --- | --- |
| Tokens seen | ~0.77 M | ~10⁹+ |
| Hardware | 2 CPU cores | 1+ GPUs |
| Wall time | ~12 min | many hours |
| Final loss | ~3.5 | ~1.3–1.6 |
| Perplexity | ~30 | ~4–5 |

This is roughly **3–4 orders of magnitude less compute** than a serious
TinyStories training run. The architecture and pipeline are correct; only
the optimization budget is tiny.
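
The size of the gap can be checked from the figures in this card. The sketch
below compares tokens seen only, taking the table's "~10⁹+" as exactly 10⁹
(an assumption that makes the result a lower bound); it does not account for
the difference in hardware throughput.

```python
import math

# Training configuration from the "Training details" section.
batch_size, block_size, steps = 4, 128, 1500
tokens_seen = batch_size * block_size * steps

# Assumed reference budget: the lower end of the "~10^9+" table entry.
reference_tokens = 1e9

gap = math.log10(reference_tokens / tokens_seen)
print(f"tokens seen: {tokens_seen:,}")
print(f"token-budget gap: about {gap:.1f} orders of magnitude")
```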

### Intended use

- ✅ Educational reference for building / training / packaging a small LM.
- ✅ Sanity-checking a training pipeline.
- ✅ Demonstrating safetensors + Hugging Face Hub packaging.
- ❌ **Not** for any production, user-facing, or assistive use case.
- ❌ **Not** a source of factual information.
- ❌ **Not** safe for inputs from untrusted users (no safety training).

## Bias, risks, and safety

The training data is TinyStories: synthetic children's stories generated
by GPT-3.5/4. The model has not undergone safety training, RLHF, or
instruction tuning. It may produce nonsensical, biased, or repetitive
output, and should not be deployed in any setting where output quality or
safety matters.

## License

MIT.

## Citation

If you use this code or model in teaching materials, please cite as:

```
@misc{tinybuddy30m,
  title  = {TinyBuddy-30M: a from-scratch ~30M-parameter transformer trained on TinyStories},
  year   = {2026},
  note   = {Educational demonstration model.}
}
```

And please cite TinyStories:

```
@article{eldan2023tinystories,
  title   = {TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
  author  = {Eldan, Ronen and Li, Yuanzhi},
  journal = {arXiv preprint arXiv:2305.07759},
  year    = {2023}
}
```

## Built with Llama

This model's architecture is loosely inspired by the LLaMA family of
decoder-only transformer language models: it shares the pre-norm, causal
multi-head self-attention design. It differs in using LayerNorm, a GELU MLP,
and learned absolute position embeddings where LLaMA uses RMSNorm, SwiGLU,
and rotary position embeddings. The implementation is from-scratch PyTorch
and does not include any LLaMA weights.

**Built with Llama.**