---
license: mit
language:
- en
library_name: transformers
tags:
- text-generation
- tiny-lm
- tinystories
- educational
- built-with-llama
pipeline_tag: text-generation
datasets:
- roneneldan/TinyStories
---

# TinyBuddy-30M

> ⚠️ **Educational / demo model.** TinyBuddy-30M is a from-scratch tiny GPT-style
> language model (~30M parameters) trained for ~12 minutes on a 2-core CPU.
> It is **not** a useful assistant; it is a working end-to-end demonstration
> of the LM training pipeline. See the [Limitations](#limitations) section.

## Model description

TinyBuddy-30M is a small decoder-only Transformer language model trained on a
slice of the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories)
dataset. The architecture is a standard pre-norm GPT-style stack (LayerNorm +
Causal Multi-Head Self-Attention + GELU MLP) inspired by the LLaMA / GPT family
of decoder-only models.

| Hyperparameter | Value |
| --- | --- |
| Parameters | **30,371,840** (~30.37M) |
| Layers | 6 |
| Attention heads | 8 |
| Embedding dim | 256 |
| MLP hidden dim | 1024 (mlp_ratio = 4) |
| Context length (`block_size`) | 128 |
| Vocab size | 50,000 (BPE; ~18k actually used) |
| Activation | GELU |
| Norm | LayerNorm (pre-norm) |
| Attention | Causal SDPA |
| Position embeddings | Learned absolute |
| Weight tying | No (separate LM head) |
| Precision | float32 |

Most of the parameter budget lives in the token embedding + LM head
(2 × 50,000 × 256 = 25.6M of ~30.4M). This is typical for small LMs.

## Training details

- **Data**: ~22 MB slice of TinyStories (`TinyStoriesV2-GPT4-valid.txt`, 27,630 short children's stories, ~5.3M BPE tokens after tokenization).
- **Tokenizer**: byte-level BPE trained from scratch on the same slice (saturated at ~18k merges; embedding padded to 50k to hit the 30M target).
- **Optimizer**: AdamW, β=(0.9, 0.95), weight_decay=0.1, grad clip 1.0.
- **Schedule**: cosine decay from 5e-4 → 5e-5 with 100-step linear warmup.
- **Batch**: `batch_size=4`, `block_size=128` (≈ 512 tokens / step).
- **Steps**: **1,500** (≈ 0.77M tokens seen, roughly **0.2% of one epoch** of full TinyStories).
- **Hardware**: 2 CPU cores, ~2 GB RAM, ~**12 minutes** wall time (≈16 min including evals).
- **Final loss**: **train ≈ 3.53 / val ≈ 3.43** (~3.55 averaged). Perplexity ≈ 30, well above the ≈ 4–5 a properly-trained TinyStories model of this size reaches.

Loss curve (training log):

```
step    0 | train 10.88 | val 10.88
step  150 | train  4.83 | val  4.68
step  300 | train  4.32 | val  4.28
step  600 | train  3.85 | val  3.90
step  900 | train  3.71 | val  3.77
step 1200 | train  3.57 | val  3.55
step 1500 | train  3.53 | val  3.43
```

## Usage

This model uses **custom modeling code**, so you must pass
`trust_remote_code=True` when loading it.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "YOUR_USERNAME/TinyBuddy-30M"  # or local path to this folder
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
model.eval()

prompt = "Once upon a time, there was a little girl named Lily."
# `encode` may return a plain list of ids or an object exposing `.ids`.
encoded = tokenizer.encode(prompt)
ids = encoded.ids if hasattr(encoded, "ids") else encoded
input_ids = torch.tensor([ids])

# TinyBuddy ships a custom `.generate(...)` (top-k sampling).
# Use it directly:
out = model.generate(input_ids, max_new_tokens=120, temperature=0.8, top_k=50)
print(tokenizer.decode(out[0].tolist()))
```

If you prefer to bypass `transformers` entirely, you can use the raw
`tokenizers` library + the included modeling file:

```python
import json

import torch
from tokenizers import Tokenizer
from safetensors.torch import load_file
from modeling_tinybuddy import TinyGPT, GPTConfig

# Keep only the config keys that GPTConfig declares as dataclass fields.
with open("config.json") as f:
    raw_cfg = json.load(f)
cfg = GPTConfig(**{k: v for k, v in raw_cfg.items()
                   if k in GPTConfig.__dataclass_fields__})

model = TinyGPT(cfg)
model.load_state_dict(load_file("model.safetensors"))
model.eval()

tok = Tokenizer.from_file("tokenizer.json")
ids = tok.encode("Once upon a time").ids
out = model.generate(torch.tensor([ids]), max_new_tokens=80, temperature=0.8, top_k=50)
print(tok.decode(out[0].tolist()))
```

## Example outputs

**Prompt:** *"Once upon a time, there was a little girl named Lily."*

> Once upon a time, there was a little girl named Lily. They loved to play
> with their parents. One day, Tom went to the park. The sun loved the box
> and had many friends. One day, they went for a small tree, a lot of friends.
> He said, "What is better. But you want to find your friends, Bob?" …

**Prompt:** *"Tom and Sam were playing in the park when"*

> Tom and Sam were playing in the park when they were very much. Once upon a
> time, there was a girl named The cat with her mom. They had a little girl
> named Mia. She loved to play with her friends and play with her mom. …

## Limitations

**Be honest with yourself: this model is bad, and that is expected.**

What works ✅

- Vocabulary & register match TinyStories (short sentences, character names like Tim/Lily/Spot, motifs like "Once upon a time", "the park").
- Local grammar is mostly intact (subject–verb–object, quoted dialogue, punctuation).
- Document boundaries (`<|endoftext|>`) are respected.

What's broken ❌

- **No narrative coherence** across more than one or two sentences.
- **Character drift**: characters appear, vanish, or swap names mid-story.
- **Pronoun confusion** ("They" referring to a single girl).
- **Ungrammatical fragments** ("She found a very happy.").
- **Repetition loops** ("play with X. play with Y. play with Z.").
- **No factual knowledge, no reasoning, no instruction following.**

### Why

| Factor | This model | A good TinyStories-class model |
| --- | --- | --- |
| Tokens seen | ~0.77M | ~10⁹+ |
| Hardware | 2 CPU cores | 1+ GPUs |
| Wall time | ~12 min | many hours |
| Final loss | ~3.5 | ~1.3–1.6 |
| Perplexity | ~30 | ~4–5 |

This is roughly **3–4 orders of magnitude less compute** than a serious
TinyStories training run. The architecture and pipeline are correct; only the
optimization budget is tiny.

### Intended use

- ✅ Educational reference for building / training / packaging a small LM.
- ✅ Sanity-checking a training pipeline.
- ✅ Demonstrating safetensors + Hugging Face Hub packaging.
- ❌ **Not** for any production, user-facing, or assistive use case.
- ❌ **Not** a source of factual information.
- ❌ **Not** safe for inputs from untrusted users (no safety training).

## Bias, risks, and safety

The training data is TinyStories: synthetic children's stories generated by
GPT-3.5/4. The model has not undergone any safety, RLHF, or instruction-tuning
step. It may produce nonsensical, biased, or repetitive output, and should not
be deployed in any setting where output quality or safety matters.

## License

MIT.

## Citation

If you use this code or model in teaching materials, please cite it as:

```
@misc{tinybuddy30m,
  title = {TinyBuddy-30M: a from-scratch ~30M-parameter transformer trained on TinyStories},
  year  = {2026},
  note  = {Educational demonstration model.}
}
```

And please cite TinyStories:

```
@article{eldan2023tinystories,
  title   = {TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
  author  = {Eldan, Ronen and Li, Yuanzhi},
  journal = {arXiv preprint arXiv:2305.07759},
  year    = {2023}
}
```

## Built with Llama

This model's architecture is inspired by the LLaMA family of decoder-only
transformer language models (pre-norm, causal multi-head self-attention, GELU
MLP). The implementation is from-scratch PyTorch and does not include any LLaMA
weights, but follows the same overall design pattern.

**Built with Llama.**
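
## Appendix: checking the headline numbers

The figures quoted in this card (parameter count, embedding share, tokens seen, perplexity) can be re-derived from the hyperparameter table and the training settings. The sketch below does that in plain Python. The per-layer layout is an assumption and is not read from `modeling_tinybuddy.py`: GPT-2-style biases on every linear layer except the LM head, LayerNorm with weight and bias, one final LayerNorm, and a learned position table with one row per training `block_size` position (128).

```python
import math

# Hyperparameters copied from the tables and training details in this card.
vocab_size = 50_000
n_embd = 256
n_layer = 6
mlp_hidden = 1024
block_size = 128  # assumption: position table sized to the training block_size


def linear(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer (weights plus optional bias)."""
    return n_in * n_out + (n_out if bias else 0)


layer_norm = 2 * n_embd  # assumed: LayerNorm carries a weight and a bias

# Assumed layout of one pre-norm block: LN -> attention -> LN -> MLP.
per_block = (
    layer_norm
    + linear(n_embd, 3 * n_embd)   # fused q, k, v projection
    + linear(n_embd, n_embd)       # attention output projection
    + layer_norm
    + linear(n_embd, mlp_hidden)   # MLP up-projection
    + linear(mlp_hidden, n_embd)   # MLP down-projection
)

token_embedding = vocab_size * n_embd
position_embedding = block_size * n_embd
lm_head = linear(n_embd, vocab_size, bias=False)  # untied head, assumed bias-free

total_params = (
    token_embedding
    + position_embedding
    + n_layer * per_block
    + layer_norm                   # final LayerNorm before the LM head
    + lm_head
)

# Training budget and perplexity from the reported run.
tokens_per_step = 4 * 128          # batch_size * block_size
tokens_seen = tokens_per_step * 1_500
val_perplexity = math.exp(3.43)    # perplexity = exp(cross-entropy in nats)

print(f"embedding + LM head: {token_embedding + lm_head:,}")
print(f"total parameters:    {total_params:,}")
print(f"tokens seen:         {tokens_seen:,}")
print(f"val perplexity:      {val_perplexity:.1f}")
```

Under these assumptions the total comes to 30,371,840, which matches the parameter count reported in the hyperparameter table. A different choice of biases or position-table size would change the total, so treat the match as a consistency check and not as a description of the actual modeling code.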