	---
	license: mit
	language:
	- en
	library_name: transformers
	tags:
	- text-generation
	- tiny-lm
	- tinystories
	- educational
	- built-with-llama
	pipeline_tag: text-generation
	datasets:
	- roneneldan/TinyStories
	---

	# TinyBuddy-30M

	> ⚠️ Educational / demo model. TinyBuddy-30M is a from-scratch tiny GPT-style
	> language model (~30M parameters) trained for ~12 minutes on a 2-core CPU.
	> It is not a useful assistant — it is a working end-to-end demonstration
	> of the LM training pipeline. See the [Limitations](#limitations) section.

	## Model description

	TinyBuddy-30M is a small decoder-only Transformer language model trained on a
	slice of the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories)
	dataset. The architecture is a standard pre-norm GPT-style stack
	(LayerNorm + Causal Multi-Head Self-Attention + GELU MLP) inspired by the
	LLaMA / GPT family of decoder-only models.

	| Hyperparameter | Value |
	| --- | --- |
	| Parameters | 30,371,840 (~30.37M) |
	| Layers | 6 |
	| Attention heads | 8 |
	| Embedding dim | 256 |
	| MLP hidden dim | 1024 (mlp_ratio = 4) |
	| Context length (`block_size`) | 128 |
	| Vocab size | 50,000 (BPE; ~18k actually used) |
	| Activation | GELU |
	| Norm | LayerNorm (pre-norm) |
	| Attention | Causal SDPA |
	| Position embeddings | Learned absolute |
	| Weight tying | No (separate LM head) |
	| Precision | float32 |

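	Most of the parameter budget lives in the token embedding + LM head
	(2 × 50,000 × 256 = 25.6M of 30.37M, about 84%). This is typical for small
	LMs with a large vocabulary: embedding parameters scale with vocab size ×
	width, while the Transformer blocks scale with layers × width².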
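
The split can be checked with a short parameter count. This is a sketch, not the repository's code: it assumes biases on every linear layer and LayerNorm, no bias on the LM head, a final LayerNorm, and a position table sized for the 128-token `block_size` listed under Training details. Under those assumptions the total agrees with the parameter count in the table.

```python
def count_params(vocab=50_000, dim=256, layers=6, mlp_hidden=1024, block_size=128):
    """Count parameters of a GPT-style stack (assumed layout, see text above)."""
    tok_emb = vocab * dim                     # token embedding table
    pos_emb = block_size * dim                # learned absolute position table
    attn = 4 * (dim * dim + dim)              # Q, K, V, output projections with biases
    mlp = (dim * mlp_hidden + mlp_hidden) + (mlp_hidden * dim + dim)
    norms = 2 * (2 * dim)                     # two LayerNorms per block (weight + bias)
    block = attn + mlp + norms
    final_norm = 2 * dim                      # assumed final LayerNorm
    lm_head = vocab * dim                     # untied output projection, assumed bias-free
    return {
        "embedding + head": tok_emb + lm_head,
        "positions": pos_emb,
        "blocks": layers * block,
        "final norm": final_norm,
    }

parts = count_params()
total = sum(parts.values())
for name, n in parts.items():
    print(f"{name:>18}: {n:>12,} ({n / total:.1%})")
print(f"{'total':>18}: {total:>12,}")
```

The six Transformer blocks together account for under 5M parameters, which is why shrinking the vocabulary would reduce the model size far more than removing a layer.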

	## Training details

	- Data: ~22 MB slice of TinyStories (`TinyStoriesV2-GPT4-valid.txt`,
	27,630 short children's stories, ~5.3M BPE tokens after tokenization).
	- Tokenizer: byte-level BPE trained from scratch on the same slice
	(saturated at ~18k merges; embedding padded to 50k to hit the 30M target).
	- Optimizer: AdamW, β=(0.9, 0.95), weight_decay=0.1, grad clip 1.0.
	- Schedule: cosine decay from 5e-4 → 5e-5 with 100-step linear warmup.
	- Batch: `batch_size=4`, `block_size=128` (4 × 128 = 512 tokens / step).
	- Steps: 1,500 (≈ 0.77M tokens seen — roughly 0.2% of one epoch
	of full TinyStories).
	- Hardware: 2 CPU cores, ~2 GB RAM, ~12 minutes wall time
	(≈16 min including evals).
	- Final loss: train ≈ 3.53 / val ≈ 3.43 (mean of the two ≈ 3.48).
	Validation perplexity ≈ exp(3.43) ≈ 31, well above the ≈ 4–5 a properly
	trained TinyStories model of this size reaches.
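
The learning-rate schedule above can be sketched as a plain function of the step index. The exact warmup form is an assumption (linear from near zero up to the peak over the first 100 steps), and `lr_at` is a hypothetical helper name rather than something taken from the training script.

```python
import math

def lr_at(step, max_lr=5e-4, min_lr=5e-5, warmup_steps=100, total_steps=1500):
    """Linear warmup to max_lr, then cosine decay to min_lr (assumed form)."""
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 50, 100, 800, 1500):
    print(f"step {s:>4}: lr = {lr_at(s):.2e}")
```

In a PyTorch loop, a function like this would typically be applied by writing its value into each optimizer parameter group's `lr` before the optimizer step.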

	Loss curve (training log):

	```
	step 0 | train 10.88 | val 10.88
	step 150 | train 4.83 | val 4.68
	step 300 | train 4.32 | val 4.28
	step 600 | train 3.85 | val 3.90
	step 900 | train 3.71 | val 3.77
	step 1200 | train 3.57 | val 3.55
	step 1500 | train 3.53 | val 3.43
```
	```
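
Cross-entropy loss (in nats) converts to perplexity via `exp(loss)`. The sketch below parses the log above and prints the validation perplexity per checkpoint; `parse_log` is a hypothetical helper written for this illustration, and the log text is copied in so the snippet is self-contained.

```python
import math

LOG = """\
step 0 | train 10.88 | val 10.88
step 150 | train 4.83 | val 4.68
step 300 | train 4.32 | val 4.28
step 600 | train 3.85 | val 3.90
step 900 | train 3.71 | val 3.77
step 1200 | train 3.57 | val 3.55
step 1500 | train 3.53 | val 3.43
"""

def parse_log(text):
    """Return a list of (step, train_loss, val_loss) tuples."""
    rows = []
    for line in text.strip().splitlines():
        step_part, train_part, val_part = (p.strip() for p in line.split("|"))
        rows.append((int(step_part.split()[1]),
                     float(train_part.split()[1]),
                     float(val_part.split()[1])))
    return rows

rows = parse_log(LOG)
for step, train, val in rows:
    print(f"step {step:>4} | val loss {val:5.2f} | val perplexity {math.exp(val):9.1f}")

# A uniform guess over a 50,000-entry vocabulary has loss ln(50000),
# so a step-0 loss near that value is what an untrained model should show.
print(f"ln(vocab) = {math.log(50_000):.2f}")
```

Comparing the step-0 loss against `ln(vocab_size)` is a cheap sanity check that the model is initialized sensibly and the loss is computed over the full vocabulary.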

	## Usage

	This model uses custom modeling code, so you must pass
	`trust_remote_code=True` when loading it.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	repo = "YOUR_USERNAME/TinyBuddy-30M" # or local path to this folder

	tokenizer = AutoTokenizer.from_pretrained(repo)
	model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
	model.eval()

	prompt = "Once upon a time, there was a little girl named Lily."
	# AutoTokenizer.encode returns token ids; request a (1, seq_len) tensor directly.
	input_ids = tokenizer.encode(prompt, return_tensors="pt")

	# TinyBuddy ships a custom `.generate(...)` (top-k sampling). Use it directly:
	with torch.no_grad():
	    out = model.generate(input_ids, max_new_tokens=120, temperature=0.8, top_k=50)
	print(tokenizer.decode(out[0].tolist()))
	```

	If you prefer to bypass `transformers` entirely, you can use the raw
	`tokenizers` library + the included modeling file:

	```python
	from tokenizers import Tokenizer
	from safetensors.torch import load_file
	from modeling_tinybuddy import TinyGPT, GPTConfig
	import json
	import torch

	# Build the config from config.json, keeping only fields GPTConfig defines.
	with open("config.json") as f:
	    raw_cfg = json.load(f)
	cfg = GPTConfig(**{k: v for k, v in raw_cfg.items()
	                   if k in GPTConfig.__dataclass_fields__})
	model = TinyGPT(cfg)
	model.load_state_dict(load_file("model.safetensors"))
	model.eval()

	tok = Tokenizer.from_file("tokenizer.json")
	ids = tok.encode("Once upon a time").ids
	with torch.no_grad():
	    out = model.generate(torch.tensor([ids]), max_new_tokens=80, temperature=0.8, top_k=50)
	print(tok.decode(out[0].tolist()))
	```
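
The `temperature` and `top_k` arguments used above control how each next token is drawn. The dependency-free sketch below illustrates temperature-scaled top-k sampling on a plain list of logits. It shows the general technique only and is not the code in `modeling_tinybuddy.py`; `sample_top_k` is a hypothetical helper name.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Draw one token id from the top_k highest logits after temperature scaling."""
    # Keep only the indices of the top_k largest logits.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax weights over the surviving logits.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
draws = [sample_top_k(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(1000)]
# With top_k=2, only the two highest-scoring ids (0 and 1) can ever be drawn.
print({i: draws.count(i) for i in sorted(set(draws))})
```

Lower `top_k` or `temperature` values make output more repetitive but less erratic, which is a visible trade-off with a model this undertrained.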

	## Example outputs

	Prompt: "Once upon a time, there was a little girl named Lily."

	> Once upon a time, there was a little girl named Lily. They loved to play
	> with their parents. One day, Tom went to the park. The sun loved the box
	> and had many friends. One day, they went for a small tree, a lot of friends.
	> He said, "What is better. But you want to find your friends, Bob?" …

	Prompt: "Tom and Sam were playing in the park when"

	> Tom and Sam were playing in the park when they were very much. Once upon a
	> time, there was a girl named The cat with her mom. They had a little girl
	> named Mia. She loved to play with her friends and play with her mom. …

	## Limitations

	Be honest with yourself: this model is bad, and that is expected.

	What works ✅
	- Vocabulary & register match TinyStories (short sentences, character names
	like Tim/Lily/Spot, motifs like "Once upon a time", "the park").
	- Local grammar is mostly intact (subject–verb–object, quoted dialogue,
	punctuation).
	- Document boundaries (`<|endoftext|>`) are respected.

	What's broken ❌
	- No narrative coherence across more than one or two sentences.
	- Character drift — characters appear, vanish, or swap names mid-story.
	- Pronoun confusion ("They" referring to a single girl).
	- Ungrammatical fragments ("She found a very happy.").
	- Repetition loops ("play with X. play with Y. play with Z.").
	- No factual knowledge, no reasoning, no instruction following.

	### Why

	| Factor | This model | A good TinyStories-class model |
	| --- | --- | --- |
	| Tokens seen | ~0.77 M | ~10⁹+ |
	| Hardware | 2 CPU cores | 1+ GPUs |
	| Wall time | ~12 min | many hours |
	| Final loss | ~3.5 | ~1.3–1.6 |
	| Perplexity | ~30 | ~4–5 |

	This is roughly 3–4 orders of magnitude less compute than a serious
	TinyStories training run. The architecture and pipeline are correct; only
	the optimization budget is tiny.
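
The token-budget gap can be reproduced from the training settings. The sketch below compares tokens seen only, using 10⁹ as the lower bound quoted in the table; treating token count as a proxy for compute is an assumption, and a reference run that sees more than 10⁹ tokens pushes the gap toward the upper end of the range.

```python
import math

batch_size, block_size, steps = 4, 128, 1500
tokens_seen = batch_size * block_size * steps   # tokens processed in this run
reference_tokens = 1e9                          # lower bound quoted in the table

ratio = reference_tokens / tokens_seen
print(f"tokens seen      : {tokens_seen:,}")
print(f"reference / ours : {ratio:,.0f}x")
print(f"orders of magnitude (tokens only): {math.log10(ratio):.1f}")
```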

	### Intended use

	- ✅ Educational reference for building / training / packaging a small LM.
	- ✅ Sanity-checking a training pipeline.
	- ✅ Demonstrating safetensors + Hugging Face Hub packaging.
	- ❌ Not for any production, user-facing, or assistive use case.
	- ❌ Not a source of factual information.
	- ❌ Not safe for inputs from untrusted users (no safety training).

	## Bias, risks, and safety

	The training data is TinyStories — synthetic children's stories generated
	by GPT-3.5/4. The model has not undergone any safety, RLHF, or
	instruction-tuning step. It may produce nonsensical, biased, or repetitive
	output, and should not be deployed in any setting where output quality or
	safety matters.

	## License

	MIT.

	## Citation

	If you use this code or model in teaching materials, please cite as:

	```
	@misc{tinybuddy30m,
	title = {TinyBuddy-30M: a from-scratch ~30M-parameter transformer trained on TinyStories},
	year = {2026},
	note = {Educational demonstration model.}
	}
	```

	And please cite TinyStories:

	```
	@article{eldan2023tinystories,
	title = {TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
	author = {Eldan, Ronen and Li, Yuanzhi},
	journal = {arXiv preprint arXiv:2305.07759},
	year = {2023}
	}
	```

	## Built with Llama

	This model's architecture is inspired by the LLaMA family of decoder-only
	transformer language models and shares their pre-norm, causal multi-head
	self-attention layout. It differs in the details: TinyBuddy uses LayerNorm,
	a GELU MLP, and learned absolute position embeddings, where LLaMA uses
	RMSNorm, SwiGLU, and rotary position embeddings. The implementation is
	from-scratch PyTorch and does not include any LLaMA weights.

	Built with Llama.