Eeppa committed on
Commit
06513db
·
verified ·
1 Parent(s): 5dc55c1

Upload 12 files

README.md ADDED
@@ -0,0 +1,225 @@
+ ---
+ license: mit
+ language:
+ - en
+ library_name: transformers
+ tags:
+ - text-generation
+ - tiny-lm
+ - tinystories
+ - educational
+ - built-with-llama
+ pipeline_tag: text-generation
+ datasets:
+ - roneneldan/TinyStories
+ ---
+
+ # TinyBuddy-30M
+
+ > ⚠️ **Educational / demo model.** TinyBuddy-30M is a from-scratch tiny GPT-style
+ > language model (~30M parameters) trained for ~12 minutes on a 2-core CPU.
+ > It is **not** a useful assistant; it is a working end-to-end demonstration
+ > of the LM training pipeline. See the [Limitations](#limitations) section.
+
+ ## Model description
+
+ TinyBuddy-30M is a small decoder-only Transformer language model trained on a
+ slice of the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories)
+ dataset. The architecture is a standard pre-norm GPT-style stack
+ (LayerNorm + Causal Multi-Head Self-Attention + GELU MLP) inspired by the
+ LLaMA / GPT family of decoder-only models.
+
+ | Hyperparameter | Value |
+ | --- | --- |
+ | Parameters | **30,371,840** (~30.37M) |
+ | Layers | 6 |
+ | Attention heads | 8 |
+ | Embedding dim | 256 |
+ | MLP hidden dim | 1024 (mlp_ratio = 4) |
+ | Context length (`block_size`) | 512 |
+ | Vocab size | 50,000 (BPE; ~18k actually used) |
+ | Activation | GELU |
+ | Norm | LayerNorm (pre-norm) |
+ | Attention | Causal SDPA |
+ | Position embeddings | Learned absolute |
+ | Weight tying | No (separate LM head) |
+ | Precision | float32 |
+
+ Most of the parameter budget lives in the token embedding + LM head
+ (~25.6M of 30M). This is typical for small LMs.
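
The embedding share quoted above can be checked by hand. The sketch below is not part of the commit: `count_params` is a hypothetical helper, and it assumes the layer layout shown in `modeling_tinybuddy.py` (biased attention and MLP linears, LayerNorm with weight and bias, a bias-free LM head). Under that assumption a 512-position embedding table gives 30,470,144 parameters, while the 30,371,840 reported in the table is what a 128-position table (the training `block_size`) gives.

```python
def count_params(vocab_size=50_000, block_size=512, n_layer=6, n_embd=256,
                 mlp_ratio=4, tie_weights=False):
    """Count parameters of a pre-norm GPT stack (assumed layout, see lead-in)."""
    hidden = mlp_ratio * n_embd
    tok_emb = vocab_size * n_embd                      # token embedding table
    pos_emb = block_size * n_embd                      # learned absolute positions
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    mlp = (n_embd * hidden + hidden) + (hidden * n_embd + n_embd)
    norms = 2 * (2 * n_embd)                           # two LayerNorms per block
    block = attn + mlp + norms
    final_ln = 2 * n_embd
    lm_head = 0 if tie_weights else n_embd * vocab_size  # no bias on the head
    return tok_emb + pos_emb + n_layer * block + final_ln + lm_head


print(f"block_size=512: {count_params():,}")
print(f"block_size=128: {count_params(block_size=128):,}")
print(f"tied head     : {count_params(tie_weights=True):,}")
```

The token embedding and the untied head contribute 12,800,000 parameters each, which is where the ~25.6M figure comes from; the six blocks together add only about 4.7M.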
+
+ ## Training details
+
+ - **Data**: ~22 MB slice of TinyStories (`TinyStoriesV2-GPT4-valid.txt`,
+   27,630 short children's stories, ~5.3M BPE tokens after tokenization).
+ - **Tokenizer**: byte-level BPE trained from scratch on the same slice
+   (saturated at ~18k merges; embedding padded to 50k to hit the 30M target).
+ - **Optimizer**: AdamW, β=(0.9, 0.95), weight_decay=0.1, grad clip 1.0.
+ - **Schedule**: cosine decay from 5e-4 → 5e-5 with 100-step linear warmup.
+ - **Batch**: `batch_size=4`, `block_size=128` (≈ 512 tokens / step).
+ - **Steps**: **1,500** (≈ 0.77M tokens seen, roughly **0.2% of one epoch**
+   of full TinyStories).
+ - **Hardware**: 2 CPU cores, ~2 GB RAM, ~**12 minutes** wall time
+   (≈16 min including evals).
+ - **Final loss**: **train ≈ 3.53 / val ≈ 3.43** (~3.55 averaged).
+   Perplexity ≈ 30, well above the ≈ 4–5 a properly-trained TinyStories
+   model of this size reaches.
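
The schedule and token budget in the list above can be sketched as follows. The training script is not part of this commit, so `lr_at` is a hypothetical helper and the exact warmup form (linear ramp ending at the peak on the last warmup step) is an assumption; only the peak, floor, warmup length, step count, and batch shape come from the README.

```python
import math


def lr_at(step, max_lr=5e-4, min_lr=5e-5, warmup_steps=100, total_steps=1500):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr if called past the last step
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 99, 100, 800, 1500):
    print(f"step {step:4d}  lr {lr_at(step):.2e}")

# Token budget: batch_size * block_size tokens per step, 1,500 steps.
tokens_per_step = 4 * 128
tokens_seen = tokens_per_step * 1500
print(tokens_per_step, tokens_seen)
```

With these numbers each step covers 512 tokens and the whole run covers 768,000, which matches the ≈ 0.77M quoted in the list.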
+
+ Loss curve (training log):
+
+ ```
+ step    0 | train 10.88 | val 10.88
+ step  150 | train 4.83 | val 4.68
+ step  300 | train 4.32 | val 4.28
+ step  600 | train 3.85 | val 3.90
+ step  900 | train 3.71 | val 3.77
+ step 1200 | train 3.57 | val 3.55
+ step 1500 | train 3.53 | val 3.43
+ ```
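
Perplexity is the exponential of the cross-entropy loss (in nats), so the log above converts directly. The short sketch below is an illustration added here, not part of the commit; it also compares the step-0 loss with the loss of a uniform guess over the 50,000-token vocabulary, a common sanity check for a freshly initialised model.

```python
import math

# Final losses taken from the last row of the training log.
final_losses = {"train": 3.53, "val": 3.43}
perplexity = {split: math.exp(loss) for split, loss in final_losses.items()}

for split, ppl in perplexity.items():
    print(f"{split}: loss {final_losses[split]:.2f} -> perplexity {ppl:.1f}")

# A model that assigns equal probability to every token has loss ln(vocab_size).
uniform_loss = math.log(50_000)
print(f"uniform-guess loss: {uniform_loss:.2f}")
```

The validation perplexity comes out near 31 and the training perplexity near 34, and the uniform-guess loss of about 10.82 sits close to the 10.88 logged at step 0.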
+
+ ## Usage
+
+ This model uses **custom modeling code**, so you must pass
+ `trust_remote_code=True` when loading it.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ repo = "YOUR_USERNAME/TinyBuddy-30M"  # or local path to this folder
+
+ tokenizer = AutoTokenizer.from_pretrained(repo)
+ model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
+ model.eval()
+
+ prompt = "Once upon a time, there was a little girl named Lily."
+ # `encode` may return a plain list of ids or an object exposing `.ids`.
+ encoded = tokenizer.encode(prompt)
+ input_ids = torch.tensor([encoded.ids if hasattr(encoded, "ids") else encoded])
+
+ # TinyBuddy ships a custom `.generate(...)` (top-k sampling). Use it directly:
+ out = model.generate(input_ids, max_new_tokens=120, temperature=0.8, top_k=50)
+ print(tokenizer.decode(out[0].tolist()))
+ ```
+
+ If you prefer to bypass `transformers` entirely, you can use the raw
+ `tokenizers` library + the included modeling file:
+
+ ```python
+ from tokenizers import Tokenizer
+ from safetensors.torch import load_file
+ from modeling_tinybuddy import TinyGPT, GPTConfig
+ import json, torch
+
+ cfg = GPTConfig(**{k: v for k, v in json.load(open("config.json")).items()
+                    if k in GPTConfig.__dataclass_fields__})
+ model = TinyGPT(cfg)
+ model.load_state_dict(load_file("model.safetensors"))
+ model.eval()
+
+ tok = Tokenizer.from_file("tokenizer.json")
+ ids = tok.encode("Once upon a time").ids
+ out = model.generate(torch.tensor([ids]), max_new_tokens=80, temperature=0.8, top_k=50)
+ print(tok.decode(out[0].tolist()))
+ ```
+
+ ## Example outputs
+
+ **Prompt:** *"Once upon a time, there was a little girl named Lily."*
+
+ > Once upon a time, there was a little girl named Lily. They loved to play
+ > with their parents. One day, Tom went to the park. The sun loved the box
+ > and had many friends. One day, they went for a small tree, a lot of friends.
+ > He said, "What is better. But you want to find your friends, Bob?" …
+
+ **Prompt:** *"Tom and Sam were playing in the park when"*
+
+ > Tom and Sam were playing in the park when they were very much. Once upon a
+ > time, there was a girl named The cat with her mom. They had a little girl
+ > named Mia. She loved to play with her friends and play with her mom. …
+
+ ## Limitations
+
+ **Be honest with yourself: this model is bad, and that is expected.**
+
+ What works ✅
+ - Vocabulary & register match TinyStories (short sentences, character names
+   like Tim/Lily/Spot, motifs like "Once upon a time", "the park").
+ - Local grammar is mostly intact (subject–verb–object, quoted dialogue,
+   punctuation).
+ - Document boundaries (`<|endoftext|>`) are respected.
+
+ What's broken ❌
+ - **No narrative coherence** across more than one or two sentences.
+ - **Character drift**: characters appear, vanish, or swap names mid-story.
+ - **Pronoun confusion** ("They" referring to a single girl).
+ - **Ungrammatical fragments** ("She found a very happy.").
+ - **Repetition loops** ("play with X. play with Y. play with Z.").
+ - **No factual knowledge, no reasoning, no instruction following.**
+
+ ### Why
+
+ | Factor | This model | A good TinyStories-class model |
+ | --- | --- | --- |
+ | Tokens seen | ~0.77 M | ~10⁹+ |
+ | Hardware | 2 CPU cores | 1+ GPUs |
+ | Wall time | ~12 min | many hours |
+ | Final loss | ~3.5 | ~1.3–1.6 |
+ | Perplexity | ~30 | ~4–5 |
+
+ This is roughly **3–4 orders of magnitude less compute** than a serious
+ TinyStories training run. The architecture and pipeline are correct; only
+ the optimization budget is tiny.
+
+ ### Intended use
+
+ - ✅ Educational reference for building / training / packaging a small LM.
+ - ✅ Sanity-checking a training pipeline.
+ - ✅ Demonstrating safetensors + Hugging Face Hub packaging.
+ - ❌ **Not** for any production, user-facing, or assistive use case.
+ - ❌ **Not** a source of factual information.
+ - ❌ **Not** safe for inputs from untrusted users (no safety training).
+
+ ## Bias, risks, and safety
+
+ The training data is TinyStories: synthetic children's stories generated
+ by GPT-3.5/4. The model has not undergone any safety, RLHF, or
+ instruction-tuning step. It may produce nonsensical, biased, or repetitive
+ output, and should not be deployed in any setting where output quality or
+ safety matters.
+
+ ## License
+
+ MIT.
+
+ ## Citation
+
+ If you use this code or model in teaching materials, please cite as:
+
+ ```
+ @misc{tinybuddy30m,
+   title = {TinyBuddy-30M: a from-scratch ~30M-parameter transformer trained on TinyStories},
+   year = {2026},
+   note = {Educational demonstration model.}
+ }
+ ```
+
+ And please cite TinyStories:
+
+ ```
+ @article{eldan2023tinystories,
+   title = {TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
+   author = {Eldan, Ronen and Li, Yuanzhi},
+   journal = {arXiv preprint arXiv:2305.07759},
+   year = {2023}
+ }
+ ```
+
+ ## Built with Llama
+
+ This model's architecture is inspired by the LLaMA family of decoder-only
+ transformer language models (pre-norm, causal multi-head self-attention,
+ GELU MLP). The implementation is from-scratch PyTorch and does not include
+ any LLaMA weights, but follows the same overall design pattern.
+
+ **Built with Llama.**
__init__.py ADDED
File without changes
config.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "vocab_size": 50000,
+   "block_size": 512,
+   "n_layer": 6,
+   "n_head": 8,
+   "n_embd": 256,
+   "mlp_ratio": 4,
+   "dropout": 0.0,
+   "tie_weights": false,
+   "architectures": ["TinyGPT"],
+   "auto_map": {
+     "AutoModelForCausalLM": "modeling_tinybuddy.TinyGPT"
+   },
+   "torch_dtype": "float32"
+ }
configuration_tinybuddy.py ADDED
@@ -0,0 +1,17 @@
+ """
+ Configuration class for TinyBuddy-30M.
+ """
+
+ from dataclasses import dataclass
+
+
+ @dataclass
+ class GPTConfig:
+     vocab_size: int = 50000
+     block_size: int = 512        # max context length
+     n_layer: int = 6
+     n_head: int = 8
+     n_embd: int = 256
+     mlp_ratio: int = 4           # hidden = mlp_ratio * n_embd
+     dropout: float = 0.0
+     tie_weights: bool = False    # False -> ~30M params; True -> ~17.7M (drops the 12.8M head)
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "max_new_tokens": 120,
+   "temperature": 0.8,
+   "top_k": 50,
+   "do_sample": true,
+   "eos_token_id": 50256,
+   "pad_token_id": 50256,
+   "repetition_penalty": 1.0
+ }
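
Special-token ids have to index the embedding table, so they must lie in `[0, vocab_size)`. The sketch below is an illustrative check, not part of the commit; `out_of_range_ids` is a hypothetical helper. It applies that rule to the values in this file with the `vocab_size` of 50000 from `config.json`. The id 50256 is the one GPT-2's tokenizer assigns to `<|endoftext|>`; which id the tokenizer in this commit assigns cannot be read from the diff, because `tokenizer.json` is not rendered.

```python
def out_of_range_ids(generation_config, vocab_size):
    """Return the *_token_id entries that cannot index a vocab_size-row embedding."""
    return {
        key: value
        for key, value in generation_config.items()
        if key.endswith("_token_id")
        and isinstance(value, int)
        and not 0 <= value < vocab_size
    }


# Values copied from generation_config.json above.
generation_config = {
    "max_new_tokens": 120,
    "temperature": 0.8,
    "top_k": 50,
    "do_sample": True,
    "eos_token_id": 50256,
    "pad_token_id": 50256,
    "repetition_penalty": 1.0,
}

flagged = out_of_range_ids(generation_config, vocab_size=50000)
print(flagged)
```

Both ids are flagged for a 50,000-row table. The custom `generate` in `modeling_tinybuddy.py` never reads these fields, so the mismatch would only matter to code that does consult `generation_config.json`.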
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:16355bf51fd05e9425e5139d8b592a754f80545e521bdb16fd2c5474dde48d19
+ size 121494456
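
The LFS pointer size can be cross-checked against the parameter count. A safetensors file is an 8-byte header length, a JSON header, and then the raw tensor bytes, so for an all-float32 checkpoint the size should be four bytes per parameter plus a small header. The arithmetic below is a sketch under that all-float32 assumption and is not part of the commit.

```python
size_bytes = 121_494_456          # from the LFS pointer above
reported_params = 30_371_840      # from the README hyperparameter table
bytes_per_param = 4               # float32, per config.json "torch_dtype"

payload = reported_params * bytes_per_param
overhead = size_bytes - payload   # what is left for the length prefix + JSON header
print(f"tensor payload: {payload:,} bytes")
print(f"header overhead: {overhead:,} bytes")

# For comparison: the count implied by a 512-row position table
# (block_size=512 with the layer layout in modeling_tinybuddy.py).
params_with_512_positions = 30_470_144
print(params_with_512_positions * bytes_per_param > size_bytes)
```

The reported count leaves 7,096 bytes for the header, a plausible size for a few dozen tensor entries. A float32 checkpoint with a 512-row position table would need more bytes than the file contains.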
modeling_tinybuddy.py ADDED
@@ -0,0 +1,169 @@
+ """
+ Tiny GPT-style transformer (~30M params target).
+
+ Config:
+   - 6 layers
+   - 8 heads
+   - d_model = 256
+   - vocab_size = 50000 (chosen to push param count up to ~30M, since the
+     transformer blocks themselves only have ~4.7M params at d_model=256/L=6;
+     the embedding + untied LM head dominates the parameter budget.)
+
+ Parameter accounting (approx):
+   Token embedding   : 50000 * 256                    = 12,800,000
+   LM head (untied)  : 256 * 50000 (no bias)          = 12,800,000
+   Positional emb    : 512 * 256                      =    131,072
+   Per block (x6):
+     attn (qkv+out)  : 4 * 256 * 256 + 4*256          =    263,168
+     mlp  (2 linear) : 256*1024 + 1024 + 1024*256+256 =    525,568
+     2x LayerNorm    : 4 * 256                        =      1,024
+     block total                                      =    789,760
+   Blocks total      : 6 * 789,760                    =  4,738,560
+   Final LN          :                                         512
+   ---------------------------------------------------------
+   TOTAL ~ 17.7M (tied) or ~30.5M (untied lm head) -> ~30M ✓
+ """
+
+ import math
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from dataclasses import dataclass
+
+
+ @dataclass
+ class GPTConfig:
+     vocab_size: int = 50000
+     block_size: int = 512        # max context length
+     n_layer: int = 6
+     n_head: int = 8
+     n_embd: int = 256
+     mlp_ratio: int = 4           # hidden = 4 * n_embd
+     dropout: float = 0.0
+     tie_weights: bool = False    # False -> ~30M params; True -> ~17.7M (drops the 12.8M head)
+
+
+ class CausalSelfAttention(nn.Module):
+     def __init__(self, cfg: GPTConfig):
+         super().__init__()
+         assert cfg.n_embd % cfg.n_head == 0
+         self.n_head = cfg.n_head
+         self.n_embd = cfg.n_embd
+         self.head_dim = cfg.n_embd // cfg.n_head
+         self.qkv = nn.Linear(cfg.n_embd, 3 * cfg.n_embd, bias=True)
+         self.proj = nn.Linear(cfg.n_embd, cfg.n_embd, bias=True)
+         self.drop = nn.Dropout(cfg.dropout)
+         # causal mask (not read in forward; SDPA's is_causal=True does the masking)
+         mask = torch.tril(torch.ones(cfg.block_size, cfg.block_size)).bool()
+         self.register_buffer("mask", mask, persistent=False)
+
+     def forward(self, x):
+         B, T, C = x.shape
+         qkv = self.qkv(x)
+         q, k, v = qkv.split(self.n_embd, dim=2)
+         q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
+         k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
+         v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
+         # use PyTorch's fused SDPA (faster on CPU than manual)
+         y = F.scaled_dot_product_attention(q, k, v, is_causal=True,
+                                            dropout_p=self.drop.p if self.training else 0.0)
+         y = y.transpose(1, 2).contiguous().view(B, T, C)
+         return self.proj(y)
+
+
+ class MLP(nn.Module):
+     def __init__(self, cfg: GPTConfig):
+         super().__init__()
+         hidden = cfg.mlp_ratio * cfg.n_embd
+         self.fc1 = nn.Linear(cfg.n_embd, hidden, bias=True)
+         self.fc2 = nn.Linear(hidden, cfg.n_embd, bias=True)
+         self.drop = nn.Dropout(cfg.dropout)
+
+     def forward(self, x):
+         return self.drop(self.fc2(F.gelu(self.fc1(x))))
+
+
+ class Block(nn.Module):
+     def __init__(self, cfg: GPTConfig):
+         super().__init__()
+         self.ln1 = nn.LayerNorm(cfg.n_embd)
+         self.attn = CausalSelfAttention(cfg)
+         self.ln2 = nn.LayerNorm(cfg.n_embd)
+         self.mlp = MLP(cfg)
+
+     def forward(self, x):
+         x = x + self.attn(self.ln1(x))
+         x = x + self.mlp(self.ln2(x))
+         return x
+
+
+ class TinyGPT(nn.Module):
+     def __init__(self, cfg: GPTConfig):
+         super().__init__()
+         self.cfg = cfg
+         self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.n_embd)
+         self.pos_emb = nn.Embedding(cfg.block_size, cfg.n_embd)
+         self.drop = nn.Dropout(cfg.dropout)
+         self.blocks = nn.ModuleList([Block(cfg) for _ in range(cfg.n_layer)])
+         self.ln_f = nn.LayerNorm(cfg.n_embd)
+         self.lm_head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
+         if cfg.tie_weights:
+             self.lm_head.weight = self.tok_emb.weight
+         self.apply(self._init_weights)
+
+     @staticmethod
+     def _init_weights(m):
+         if isinstance(m, nn.Linear):
+             nn.init.normal_(m.weight, mean=0.0, std=0.02)
+             if m.bias is not None:
+                 nn.init.zeros_(m.bias)
+         elif isinstance(m, nn.Embedding):
+             nn.init.normal_(m.weight, mean=0.0, std=0.02)
+
+     def num_params(self, non_embedding=False):
+         n = sum(p.numel() for p in self.parameters())
+         if non_embedding:
+             n -= self.tok_emb.weight.numel() + self.pos_emb.weight.numel()
+             if not self.cfg.tie_weights:
+                 n -= self.lm_head.weight.numel()
+         return n
+
+     def forward(self, idx, targets=None):
+         B, T = idx.shape
+         assert T <= self.cfg.block_size, f"sequence length {T} > block_size {self.cfg.block_size}"
+         pos = torch.arange(T, device=idx.device)
+         x = self.tok_emb(idx) + self.pos_emb(pos)[None, :, :]
+         x = self.drop(x)
+         for blk in self.blocks:
+             x = blk(x)
+         x = self.ln_f(x)
+         logits = self.lm_head(x)
+         loss = None
+         if targets is not None:
+             loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
+                                    targets.view(-1), ignore_index=-100)
+         return logits, loss
+
+     @torch.no_grad()
+     def generate(self, idx, max_new_tokens=100, temperature=1.0, top_k=None):
+         self.eval()
+         for _ in range(max_new_tokens):
+             idx_cond = idx if idx.size(1) <= self.cfg.block_size else idx[:, -self.cfg.block_size:]
+             logits, _ = self(idx_cond)
+             logits = logits[:, -1, :] / max(temperature, 1e-6)
+             if top_k is not None:
+                 v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
+                 logits[logits < v[:, [-1]]] = -float("inf")
+             probs = F.softmax(logits, dim=-1)
+             next_id = torch.multinomial(probs, num_samples=1)
+             idx = torch.cat([idx, next_id], dim=1)
+         return idx
+
+
+ if __name__ == "__main__":
+     cfg = GPTConfig()
+     m = TinyGPT(cfg)
+     total = m.num_params()
+     nonemb = m.num_params(non_embedding=True)
+     print(f"Total params        : {total:,} (~{total/1e6:.2f}M)")
+     print(f"Non-embedding params: {nonemb:,} (~{nonemb/1e6:.2f}M)")
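
The sampling step inside `generate` (temperature scaling, top-k filtering, softmax, one multinomial draw) can be illustrated without PyTorch. The sketch below is a standard-library re-statement for a single position, not the code from this commit; `sample_top_k` is a hypothetical helper, and like the tensor version it keeps every logit that ties with the k-th largest.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one index after temperature scaling and top-k filtering."""
    scaled = [x / max(temperature, 1e-6) for x in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        threshold = sorted(scaled, reverse=True)[k - 1]   # k-th largest logit
        scaled = [x if x >= threshold else -math.inf for x in scaled]
    peak = max(scaled)                                     # subtract max for stability
    weights = [math.exp(x - peak) for x in scaled]         # exp(-inf) == 0.0
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
draws = [sample_top_k(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

With `top_k=2` only the two highest-scoring indices can ever be drawn, and lowering the temperature shifts more of the probability mass onto the top one.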
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "unk_token": "<|unk|>",
+   "pad_token": "<|endoftext|>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "backend": "tokenizers",
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "model_max_length": 512,
+   "pad_token": "<|endoftext|>",
+   "tokenizer_class": "TokenizersBackend",
+   "unk_token": "<|unk|>"
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff