Commit de58358 by Eeppa · verified · 1 Parent(s): 1273500

Upload 12 files

README.md ADDED
@@ -0,0 +1,103 @@
+ ---
+ license: mit
+ language:
+ - en
+ library_name: transformers
+ tags:
+ - text-generation
+ - tiny-lm
+ - tinystories
+ - educational
+ - built-with-llama
+ - small-model
+ pipeline_tag: text-generation
+ datasets:
+ - roneneldan/TinyStories
+ ---
+
+ # TinyBuddy-500K
+
+ > ⚠️ **Educational / experimental model.** TinyBuddy-500K is a from-scratch tiny Llama-style language model (~547K parameters) trained on a synthetic slice of TinyStories-style text.
+ > It is **not** a useful assistant; it is a working demonstration of training extremely small models from scratch. See the [Limitations](#limitations) section.
+
+ ## Model description
+
+ TinyBuddy-500K is a very small decoder-only Transformer language model trained on synthetic children's stories in the style of [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories). The architecture follows the LLaMA design (RMSNorm, Grouped Query Attention, SiLU MLP, tied embeddings).
+
+ | Hyperparameter          | Value                            |
+ |-------------------------|----------------------------------|
+ | Parameters              | **547,296** (~547K)              |
+ | Layers                  | 2                                |
+ | Attention heads         | 4                                |
+ | Key-Value heads (GQA)   | 2                                |
+ | Hidden size             | 96                               |
+ | MLP intermediate size   | 384                              |
+ | Context length          | 512                              |
+ | Vocab size              | 2,048 (BPE trained from scratch) |
+ | Norm                    | RMSNorm                          |
+ | Activation              | SiLU                             |
+ | Position embeddings     | Learned absolute                 |
+ | Weight tying            | Yes (tied embeddings)            |
+ | Precision               | float32                          |
+
+ ## Training details
+
+ - **Data**: Synthetic TinyStories-style corpus (~128K tokens)
+ - **Tokenizer**: Custom byte-level BPE with a 2,048-token vocabulary
+ - **Optimizer**: AdamW
+ - **Steps**: ~300 on CPU
+ - **Hardware**: Single CPU core
+ - **Final loss**: ~0.17
+
+ ## Usage
+
+ This model uses **custom modeling code**, so you must pass `trust_remote_code=True`.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ repo = "Eeppa/TinyBuddy-500K"
+
+ tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
+ model.eval()
+
+ prompt = "Once upon a time, there was a little girl named Lily."
+ input_ids = tokenizer.encode(prompt, return_tensors="pt")
+
+ out = model.generate(input_ids, max_new_tokens=60, temperature=0.8, top_k=50)
+ print(tokenizer.decode(out[0], skip_special_tokens=True))
+ ```
+
+ ## Limitations
+
+ This model is extremely small and was trained for a very short time on limited data.
+
+ **What works**:
+ - Basic English patterns and short sentence structure
+ - Simple story-like generation
+
+ **What's broken**:
+ - Very limited coherence (usually breaks after 1–2 sentences)
+ - High repetition
+ - Poor long-range consistency
+ - No real reasoning or factual knowledge
+
+ This model exists purely for educational purposes to explore the lower limits of language model size.
+
+ ## License
+
+ MIT
+
+ ## Citation
+
+ ```bibtex
+ @misc{tinybuddy500k,
+   title = {TinyBuddy-500K: An educational ~500K parameter Llama-style model trained on TinyStories},
+   year = {2026},
+   note = {Educational demonstration of extremely small language models.}
+ }
+ ```
+
+ **Built with Llama.**
__init__.py ADDED
@@ -0,0 +1,5 @@
+ # TinyBuddy-500K package
+ from .modeling_tinybuddy import TinyBuddyForCausalLM
+ from .configuration_tinybuddy import TinyBuddyConfig
+
+ __all__ = ["TinyBuddyForCausalLM", "TinyBuddyConfig"]
config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "_name_or_path": "Eeppa/TinyBuddy-500K",
+   "architectures": ["TinyBuddyForCausalLM"],
+   "auto_map": {
+     "AutoConfig": "configuration_tinybuddy.TinyBuddyConfig",
+     "AutoModelForCausalLM": "modeling_tinybuddy.TinyBuddyForCausalLM"
+   },
+   "model_type": "tinybuddy",
+   "vocab_size": 2048,
+   "hidden_size": 96,
+   "num_hidden_layers": 2,
+   "num_attention_heads": 4,
+   "num_key_value_heads": 2,
+   "intermediate_size": 384,
+   "max_position_embeddings": 512,
+   "rms_norm_eps": 1e-6,
+   "tie_word_embeddings": true,
+   "bos_token_id": 2,
+   "eos_token_id": 2,
+   "pad_token_id": 0,
+   "transformers_version": "4.40.0",
+   "torch_dtype": "float32"
+ }
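
As a cross-check on these values, the parameter count can be estimated by hand from the layer shapes that `modeling_tinybuddy.py` builds. The sketch below is an illustration, not part of the upload: `count_parameters` is a hypothetical helper, and it assumes bias-free linear layers, tied input/output embeddings, and no separate position-embedding table, which is what the uploaded modeling code defines. Under those assumptions the total is 473,568, which is lower than the 547,296 quoted in the README, so the two figures are worth reconciling against the checkpoint itself.

```python
def count_parameters(cfg: dict) -> int:
    """Estimate trainable parameters from config values (hypothetical helper)."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_dim = cfg["num_key_value_heads"] * head_dim

    embed = cfg["vocab_size"] * h            # token embeddings (lm_head shares them when tied)
    attn = h * h + 2 * h * kv_dim + h * h    # q_proj, k_proj + v_proj, o_proj (no biases)
    mlp = 3 * h * cfg["intermediate_size"]   # gate_proj, up_proj, down_proj (no biases)
    norms = 2 * h                            # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms

    total = embed + cfg["num_hidden_layers"] * per_layer + h  # + final RMSNorm
    if not cfg.get("tie_word_embeddings", True):
        total += cfg["vocab_size"] * h       # separate lm_head matrix when untied
    return total


# Values copied from config.json.
config = {
    "vocab_size": 2048,
    "hidden_size": 96,
    "num_hidden_layers": 2,
    "num_attention_heads": 4,
    "num_key_value_heads": 2,
    "intermediate_size": 384,
    "tie_word_embeddings": True,
}

print(count_parameters(config))  # 473568
```

The breakdown is 196,608 for the embedding table, 138,432 per decoder layer, and 96 for the final norm.
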
configuration_tinybuddy.py ADDED
@@ -0,0 +1,39 @@
+ """
+ TinyBuddyConfig for TinyBuddy-500K
+ """
+
+ from transformers import PretrainedConfig
+
+
+ class TinyBuddyConfig(PretrainedConfig):
+     model_type = "tinybuddy"
+
+     def __init__(
+         self,
+         vocab_size=2048,
+         hidden_size=96,
+         num_hidden_layers=2,
+         num_attention_heads=4,
+         num_key_value_heads=2,
+         intermediate_size=384,
+         max_position_embeddings=512,
+         rms_norm_eps=1e-6,
+         tie_word_embeddings=True,
+         bos_token_id=2,
+         eos_token_id=2,
+         pad_token_id=0,
+         **kwargs,
+     ):
+         super().__init__(**kwargs)
+         self.vocab_size = vocab_size
+         self.hidden_size = hidden_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.num_key_value_heads = num_key_value_heads
+         self.intermediate_size = intermediate_size
+         self.max_position_embeddings = max_position_embeddings
+         self.rms_norm_eps = rms_norm_eps
+         self.tie_word_embeddings = tie_word_embeddings
+         self.bos_token_id = bos_token_id
+         self.eos_token_id = eos_token_id
+         self.pad_token_id = pad_token_id
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "max_new_tokens": 80,
+   "temperature": 0.8,
+   "top_k": 50,
+   "do_sample": true,
+   "eos_token_id": 2,
+   "pad_token_id": 0,
+   "repetition_penalty": 1.1
+ }
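
The custom `generate` in `modeling_tinybuddy.py` accepts extra keyword arguments but only acts on `max_new_tokens`, `temperature`, and `top_k`, so a `repetition_penalty` like the one above would have to be applied explicitly. Below is a minimal sketch of the commonly used rule, in which the score of every already-generated token is divided by the penalty when positive and multiplied by it when negative. It works on plain Python lists so that it runs without PyTorch; `apply_repetition_penalty` is a hypothetical helper, not something in the repository.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Return a copy of `logits` with previously generated tokens made less likely."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive score or multiplying a negative one both lower it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


# Token 0 and token 1 were already generated; token 2 was not.
penalised = apply_repetition_penalty([4.0, -1.0, 0.5], [0, 1, 0], penalty=2.0)
print(penalised)  # [2.0, -2.0, 0.5]
```

In a sampling loop this step would typically run on the last-position logits before temperature scaling and top-k filtering.
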
merges.txt ADDED
@@ -0,0 +1,23 @@
+ #version: 0.2
+ a e
+ t h
+ i n
+ o n
+ s t
+ r e
+ l e
+ d e
+ u s
+ m e
+ w a
+ f o
+ g o
+ y o
+ p a
+ b e
+ k i
+ v e
+ j u
+ x a
+ z e
+ q u
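
Each line after the version header is a merge rule, and earlier lines take priority. The following is a minimal sketch of how such a ranked list is applied to a single word, assuming the standard greedy BPE procedure of repeatedly merging the adjacent pair with the best rank. `apply_merges` is a hypothetical helper; a real byte-level tokenizer also performs pre-tokenization and byte mapping, which are omitted here.

```python
def apply_merges(word, merges):
    """Greedily apply ranked BPE merges to one word and return its symbols."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Rank every adjacent pair; pairs without a merge rule rank as infinity.
        candidates = [
            (ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no applicable merge rule is left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# The first five rules from merges.txt.
merges = [("a", "e"), ("t", "h"), ("i", "n"), ("o", "n"), ("s", "t")]
print(apply_merges("thin", merges))   # ['th', 'in']
print(apply_merges("stone", merges))  # ['st', 'on', 'e']
```
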
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:79cbf4a0790677946075a0cb32c455f830699535ff46adefd89c811b66b2593b
+ size 2977648
modeling_tinybuddy.py ADDED
@@ -0,0 +1,153 @@
+ """
+ TinyBuddy-500K: Educational ~500K parameter Llama-style model
+ MIT License
+ """
+
+ from dataclasses import dataclass
+ from typing import Optional
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from transformers import PreTrainedModel, PretrainedConfig
+ from transformers.modeling_outputs import CausalLMOutputWithPast
+
+
+ @dataclass
+ class TinyBuddyConfig(PretrainedConfig):
+     model_type = "tinybuddy"
+
+     vocab_size: int = 2048
+     hidden_size: int = 96
+     num_hidden_layers: int = 2
+     num_attention_heads: int = 4
+     num_key_value_heads: int = 2
+     intermediate_size: int = 384
+     max_position_embeddings: int = 512
+     rms_norm_eps: float = 1e-6
+     tie_word_embeddings: bool = True
+     bos_token_id: int = 2
+     eos_token_id: int = 2
+
+     def __init__(self, **kwargs):
+         super().__init__(**kwargs)
+         for k, v in kwargs.items():
+             setattr(self, k, v)
+
+
+ class RMSNorm(nn.Module):
+     def __init__(self, hidden_size, eps=1e-6):
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones(hidden_size))
+         self.eps = eps
+
+     def forward(self, x):
+         variance = x.pow(2).mean(-1, keepdim=True)
+         x = x * torch.rsqrt(variance + self.eps)
+         return self.weight * x
+
+
+ class GroupedQueryAttention(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.num_heads = config.num_attention_heads
+         self.num_kv_heads = config.num_key_value_heads
+         self.head_dim = config.hidden_size // self.num_heads
+
+         self.q_proj = nn.Linear(config.hidden_size, self.num_heads * self.head_dim, bias=False)
+         self.k_proj = nn.Linear(config.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
+         self.v_proj = nn.Linear(config.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
+         self.o_proj = nn.Linear(self.num_heads * self.head_dim, config.hidden_size, bias=False)
+
+     def forward(self, x):
+         B, T, _ = x.shape
+         q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
+         k = self.k_proj(x).view(B, T, self.num_kv_heads, self.head_dim).transpose(1, 2)
+         v = self.v_proj(x).view(B, T, self.num_kv_heads, self.head_dim).transpose(1, 2)
+
+         k = k.repeat_interleave(self.num_heads // self.num_kv_heads, dim=1)
+         v = v.repeat_interleave(self.num_heads // self.num_kv_heads, dim=1)
+
+         scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
+         attn = F.softmax(scores, dim=-1)
+         out = torch.matmul(attn, v)
+         out = out.transpose(1, 2).contiguous().view(B, T, self.num_heads * self.head_dim)
+         return self.o_proj(out)
+
+
+ class MLP(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
+         self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
+         self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)
+
+     def forward(self, x):
+         return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
+
+
+ class DecoderLayer(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.self_attn = GroupedQueryAttention(config)
+         self.mlp = MLP(config)
+         self.input_layernorm = RMSNorm(config.hidden_size, config.rms_norm_eps)
+         self.post_attention_layernorm = RMSNorm(config.hidden_size, config.rms_norm_eps)
+
+     def forward(self, x):
+         residual = x
+         x = self.input_layernorm(x)
+         x = self.self_attn(x)
+         x = residual + x
+
+         residual = x
+         x = self.post_attention_layernorm(x)
+         x = self.mlp(x)
+         x = residual + x
+         return x
+
+
+ class TinyBuddyForCausalLM(PreTrainedModel):
+     config_class = TinyBuddyConfig
+     base_model_prefix = "tinybuddy"
+
+     def __init__(self, config):
+         super().__init__(config)
+         self.config = config
+         self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
+         self.layers = nn.ModuleList([DecoderLayer(config) for _ in range(config.num_hidden_layers)])
+         self.norm = RMSNorm(config.hidden_size, config.rms_norm_eps)
+         self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+         if config.tie_word_embeddings:
+             self.lm_head.weight = self.embed_tokens.weight
+
+         self.post_init()
+
+     def forward(self, input_ids, labels=None, **kwargs):
+         x = self.embed_tokens(input_ids)
+         for layer in self.layers:
+             x = layer(x)
+         x = self.norm(x)
+         logits = self.lm_head(x)
+
+         loss = None
+         if labels is not None:
+             loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
+
+         return CausalLMOutputWithPast(loss=loss, logits=logits)
+
+     @torch.no_grad()
+     def generate(self, input_ids, max_new_tokens=50, temperature=0.8, top_k=50, **kwargs):
+         for _ in range(max_new_tokens):
+             logits = self(input_ids).logits[:, -1, :] / temperature
+             if top_k is not None:
+                 v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
+                 logits[logits < v[:, [-1]]] = -float("Inf")
+             probs = F.softmax(logits, dim=-1)
+             next_token = torch.multinomial(probs, num_samples=1)
+             input_ids = torch.cat([input_ids, next_token], dim=1)
+         return input_ids
+
+
+ TinyBuddyForCausalLM.register_for_auto_class("AutoModelForCausalLM")
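
`GroupedQueryAttention.forward` in this file takes the softmax over the full score matrix with no mask, so every position can attend to later ones. Decoder-only language models normally restrict position `t` to positions `0..t`. The sketch below shows that masking step on a plain list-of-lists score matrix so that it runs without PyTorch. `causal_softmax` is a hypothetical helper, not code from the repository; inside the module, the equivalent step would be to fill the strict upper triangle of `scores` with `-inf` before calling `F.softmax`.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax in which row t only attends to columns 0..t."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                       # drop future positions
        peak = max(visible)                          # subtract max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - t - 1)     # future positions get zero weight
        weights.append([e / total for e in exps] + masked_tail)
    return weights


scores = [
    [0.0, 5.0, 5.0],
    [1.0, 1.0, 9.0],
    [0.0, 0.0, 0.0],
]
for row in causal_softmax(scores):
    print(row)
```

The large scores in the upper triangle have no effect: the first row puts all weight on position 0, and the second splits it evenly between positions 0 and 1.
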
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "bos_token": "<s>",
+   "eos_token": "</s>",
+   "pad_token": "<pad>",
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "version": "1.0",
+   "truncation": null,
+   "padding": null,
+   "added_tokens": [
+     {"id": 50256, "content": "<|endoftext|>", "special": true, "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}
+   ],
+   "normalizer": null,
+   "pre_tokenizer": {"type": "ByteLevel", "add_prefix_space": false, "use_regex": true},
+   "post_processor": null,
+   "decoder": {"type": "ByteLevel"},
+   "model": {
+     "type": "BPE",
+     "dropout": null,
+     "unk_token": null,
+     "continuing_subword_prefix": "",
+     "end_of_word_suffix": "",
+     "fuse_unk": false,
+     "byte_fallback": false,
+     "vocab": {},
+     "merges": []
+   }
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "model_max_length": 512,
+   "bos_token": "<s>",
+   "eos_token": "</s>",
+   "pad_token": "<pad>",
+   "unk_token": "<unk>"
+ }
vocab.json ADDED
@@ -0,0 +1 @@
+ {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "a": 4, "e": 5, "i": 6, "o": 7, "t": 8, "n": 9, "s": 10, "r": 11, "h": 12, "l": 13, "d": 14, "u": 15, "c": 16, "m": 17, "w": 18, "f": 19, "g": 20, "y": 21, "p": 22, "b": 23, "k": 24, "v": 25, "j": 26, "x": 27, "z": 28, "q": 29}
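
Because `config.json` refers to special tokens by ID while `special_tokens_map.json` refers to them by name, the two can be cross-checked against this vocabulary. A small sketch follows; `find_id_mismatches` is a hypothetical helper, and the dictionaries are copied from the files in this commit. With these values it reports that `</s>` maps to 3 in `vocab.json` while `eos_token_id` is 2.

```python
def find_id_mismatches(vocab, token_names, config_ids):
    """Return {role: (id_in_vocab, id_in_config)} for every disagreement."""
    mismatches = {}
    for role, token in token_names.items():
        key = f"{role}_id"  # e.g. "eos_token" -> "eos_token_id"
        if key in config_ids and vocab.get(token) != config_ids[key]:
            mismatches[role] = (vocab.get(token), config_ids[key])
    return mismatches


# Special-token entries from vocab.json.
vocab = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3}
# From special_tokens_map.json.
token_names = {"bos_token": "<s>", "eos_token": "</s>", "pad_token": "<pad>", "unk_token": "<unk>"}
# From config.json (it defines no unk_token_id, so that role is skipped).
config_ids = {"bos_token_id": 2, "eos_token_id": 2, "pad_token_id": 0}

print(find_id_mismatches(vocab, token_names, config_ids))  # {'eos_token': (3, 2)}
```
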