Upload 12 files

Files changed (12)

README.md +225 -0
__init__.py +0 -0
config.json +15 -0
configuration_tinybuddy.py +17 -0
generation_config.json +9 -0
merges.txt +0 -0
model.safetensors +3 -0
modeling_tinybuddy.py +169 -0
special_tokens_map.json +6 -0
tokenizer.json +0 -0
tokenizer_config.json +9 -0
vocab.json +0 -0

README.md ADDED

	@@ -0,0 +1,225 @@

+---
+license: mit
+language:
+- en
+library_name: transformers
+tags:
+- text-generation
+- tiny-lm
+- tinystories
+- educational
+- built-with-llama
+pipeline_tag: text-generation
+datasets:
+- roneneldan/TinyStories
+---
+# TinyBuddy-30M
+> ⚠️ **Educational / demo model.** TinyBuddy-30M is a from-scratch tiny GPT-style
+> language model (~30M parameters) trained for ~12 minutes on a 2-core CPU.
+> It is **not** a useful assistant — it is a working end-to-end demonstration
+> of the LM training pipeline. See the [Limitations](#limitations) section.
+## Model description
+TinyBuddy-30M is a small decoder-only Transformer language model trained on a
+slice of the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories)
+dataset. The architecture is a standard pre-norm GPT-style stack
+(LayerNorm + causal multi-head self-attention + GELU MLP), closest in design
+to GPT-2. Unlike LLaMA, it uses LayerNorm, GELU, and learned absolute
+positions rather than RMSNorm, SwiGLU, and rotary embeddings.
+| Hyperparameter | Value |
+| --- | --- |
+| Parameters | **30,371,840** (~30.37M) |
+| Layers | 6 |
+| Attention heads | 8 |
+| Embedding dim | 256 |
+| MLP hidden dim | 1024 (mlp_ratio = 4) |
+| Context length (`block_size`) | 512 (training used 128-token sequences) |
+| Vocab size | 50,000 (BPE; ~18k actually used) |
+| Activation | GELU |
+| Norm | LayerNorm (pre-norm) |
+| Attention | Causal SDPA |
+| Position embeddings | Learned absolute |
+| Weight tying | No (separate LM head) |
+| Precision | float32 |
+Most of the parameter budget lives in the token embedding and the LM head
+(12.8M each, ~25.6M of ~30.4M in total); the six transformer blocks hold only
+~4.7M. This is typical for small LMs.
+## Training details
+- **Data**: ~22 MB slice of TinyStories (`TinyStoriesV2-GPT4-valid.txt`,
+  27,630 short children's stories, ~5.3M BPE tokens after tokenization).
+- **Tokenizer**: byte-level BPE trained from scratch on the same slice
+  (saturated at ~18k merges; embedding padded to 50k to hit the 30M target).
+- **Optimizer**: AdamW, β=(0.9, 0.95), weight_decay=0.1, grad clip 1.0.
+- **Schedule**: cosine decay from 5e-4 → 5e-5 with 100-step linear warmup.
+- **Batch**: `batch_size=4`, `block_size=128` (≈ 512 tokens / step).
+- **Steps**: **1,500** (≈ 0.77M tokens seen — roughly **0.2% of one epoch**
+  of full TinyStories).
+- **Hardware**: 2 CPU cores, ~2 GB RAM, ~**12 minutes** wall time
+  (≈16 min including evals).
+- **Final loss**: **train ≈ 3.53 / val ≈ 3.43** (~3.55 averaged).
+  Validation perplexity ≈ exp(3.43) ≈ 31, well above the ≈ 4–5 that a
+  properly trained TinyStories model of this size reaches.
+Loss curve (training log):
+```
+step    0 | train 10.88 | val 10.88
+step  150 | train  4.83 | val  4.68
+step  300 | train  4.32 | val  4.28
+step  600 | train  3.85 | val  3.90
+step  900 | train  3.71 | val  3.77
+step 1200 | train  3.57 | val  3.55
+step 1500 | train  3.53 | val  3.43
+```
+## Usage
+This model uses **custom modeling code**, so you must pass
+`trust_remote_code=True` when loading it.
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+repo = "YOUR_USERNAME/TinyBuddy-30M"   # or local path to this folder
+tokenizer = AutoTokenizer.from_pretrained(repo)
+model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
+model.eval()
+prompt = "Once upon a time, there was a little girl named Lily."
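+# encode() returns a plain list of ids from a transformers tokenizer, or an
+# Encoding object with `.ids` from a raw `tokenizers` tokenizer.
+encoded = tokenizer.encode(prompt)
+input_ids = torch.tensor([encoded.ids if hasattr(encoded, "ids") else encoded])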
+# TinyBuddy ships a custom `.generate(...)` (top-k sampling). Use it directly:
+out = model.generate(input_ids, max_new_tokens=120, temperature=0.8, top_k=50)
+print(tokenizer.decode(out[0].tolist()))
+```
+If you prefer to bypass `transformers` entirely, you can use the raw
+`tokenizers` library + the included modeling file:
+```python
+from tokenizers import Tokenizer
+from safetensors.torch import load_file
+from modeling_tinybuddy import TinyGPT, GPTConfig
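+import json, torch
+with open("config.json") as f:
+    raw_cfg = json.load(f)
+# Keep only the keys GPTConfig defines; config.json also carries Hub metadata.
+cfg = GPTConfig(**{k: v for k, v in raw_cfg.items()
+                   if k in GPTConfig.__dataclass_fields__})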
+model = TinyGPT(cfg)
+model.load_state_dict(load_file("model.safetensors"))
+model.eval()
+tok = Tokenizer.from_file("tokenizer.json")
+ids = tok.encode("Once upon a time").ids
+out = model.generate(torch.tensor([ids]), max_new_tokens=80, temperature=0.8, top_k=50)
+print(tok.decode(out[0].tolist()))
+```
+## Example outputs
+**Prompt:** *"Once upon a time, there was a little girl named Lily."*
+> Once upon a time, there was a little girl named Lily. They loved to play
+> with their parents. One day, Tom went to the park. The sun loved the box
+> and had many friends. One day, they went for a small tree, a lot of friends.
+> He said, "What is better. But you want to find your friends, Bob?" …
+**Prompt:** *"Tom and Sam were playing in the park when"*
+> Tom and Sam were playing in the park when they were very much. Once upon a
+> time, there was a girl named The cat with her mom. They had a little girl
+> named Mia. She loved to play with her friends and play with her mom. …
+## Limitations
+**Be honest with yourself: this model is bad, and that is expected.**
+What works ✅
+- Vocabulary & register match TinyStories (short sentences, character names
+  like Tim/Lily/Spot, motifs like "Once upon a time", "the park").
+- Local grammar is mostly intact (subject–verb–object, quoted dialogue,
+  punctuation).
+- Document boundaries (`<|endoftext|>`) are respected.
+What's broken ❌
+- **No narrative coherence** across more than one or two sentences.
+- **Character drift** — characters appear, vanish, or swap names mid-story.
+- **Pronoun confusion** ("They" referring to a single girl).
+- **Ungrammatical fragments** ("She found a very happy.").
+- **Repetition loops** ("play with X. play with Y. play with Z.").
+- **No factual knowledge, no reasoning, no instruction following.**
+### Why
+| Factor | This model | A good TinyStories-class model |
+| --- | --- | --- |
+| Tokens seen | ~0.77 M | ~10⁹+ |
+| Hardware | 2 CPU cores | 1+ GPUs |
+| Wall time | ~12 min | many hours |
+| Final loss | ~3.5 | ~1.3–1.6 |
+| Perplexity | ~30 | ~4–5 |
+This is roughly **3–4 orders of magnitude less compute** than a serious
+TinyStories training run. The architecture and pipeline are correct; only
+the optimization budget is tiny.
+### Intended use
+- ✅ Educational reference for building / training / packaging a small LM.
+- ✅ Sanity-checking a training pipeline.
+- ✅ Demonstrating safetensors + Hugging Face Hub packaging.
+- ❌ **Not** for any production, user-facing, or assistive use case.
+- ❌ **Not** a source of factual information.
+- ❌ **Not** safe for inputs from untrusted users (no safety training).
+## Bias, risks, and safety
+The training data is TinyStories — synthetic children's stories generated
+by GPT-3.5/4. The model has not undergone any safety, RLHF, or
+instruction-tuning step. It may produce nonsensical, biased, or repetitive
+output, and should not be deployed in any setting where output quality or
+safety matters.
+## License
+MIT.
+## Citation
+If you use this code or model in teaching materials, please cite as:
+```
+@misc{tinybuddy30m,
+  title  = {TinyBuddy-30M: a from-scratch ~30M-parameter transformer trained on TinyStories},
+  year   = {2026},
+  note   = {Educational demonstration model.}
+}
+```
+And please cite TinyStories:
+```
+@article{eldan2023tinystories,
+  title   = {TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
+  author  = {Eldan, Ronen and Li, Yuanzhi},
+  journal = {arXiv preprint arXiv:2305.07759},
+  year    = {2023}
+}
+```
+## Built with Llama
+This model follows the decoder-only transformer pattern shared by the GPT and
+LLaMA families (pre-norm residual blocks with causal multi-head
+self-attention). It uses LayerNorm, a GELU MLP, and learned absolute position
+embeddings rather than LLaMA's RMSNorm, SwiGLU, and rotary embeddings. The
+implementation is from-scratch PyTorch and does not include any LLaMA weights.
+**Built with Llama.**
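
The schedule and token budget quoted in the README's training details can be cross-checked with a short sketch. The exact warmup and decay formulas are not shown in this commit, so the form below (linear warmup over the first 100 steps, then a cosine from 5e-4 down to 5e-5 ending at step 1,500) is an assumption, and `lr_at` is a hypothetical helper name.

```python
import math

# Values quoted in the README's "Training details" section.
MAX_LR, MIN_LR = 5e-4, 5e-5
WARMUP_STEPS, TOTAL_STEPS = 100, 1500
BATCH_SIZE, BLOCK_SIZE = 4, 128


def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR (assumed form)."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # hold MIN_LR if training runs past TOTAL_STEPS
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


tokens_per_step = BATCH_SIZE * BLOCK_SIZE      # 4 sequences of 128 tokens
tokens_seen = tokens_per_step * TOTAL_STEPS    # total tokens over the run
val_perplexity = math.exp(3.43)                # perplexity = exp(cross-entropy)

print(f"lr at step 0/100/1500: {lr_at(0):.1e} / {lr_at(100):.1e} / {lr_at(1500):.1e}")
print(f"tokens seen: {tokens_seen:,}; val perplexity: {val_perplexity:.1f}")
```

Under these assumptions the run sees 768,000 tokens, which matches the README's "≈ 0.77M", and the reported validation loss corresponds to a perplexity of about 31.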

__init__.py ADDED

File without changes

config.json ADDED

	@@ -0,0 +1,15 @@

+{
+  "vocab_size": 50000,
+  "block_size": 512,
+  "n_layer": 6,
+  "n_head": 8,
+  "n_embd": 256,
+  "mlp_ratio": 4,
+  "dropout": 0.0,
+  "tie_weights": false,
+  "architectures": ["TinyGPT"],
+  "auto_map": {
+    "AutoModelForCausalLM": "modeling_tinybuddy.TinyGPT"
+  },
+  "torch_dtype": "float32"
+}
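
The parameter count implied by this config can be reproduced with plain arithmetic. The sketch below mirrors the layer shapes in `modeling_tinybuddy.py` (biased `qkv`, `proj`, `fc1`, `fc2`; two LayerNorms per block; an LM head without bias); `count_params` is a hypothetical helper, not part of the repository.

```python
def count_params(vocab_size, block_size, n_layer, n_embd, mlp_ratio, tie_weights):
    """Count trainable parameters for the TinyGPT layout in modeling_tinybuddy.py."""
    hidden = mlp_ratio * n_embd
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    # lm_head is nn.Linear(..., bias=False); a tied head adds no parameters.
    lm_head = 0 if tie_weights else vocab_size * n_embd
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    mlp = (n_embd * hidden + hidden) + (hidden * n_embd + n_embd)
    norms = 2 * (2 * n_embd)          # two LayerNorms, weight + bias each
    final_ln = 2 * n_embd
    return tok_emb + pos_emb + lm_head + n_layer * (attn + mlp + norms) + final_ln


shared = dict(vocab_size=50000, n_layer=6, n_embd=256, mlp_ratio=4)
total_512 = count_params(block_size=512, tie_weights=False, **shared)
total_512_tied = count_params(block_size=512, tie_weights=True, **shared)
total_128 = count_params(block_size=128, tie_weights=False, **shared)
print(f"block_size=512 untied: {total_512:,}")
print(f"block_size=512 tied  : {total_512_tied:,}")
print(f"block_size=128 untied: {total_128:,}")
```

With `block_size=512` this accounting gives 30,470,144 parameters (17,670,144 with tied weights). The README's figure of 30,371,840 is what the same formula yields for `block_size=128`, the sequence length the README reports for training, so it may be worth confirming that the position-embedding table in the checkpoint has the 512 rows this config asks for.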

configuration_tinybuddy.py ADDED

	@@ -0,0 +1,17 @@

+"""
+Configuration class for TinyBuddy-30M.
+"""
+from dataclasses import dataclass
+@dataclass
+class GPTConfig:
+    vocab_size: int = 50000
+    block_size: int = 512      # max context length
+    n_layer: int = 6
+    n_head: int = 8
+    n_embd: int = 256
+    mlp_ratio: int = 4         # hidden = mlp_ratio * n_embd
+    dropout: float = 0.0
+    tie_weights: bool = False  # False -> ~30M params; True -> ~17.7M

generation_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "max_new_tokens": 120,
+  "temperature": 0.8,
+  "top_k": 50,
+  "do_sample": true,
+  "eos_token_id": 50256,
+  "pad_token_id": 50256,
+  "repetition_penalty": 1.0
+}
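
A token id is only usable if it can index the embedding table, so every `*_token_id` here should lie in `[0, vocab_size)`. The check below is a minimal sketch with a hypothetical helper name; it uses the values from this file and `vocab_size` from `config.json`.

```python
def out_of_range_token_ids(gen_cfg: dict, vocab_size: int) -> dict:
    """Return every *_token_id entry that cannot index a vocab_size-row embedding."""
    bad = {}
    for key, value in gen_cfg.items():
        if key.endswith("_token_id") and isinstance(value, int):
            if not 0 <= value < vocab_size:
                bad[key] = value
    return bad


gen_cfg = {
    "max_new_tokens": 120,
    "temperature": 0.8,
    "top_k": 50,
    "do_sample": True,
    "eos_token_id": 50256,
    "pad_token_id": 50256,
    "repetition_penalty": 1.0,
}
bad = out_of_range_token_ids(gen_cfg, vocab_size=50000)
print(bad)
```

Both ids are flagged: 50256 is the `<|endoftext|>` id in GPT-2's 50,257-entry vocabulary, but it is outside a 50,000-entry one. The correct id for this tokenizer is not visible in the diff; it would need to be read from `tokenizer.json`. The custom `generate` in `modeling_tinybuddy.py` never reads these fields, so the mismatch would only surface in code that does.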

merges.txt ADDED

The diff for this file is too large to render. See raw diff

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:16355bf51fd05e9425e5139d8b592a754f80545e521bdb16fd2c5474dde48d19
+size 121494456

modeling_tinybuddy.py ADDED

	@@ -0,0 +1,169 @@

+"""
+Tiny GPT-style transformer (~30M params target).
+Config:
+  - 6 layers
+  - 8 heads
+  - d_model = 256
+  - vocab_size = 50000 (chosen to push param count up to ~30M, since the
+    transformer blocks themselves only have ~4.7M params at d_model=256/L=6;
+    the embedding + untied LM head dominates the parameter budget.)
+Parameter accounting (defaults, block_size = 512):
+  Token embedding  : 50000 * 256                    = 12,800,000
+  LM head (untied) : 256   * 50000 (no bias)        = 12,800,000
+  Positional emb   : 512   * 256                    =    131,072
+  Per block (x6):
+    attn (qkv+out) : 4 * 256 * 256 + 4*256          =    263,168
+    mlp  (2 linear): 256*1024 + 1024 + 1024*256+256 =    525,568
+    2x LayerNorm   : 4 * 256                        =      1,024
+    block total                                     =    789,760
+  Blocks total     : 6 * 789,760                    =  4,738,560
+  Final LN         : 2 * 256                        =        512
+  ---------------------------------------------------------
+  TOTAL ~ 17.7M (tied) or ~30.5M (untied lm head)  -> ~30M ✓
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from dataclasses import dataclass
+@dataclass
+class GPTConfig:
+    vocab_size: int = 50000
+    block_size: int = 512      # max context length
+    n_layer: int = 6
+    n_head: int = 8
+    n_embd: int = 256
+    mlp_ratio: int = 4         # hidden = mlp_ratio * n_embd
+    dropout: float = 0.0
+    tie_weights: bool = False  # False -> ~30M params; True -> ~17.7M
+class CausalSelfAttention(nn.Module):
+    def __init__(self, cfg: GPTConfig):
+        super().__init__()
+        assert cfg.n_embd % cfg.n_head == 0
+        self.n_head = cfg.n_head
+        self.n_embd = cfg.n_embd
+        self.head_dim = cfg.n_embd // cfg.n_head
+        self.qkv = nn.Linear(cfg.n_embd, 3 * cfg.n_embd, bias=True)
+        self.proj = nn.Linear(cfg.n_embd, cfg.n_embd, bias=True)
+        self.drop = nn.Dropout(cfg.dropout)
+        # No explicit mask buffer: SDPA applies causal masking via is_causal=True.
+    def forward(self, x):
+        B, T, C = x.shape
+        qkv = self.qkv(x)
+        q, k, v = qkv.split(self.n_embd, dim=2)
+        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
+        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
+        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
+        # use PyTorch's fused SDPA (faster on CPU than manual)
+        y = F.scaled_dot_product_attention(q, k, v, is_causal=True,
+                                           dropout_p=self.drop.p if self.training else 0.0)
+        y = y.transpose(1, 2).contiguous().view(B, T, C)
+        return self.proj(y)
+class MLP(nn.Module):
+    def __init__(self, cfg: GPTConfig):
+        super().__init__()
+        hidden = cfg.mlp_ratio * cfg.n_embd
+        self.fc1 = nn.Linear(cfg.n_embd, hidden, bias=True)
+        self.fc2 = nn.Linear(hidden, cfg.n_embd, bias=True)
+        self.drop = nn.Dropout(cfg.dropout)
+    def forward(self, x):
+        return self.drop(self.fc2(F.gelu(self.fc1(x))))
+class Block(nn.Module):
+    def __init__(self, cfg: GPTConfig):
+        super().__init__()
+        self.ln1 = nn.LayerNorm(cfg.n_embd)
+        self.attn = CausalSelfAttention(cfg)
+        self.ln2 = nn.LayerNorm(cfg.n_embd)
+        self.mlp = MLP(cfg)
+    def forward(self, x):
+        x = x + self.attn(self.ln1(x))
+        x = x + self.mlp(self.ln2(x))
+        return x
+class TinyGPT(nn.Module):
+    def __init__(self, cfg: GPTConfig):
+        super().__init__()
+        self.cfg = cfg
+        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.n_embd)
+        self.pos_emb = nn.Embedding(cfg.block_size, cfg.n_embd)
+        self.drop = nn.Dropout(cfg.dropout)
+        self.blocks = nn.ModuleList([Block(cfg) for _ in range(cfg.n_layer)])
+        self.ln_f = nn.LayerNorm(cfg.n_embd)
+        self.lm_head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
+        if cfg.tie_weights:
+            self.lm_head.weight = self.tok_emb.weight
+        self.apply(self._init_weights)
+    @staticmethod
+    def _init_weights(m):
+        if isinstance(m, nn.Linear):
+            nn.init.normal_(m.weight, mean=0.0, std=0.02)
+            if m.bias is not None:
+                nn.init.zeros_(m.bias)
+        elif isinstance(m, nn.Embedding):
+            nn.init.normal_(m.weight, mean=0.0, std=0.02)
+    def num_params(self, non_embedding=False):
+        n = sum(p.numel() for p in self.parameters())
+        if non_embedding:
+            n -= self.tok_emb.weight.numel() + self.pos_emb.weight.numel()
+            if not self.cfg.tie_weights:
+                n -= self.lm_head.weight.numel()
+        return n
+    def forward(self, idx, targets=None):
+        B, T = idx.shape
+        assert T <= self.cfg.block_size, f"sequence length {T} > block_size {self.cfg.block_size}"
+        pos = torch.arange(T, device=idx.device)
+        x = self.tok_emb(idx) + self.pos_emb(pos)[None, :, :]
+        x = self.drop(x)
+        for blk in self.blocks:
+            x = blk(x)
+        x = self.ln_f(x)
+        logits = self.lm_head(x)
+        loss = None
+        if targets is not None:
+            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
+                                   targets.view(-1), ignore_index=-100)
+        return logits, loss
+    @torch.no_grad()
+    def generate(self, idx, max_new_tokens=100, temperature=1.0, top_k=None):
+        """Sample max_new_tokens ids after idx (shape [B, T]); does not stop at EOS."""
+        self.eval()
+        for _ in range(max_new_tokens):
+            # crop the context to the last block_size tokens
+            idx_cond = idx if idx.size(1) <= self.cfg.block_size else idx[:, -self.cfg.block_size:]
+            logits, _ = self(idx_cond)
+            # keep only the last position and apply temperature
+            logits = logits[:, -1, :] / max(temperature, 1e-6)
+            if top_k is not None:
+                # mask every logit below the k-th largest one
+                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
+                logits[logits < v[:, [-1]]] = -float("inf")
+            probs = F.softmax(logits, dim=-1)
+            next_id = torch.multinomial(probs, num_samples=1)
+            idx = torch.cat([idx, next_id], dim=1)
+        return idx
+if __name__ == "__main__":
+    cfg = GPTConfig()
+    m = TinyGPT(cfg)
+    total = m.num_params()
+    nonemb = m.num_params(non_embedding=True)
+    print(f"Total params       : {total:,}  (~{total/1e6:.2f}M)")
+    print(f"Non-embedding params: {nonemb:,}  (~{nonemb/1e6:.2f}M)")
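
The sampling step inside `generate` (temperature scaling, top-k masking, softmax, one multinomial draw) can be illustrated without PyTorch. The function below is a standard-library sketch for a single position, not the repository's code; `sample_top_k` and its `rng` parameter are hypothetical names, and it assumes `top_k >= 1`.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one token index from a list of logits using top-k sampling."""
    scaled = [x / max(temperature, 1e-6) for x in logits]
    if top_k is not None:
        k = max(1, min(top_k, len(scaled)))
        threshold = sorted(scaled, reverse=True)[k - 1]   # k-th largest logit
        # Anything below the threshold gets zero probability after softmax.
        scaled = [x if x >= threshold else float("-inf") for x in scaled]
    peak = max(scaled)
    # Unnormalised softmax weights; exp(-inf) evaluates to 0.0.
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(scaled)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0]
draws = {sample_top_k(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(200)}
print(sorted(draws))
```

With `top_k=2` only the two highest-scoring indices can ever be drawn, and `top_k=1` reduces to greedy decoding. As in the PyTorch version, logits tied with the k-th largest value are all kept.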

special_tokens_map.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "unk_token": "<|unk|>",
+  "pad_token": "<|endoftext|>"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "backend": "tokenizers",
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "model_max_length": 512,
+  "pad_token": "<|endoftext|>",
+  "tokenizer_class": "TokenizersBackend",
+  "unk_token": "<|unk|>"
+}

vocab.json ADDED

The diff for this file is too large to render. See raw diff