Text Generation
Transformers
Safetensors
English
tinybuddy
tiny-lm
tinystories
educational
built-with-llama
custom_code
Upload 12 files
- README.md +225 -0
- __init__.py +0 -0
- config.json +15 -0
- configuration_tinybuddy.py +17 -0
- generation_config.json +9 -0
- merges.txt +0 -0
- model.safetensors +3 -0
- modeling_tinybuddy.py +169 -0
- special_tokens_map.json +6 -0
- tokenizer.json +0 -0
- tokenizer_config.json +9 -0
- vocab.json +0 -0
README.md
ADDED
@@ -0,0 +1,225 @@
---
license: mit
language:
- en
library_name: transformers
tags:
- text-generation
- tiny-lm
- tinystories
- educational
- built-with-llama
pipeline_tag: text-generation
datasets:
- roneneldan/TinyStories
---

# TinyBuddy-30M

> ⚠️ **Educational / demo model.** TinyBuddy-30M is a from-scratch tiny GPT-style
> language model (~30M parameters) trained for ~12 minutes on a 2-core CPU.
> It is **not** a useful assistant; it is a working end-to-end demonstration
> of the LM training pipeline. See the [Limitations](#limitations) section.

## Model description

TinyBuddy-30M is a small decoder-only Transformer language model trained on a
slice of the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories)
dataset. The architecture is a standard pre-norm GPT-style stack
(LayerNorm + Causal Multi-Head Self-Attention + GELU MLP) inspired by the
LLaMA / GPT family of decoder-only models.

| Hyperparameter | Value |
| --- | --- |
| Parameters | **30,371,840** (~30.37M) |
| Layers | 6 |
| Attention heads | 8 |
| Embedding dim | 256 |
| MLP hidden dim | 1024 (mlp_ratio = 4) |
| Context length (`block_size`) | 512 |
| Vocab size | 50,000 (BPE; ~18k actually used) |
| Activation | GELU |
| Norm | LayerNorm (pre-norm) |
| Attention | Causal SDPA |
| Position embeddings | Learned absolute |
| Weight tying | No (separate LM head) |
| Precision | float32 |

Most of the parameter budget lives in the token embedding + LM head
(~25.6M of 30M). This is typical for small LMs.
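
The parameter budget can be re-derived from the hyperparameters alone. The sketch below is a hand count, assuming biased attention and MLP linears, two LayerNorms per block plus a final one, and a bias-free LM head, which is how `modeling_tinybuddy.py` in this commit defines the layers. `count_params` is a hypothetical helper, not part of the repo.

```python
def count_params(vocab_size=50_000, block_size=512, n_layer=6,
                 n_embd=256, mlp_ratio=4, tie_weights=False):
    """Hand-count parameters of a GPT-style stack (hypothetical helper)."""
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    # qkv projection (n_embd -> 3*n_embd) and output projection, both with bias
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    hidden = mlp_ratio * n_embd
    mlp = (n_embd * hidden + hidden) + (hidden * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)  # weight + bias, twice per block
    block = attn + mlp + layer_norms
    final_ln = 2 * n_embd
    lm_head = 0 if tie_weights else n_embd * vocab_size  # no bias
    return tok_emb + pos_emb + n_layer * block + final_ln + lm_head


print(f"{count_params():,}")                # 30,470,144
print(f"{count_params(block_size=128):,}")  # 30,371,840
```

With `block_size=512` the count comes to 30,470,144. The 30,371,840 figure in the table is reproduced with a 128-row position table, which matches the `block_size=128` used for training. The two embedding-sized matrices account for 25.6M parameters in either case.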

## Training details

- **Data**: ~22 MB slice of TinyStories (`TinyStoriesV2-GPT4-valid.txt`,
  27,630 short children's stories, ~5.3M BPE tokens after tokenization).
- **Tokenizer**: byte-level BPE trained from scratch on the same slice
  (saturated at ~18k merges; embedding padded to 50k to hit the 30M target).
- **Optimizer**: AdamW, β=(0.9, 0.95), weight_decay=0.1, grad clip 1.0.
- **Schedule**: cosine decay from 5e-4 → 5e-5 with 100-step linear warmup.
- **Batch**: `batch_size=4`, `block_size=128` (≈ 512 tokens / step).
- **Steps**: **1,500** (≈ 0.77M tokens seen, roughly **0.2% of one epoch**
  of full TinyStories).
- **Hardware**: 2 CPU cores, ~2 GB RAM, ~**12 minutes** wall time
  (≈16 min including evals).
- **Final loss**: **train ≈ 3.53 / val ≈ 3.43** (~3.55 averaged).
  Perplexity ≈ 30, well above the ≈ 4–5 a properly-trained TinyStories
  model of this size reaches.
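
The schedule bullet describes linear warmup followed by cosine decay. Here is a minimal sketch of that shape, assuming the warmup starts at zero and the cosine spans the remaining steps up to step 1,500. The README states neither detail, and `lr_at` is a hypothetical name.

```python
import math

MAX_LR, MIN_LR = 5e-4, 5e-5
WARMUP_STEPS, TOTAL_STEPS = 100, 1500


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # linear ramp from 0 up to MAX_LR (assumption: starts at zero)
        return MAX_LR * step / WARMUP_STEPS
    # fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (MAX_LR - MIN_LR) * cosine


for s in (0, 50, 100, 800, 1500):
    print(s, f"{lr_at(s):.2e}")

# tokens processed = steps * batch_size * block_size
print(1500 * 4 * 128)  # 768000, the "≈ 0.77M tokens seen" figure
```

The last line reproduces the token count from the **Steps** bullet using the batch and block sizes listed above.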

Loss curve (training log):

```
step    0 | train 10.88 | val 10.88
step  150 | train  4.83 | val  4.68
step  300 | train  4.32 | val  4.28
step  600 | train  3.85 | val  3.90
step  900 | train  3.71 | val  3.77
step 1200 | train  3.57 | val  3.55
step 1500 | train  3.53 | val  3.43
```
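
Perplexity is the exponential of the per-token cross-entropy (in nats), so the logged losses convert directly. The short sketch below uses the values from the log above. The chance-level comparison assumes a uniform prediction over the 50,000-entry vocabulary.

```python
import math

# step -> (train loss, val loss), copied from the training log
log = {0: (10.88, 10.88), 150: (4.83, 4.68), 300: (4.32, 4.28),
       600: (3.85, 3.90), 900: (3.71, 3.77), 1200: (3.57, 3.55),
       1500: (3.53, 3.43)}

for step, (train, val) in log.items():
    print(f"step {step:4d} | train ppl {math.exp(train):9.1f} "
          f"| val ppl {math.exp(val):9.1f}")

# A model that spreads probability uniformly over 50,000 tokens has loss ln(50000).
print(f"{math.log(50_000):.2f}")  # 10.82
```

The step-0 loss of 10.88 sits close to that chance level, which is what an untrained model with small random weights would be expected to show.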

## Usage

This model uses **custom modeling code**, so you must pass
`trust_remote_code=True` when loading it.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "Eeppa/TinyBuddy-30M"  # or local path to this folder

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
model.eval()

prompt = "Once upon a time, there was a little girl named Lily."
# Encode once; `.ids` exists only on a raw `tokenizers` Encoding.
enc = tokenizer.encode(prompt)
input_ids = torch.tensor([enc.ids if hasattr(enc, "ids") else enc])

# TinyBuddy ships a custom `.generate(...)` (top-k sampling). Use it directly:
out = model.generate(input_ids, max_new_tokens=120, temperature=0.8, top_k=50)
print(tokenizer.decode(out[0].tolist()))
```

If you prefer to bypass `transformers` entirely, you can use the raw
`tokenizers` library + the included modeling file:

```python
from tokenizers import Tokenizer
from safetensors.torch import load_file
from modeling_tinybuddy import TinyGPT, GPTConfig
import json, torch

cfg = GPTConfig(**{k: v for k, v in json.load(open("config.json")).items()
                   if k in GPTConfig.__dataclass_fields__})
model = TinyGPT(cfg)
model.load_state_dict(load_file("model.safetensors"))
model.eval()

tok = Tokenizer.from_file("tokenizer.json")
ids = tok.encode("Once upon a time").ids
out = model.generate(torch.tensor([ids]), max_new_tokens=80, temperature=0.8, top_k=50)
print(tok.decode(out[0].tolist()))
```

## Example outputs

**Prompt:** *"Once upon a time, there was a little girl named Lily."*

> Once upon a time, there was a little girl named Lily. They loved to play
> with their parents. One day, Tom went to the park. The sun loved the box
> and had many friends. One day, they went for a small tree, a lot of friends.
> He said, "What is better. But you want to find your friends, Bob?" …

**Prompt:** *"Tom and Sam were playing in the park when"*

> Tom and Sam were playing in the park when they were very much. Once upon a
> time, there was a girl named The cat with her mom. They had a little girl
> named Mia. She loved to play with her friends and play with her mom. …

## Limitations

**Be honest with yourself: this model is bad, and that is expected.**

What works ✅
- Vocabulary & register match TinyStories (short sentences, character names
  like Tim/Lily/Spot, motifs like "Once upon a time", "the park").
- Local grammar is mostly intact (subject–verb–object, quoted dialogue,
  punctuation).
- Document boundaries (`<|endoftext|>`) are respected.

What's broken ❌
- **No narrative coherence** across more than one or two sentences.
- **Character drift**: characters appear, vanish, or swap names mid-story.
- **Pronoun confusion** ("They" referring to a single girl).
- **Ungrammatical fragments** ("She found a very happy.").
- **Repetition loops** ("play with X. play with Y. play with Z.").
- **No factual knowledge, no reasoning, no instruction following.**

### Why

| Factor | This model | A good TinyStories-class model |
| --- | --- | --- |
| Tokens seen | ~0.77 M | ~10⁹+ |
| Hardware | 2 CPU cores | 1+ GPUs |
| Wall time | ~12 min | many hours |
| Final loss | ~3.5 | ~1.3–1.6 |
| Perplexity | ~30 | ~4–5 |

This is roughly **3–4 orders of magnitude less compute** than a serious
TinyStories training run. The architecture and pipeline are correct; only
the optimization budget is tiny.
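
The gap in the table can be put on a log scale. The sketch below uses the table's token counts and treats tokens seen as a proxy for training compute at fixed model size, which is a simplifying assumption. Faster hardware and longer wall time are not modelled.

```python
import math

tokens_this_model = 0.77e6  # "Tokens seen" for this model
tokens_reference = 1e9      # lower bound of the "~10⁹+" entry

gap = math.log10(tokens_reference / tokens_this_model)
print(f"{gap:.1f} orders of magnitude")  # 3.1 orders of magnitude

# Loss gap expressed as a perplexity ratio: exp(3.5 - reference loss)
for ref_loss in (1.3, 1.6):
    print(f"reference loss {ref_loss}: {math.exp(3.5 - ref_loss):.1f}x higher perplexity")
```

Because the reference entry is a lower bound, the token gap alone is at least about three orders of magnitude.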

### Intended use

- ✅ Educational reference for building / training / packaging a small LM.
- ✅ Sanity-checking a training pipeline.
- ✅ Demonstrating safetensors + Hugging Face Hub packaging.
- ❌ **Not** for any production, user-facing, or assistive use case.
- ❌ **Not** a source of factual information.
- ❌ **Not** safe for inputs from untrusted users (no safety training).

## Bias, risks, and safety

The training data is TinyStories: synthetic children's stories generated
by GPT-3.5/4. The model has not undergone any safety, RLHF, or
instruction-tuning step. It may produce nonsensical, biased, or repetitive
output, and should not be deployed in any setting where output quality or
safety matters.

## License

MIT.

## Citation

If you use this code or model in teaching materials, please cite as:

```
@misc{tinybuddy30m,
  title  = {TinyBuddy-30M: a from-scratch ~30M-parameter transformer trained on TinyStories},
  year   = {2026},
  note   = {Educational demonstration model.}
}
```

And please cite TinyStories:

```
@article{eldan2023tinystories,
  title   = {TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
  author  = {Eldan, Ronen and Li, Yuanzhi},
  journal = {arXiv preprint arXiv:2305.07759},
  year    = {2023}
}
```

## Built with Llama

This model's architecture is inspired by the LLaMA family of decoder-only
transformer language models: a pre-norm stack of causal multi-head
self-attention and MLP blocks. It differs in the details, using LayerNorm, a
GELU MLP, and learned absolute position embeddings where LLaMA uses RMSNorm,
a SwiGLU MLP, and rotary position embeddings. The implementation is
from-scratch PyTorch and does not include any LLaMA weights.

**Built with Llama.**
__init__.py
ADDED
File without changes
config.json
ADDED
@@ -0,0 +1,15 @@
{
  "vocab_size": 50000,
  "block_size": 512,
  "n_layer": 6,
  "n_head": 8,
  "n_embd": 256,
  "mlp_ratio": 4,
  "dropout": 0.0,
  "tie_weights": false,
  "architectures": ["TinyGPT"],
  "auto_map": {
    "AutoModelForCausalLM": "modeling_tinybuddy.TinyGPT"
  },
  "torch_dtype": "float32"
}
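
A few derived quantities follow from this config. The sketch below inlines the JSON as a string (with `auto_map` left out for brevity) so that it runs on its own. `derived` is a hypothetical helper, and its divisibility check mirrors the assertion in `modeling_tinybuddy.py`.

```python
import json

CONFIG_JSON = """
{"vocab_size": 50000, "block_size": 512, "n_layer": 6, "n_head": 8,
 "n_embd": 256, "mlp_ratio": 4, "dropout": 0.0, "tie_weights": false,
 "architectures": ["TinyGPT"], "torch_dtype": "float32"}
"""


def derived(cfg: dict) -> dict:
    """Per-head and MLP widths implied by the config (hypothetical helper)."""
    if cfg["n_embd"] % cfg["n_head"] != 0:
        raise ValueError("n_embd must be divisible by n_head")
    return {
        "head_dim": cfg["n_embd"] // cfg["n_head"],
        "mlp_hidden": cfg["mlp_ratio"] * cfg["n_embd"],
    }


cfg = json.loads(CONFIG_JSON)
print(derived(cfg))  # {'head_dim': 32, 'mlp_hidden': 1024}
```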
configuration_tinybuddy.py
ADDED
@@ -0,0 +1,17 @@
"""
Configuration class for TinyBuddy-30M.
"""

from dataclasses import dataclass


@dataclass
class GPTConfig:
    vocab_size: int = 50000
    block_size: int = 512     # max context length
    n_layer: int = 6
    n_head: int = 8
    n_embd: int = 256
    mlp_ratio: int = 4        # hidden = mlp_ratio * n_embd
    dropout: float = 0.0
    tie_weights: bool = False # False -> ~30M params; True -> ~17.7M
generation_config.json
ADDED
@@ -0,0 +1,9 @@
{
  "max_new_tokens": 120,
  "temperature": 0.8,
  "top_k": 50,
  "do_sample": true,
  "eos_token_id": 50256,
  "pad_token_id": 50256,
  "repetition_penalty": 1.0
}
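
Token ids in a generation config index rows of the embedding table, so each one should be smaller than `vocab_size`. Below is a minimal sanity check, with the values copied from `generation_config.json` and `config.json` in this commit. `out_of_range_ids` is a hypothetical helper.

```python
def out_of_range_ids(gen_cfg: dict, vocab_size: int) -> dict:
    """Return the *_token_id entries that do not fit in [0, vocab_size)."""
    return {
        key: value
        for key, value in gen_cfg.items()
        if key.endswith("_token_id") and not 0 <= value < vocab_size
    }


gen_cfg = {"max_new_tokens": 120, "temperature": 0.8, "top_k": 50,
           "do_sample": True, "eos_token_id": 50256, "pad_token_id": 50256,
           "repetition_penalty": 1.0}

print(out_of_range_ids(gen_cfg, vocab_size=50_000))
# {'eos_token_id': 50256, 'pad_token_id': 50256}
```

With these values both ids fall outside a 50,000-entry vocabulary. 50256 is the `<|endoftext|>` id in the GPT-2 vocabulary, so the value may have been carried over from a GPT-2 default. The id that the trained tokenizer assigns to `<|endoftext|>` is not visible in this diff.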
merges.txt
ADDED
The diff for this file is too large to render.
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:16355bf51fd05e9425e5139d8b592a754f80545e521bdb16fd2c5474dde48d19
size 121494456
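
The three lines above are a Git LFS pointer, not the weights themselves; `size` is the byte length of the real file. The sketch below parses the pointer and compares that size against the README's float32 parameter count. `parse_lfs_pointer` is a hypothetical helper, and attributing the remainder to the safetensors header is an assumption.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:16355bf51fd05e9425e5139d8b592a754f80545e521bdb16fd2c5474dde48d19
size 121494456
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    fields["size"] = int(fields["size"])
    return fields


ptr = parse_lfs_pointer(POINTER)
BYTES_PER_FLOAT32 = 4
tensor_bytes = 30_371_840 * BYTES_PER_FLOAT32  # parameter count from the README
print(ptr["size"] - tensor_bytes)              # 7096
```

The file is about 7 KB larger than the raw float32 tensors, which is consistent with 30,371,840 stored parameters plus a small metadata header.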
modeling_tinybuddy.py
ADDED
@@ -0,0 +1,169 @@
"""
Tiny GPT-style transformer (~30M params target).

Config:
  - 6 layers
  - 8 heads
  - d_model = 256
  - vocab_size = 50000 (chosen to push param count up to ~30M, since the
    transformer blocks themselves only have ~5M params at d_model=256/L=6;
    the embedding + untied LM head dominate the parameter budget.)

Parameter accounting (approx):
  Token embedding  : 50000 * 256                     = 12,800,000
  LM head (untied) : 256 * 50000 (no bias)           = 12,800,000
  Positional emb   : 512 * 256                       =    131,072
  Per block (x6):
    attn (qkv+out) : 4 * 256 * 256 + 4*256           =    263,168
    mlp  (2 linear): 256*1024 + 1024 + 1024*256+256  =    525,568
    2x LayerNorm   : 4 * 256                         =      1,024
    block total                                      =    789,760
  Blocks total     : 6 * 789,760                     =  4,738,560
  Final LN         :                                         512
  ---------------------------------------------------------
  TOTAL ~ 17.7M (tied) or ~30.5M (untied lm head)  -> ~30M ✓
"""

import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass


@dataclass
class GPTConfig:
    vocab_size: int = 50000
    block_size: int = 512     # max context length
    n_layer: int = 6
    n_head: int = 8
    n_embd: int = 256
    mlp_ratio: int = 4        # hidden = 4 * n_embd
    dropout: float = 0.0
    tie_weights: bool = False # False -> ~30M params; True -> ~17.7M


class CausalSelfAttention(nn.Module):
    def __init__(self, cfg: GPTConfig):
        super().__init__()
        assert cfg.n_embd % cfg.n_head == 0
        self.n_head = cfg.n_head
        self.n_embd = cfg.n_embd
        self.head_dim = cfg.n_embd // cfg.n_head
        self.qkv = nn.Linear(cfg.n_embd, 3 * cfg.n_embd, bias=True)
        self.proj = nn.Linear(cfg.n_embd, cfg.n_embd, bias=True)
        self.drop = nn.Dropout(cfg.dropout)
        # causal mask (unused in forward; SDPA's is_causal=True does the masking)
        mask = torch.tril(torch.ones(cfg.block_size, cfg.block_size)).bool()
        self.register_buffer("mask", mask, persistent=False)

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.qkv(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        # use PyTorch's fused SDPA (faster on CPU than manual)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True,
                                           dropout_p=self.drop.p if self.training else 0.0)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)


class MLP(nn.Module):
    def __init__(self, cfg: GPTConfig):
        super().__init__()
        hidden = cfg.mlp_ratio * cfg.n_embd
        self.fc1 = nn.Linear(cfg.n_embd, hidden, bias=True)
        self.fc2 = nn.Linear(hidden, cfg.n_embd, bias=True)
        self.drop = nn.Dropout(cfg.dropout)

    def forward(self, x):
        return self.drop(self.fc2(F.gelu(self.fc1(x))))


class Block(nn.Module):
    def __init__(self, cfg: GPTConfig):
        super().__init__()
        self.ln1 = nn.LayerNorm(cfg.n_embd)
        self.attn = CausalSelfAttention(cfg)
        self.ln2 = nn.LayerNorm(cfg.n_embd)
        self.mlp = MLP(cfg)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x


class TinyGPT(nn.Module):
    def __init__(self, cfg: GPTConfig):
        super().__init__()
        self.cfg = cfg
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.n_embd)
        self.pos_emb = nn.Embedding(cfg.block_size, cfg.n_embd)
        self.drop = nn.Dropout(cfg.dropout)
        self.blocks = nn.ModuleList([Block(cfg) for _ in range(cfg.n_layer)])
        self.ln_f = nn.LayerNorm(cfg.n_embd)
        self.lm_head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
        if cfg.tie_weights:
            self.lm_head.weight = self.tok_emb.weight
        self.apply(self._init_weights)

    @staticmethod
    def _init_weights(m):
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, mean=0.0, std=0.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.Embedding):
            nn.init.normal_(m.weight, mean=0.0, std=0.02)

    def num_params(self, non_embedding=False):
        n = sum(p.numel() for p in self.parameters())
        if non_embedding:
            n -= self.tok_emb.weight.numel() + self.pos_emb.weight.numel()
            if not self.cfg.tie_weights:
                n -= self.lm_head.weight.numel()
        return n

    def forward(self, idx, targets=None):
        B, T = idx.shape
        assert T <= self.cfg.block_size, f"sequence length {T} > block_size {self.cfg.block_size}"
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)[None, :, :]
        x = self.drop(x)
        for blk in self.blocks:
            x = blk(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   targets.view(-1), ignore_index=-100)
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens=100, temperature=1.0, top_k=None):
        self.eval()
        for _ in range(max_new_tokens):
            idx_cond = idx if idx.size(1) <= self.cfg.block_size else idx[:, -self.cfg.block_size:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / max(temperature, 1e-6)
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float("inf")
            probs = F.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_id], dim=1)
        return idx


if __name__ == "__main__":
    cfg = GPTConfig()
    m = TinyGPT(cfg)
    total = m.num_params()
    nonemb = m.num_params(non_embedding=True)
    print(f"Total params        : {total:,} (~{total/1e6:.2f}M)")
    print(f"Non-embedding params: {nonemb:,} (~{nonemb/1e6:.2f}M)")
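
The sampling step inside `generate` (temperature scaling, top-k filtering, softmax, one multinomial draw) can be illustrated without PyTorch. The following is a standard-library re-implementation for a single logit vector, meant as an illustration and not as a drop-in replacement. `sample_top_k` is a hypothetical name.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one token id from logits after temperature and top-k filtering."""
    scaled = [x / max(temperature, 1e-6) for x in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        threshold = sorted(scaled, reverse=True)[k - 1]
        # everything strictly below the k-th largest logit gets zero probability
        scaled = [x if x >= threshold else -math.inf for x in scaled]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # unnormalised softmax
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
draws = [sample_top_k(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))  # [0, 1]
```

Only the two highest-scoring ids are ever drawn with `top_k=2`. Lowering `temperature` shifts more of the remaining probability mass onto id 0.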
special_tokens_map.json
ADDED
@@ -0,0 +1,6 @@
{
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "unk_token": "<|unk|>",
  "pad_token": "<|endoftext|>"
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,9 @@
{
  "backend": "tokenizers",
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "model_max_length": 512,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "TokenizersBackend",
  "unk_token": "<|unk|>"
}
vocab.json
ADDED
The diff for this file is too large to render.