---
language: en
license: mit
tags:
  - pytorch
  - language-model
  - causal-lm
  - llama-style
  - gqa
  - rope
  - swiglu
  - rmsnorm
  - pretrained-from-scratch
datasets:
  - roneneldan/TinyStories
metrics:
  - perplexity
---

# StoryGPT

A **50M parameter** LLaMA-style decoder-only transformer pre-trained from scratch on the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset.

Built as an end-to-end portfolio showcase of a complete LLM pre-training pipeline, from tokenizer training through model implementation, training, and evaluation.

## Model Description

| Component | Implementation |
|---|---|
| Attention | Grouped Query Attention (GQA) — same as LLaMA 2/3 |
| Position Encoding | Rotary Embeddings (RoPE) |
| Normalization | RMSNorm |
| Activation | SwiGLU FFN |
| Weight Tying | Embedding weight = Output head weight |
| Tokenizer | Custom BPE trained from scratch (16,384 vocab; see the sketch below) |

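The tokenizer training code is not reproduced in this card. A minimal sketch of training a 16,384-token BPE tokenizer with the `tokenizers` library looks roughly like the following; the corpus file name, pre-tokenizer, and special tokens are placeholders, not necessarily the exact settings used for StoryGPT:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE trained from scratch (pre-tokenizer and special tokens are assumptions)
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=16_384,
    special_tokens=["<unk>", "<|endoftext|>"],
)
tokenizer.train(["tinystories_train.txt"], trainer)  # hypothetical plain-text dump of the corpus
tokenizer.save("storygpt_tokenizer.json")
```
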
**Config:**
```
vocab_size    : 16,384
context_length: 512
emb_dim       : 512
n_heads       : 8
n_kv_heads    : 4   (GQA)
n_layers      : 8
ffn_hidden    : 1,376
Parameters    : ~50M
```
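To illustrate how the GQA settings above fit together (this is a standalone sketch, not the repository's attention module): with `n_heads = 8` and `n_kv_heads = 4`, each key/value head is shared by 2 query heads, and `head_dim = 512 / 8 = 64`. The K/V heads are expanded to match the query heads before the usual attention product:

```python
import torch

emb_dim, n_heads, n_kv_heads = 512, 8, 4
head_dim = emb_dim // n_heads          # 64
group_size = n_heads // n_kv_heads     # each KV head serves 2 query heads

batch, seq = 1, 16
q = torch.randn(batch, n_heads,    seq, head_dim)   # full set of query heads
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # fewer key heads
v = torch.randn(batch, n_kv_heads, seq, head_dim)   # fewer value heads

# Expand K/V so every query head has a matching KV head
k = k.repeat_interleave(group_size, dim=1)           # (1, 8, 16, 64)
v = v.repeat_interleave(group_size, dim=1)

# Scaled dot-product attention (causal mask omitted for brevity)
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1) @ v
print(attn.shape)  # torch.Size([1, 8, 16, 64])
```
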

## Training

- **Dataset:** TinyStories (150k stories, ~40M tokens)
- **Steps:** 20,000
- **Optimizer:** AdamW (β=(0.9, 0.95), weight_decay=0.1)
- **LR Schedule:** Cosine decay with linear warmup (500 steps), peak 3e-4 → min 3e-5 (sketched below)
- **Gradient Clipping:** 1.0
- **Mixed Precision:** `torch.cuda.amp` (float16)
- **Hardware:** 2× NVIDIA T4 (DataParallel) on Kaggle
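
The exact scheduler code isn't shown here, but the warmup + cosine schedule described above corresponds to a standard formula along these lines (a sketch, using the hyperparameters from this section):

```python
import math

def lr_at_step(step, peak_lr=3e-4, min_lr=3e-5, warmup_steps=500, total_steps=20_000):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```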

## Results

| Metric | Value |
|---|---|
| Train Loss | 1.36 |
| Val Loss | 1.41 |
| **Perplexity** | **4.09** |
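
The reported perplexity corresponds to the exponential of the validation cross-entropy loss: exp(1.41) ≈ 4.1.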

## Usage

```python
import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Download model and tokenizer
weights_path = hf_hub_download(repo_id="YOUR_HF_USERNAME/StoryGPT", filename="best_model.pt")
tok_path     = hf_hub_download(repo_id="YOUR_HF_USERNAME/StoryGPT", filename="storygpt_tokenizer.json")

tokenizer = Tokenizer.from_file(tok_path)

# Load model (copy model source files locally first)
from StoryGPT.model.gpt import GPT
from StoryGPT.config import MODEL_CONFIG

model = GPT(MODEL_CONFIG)
weights = torch.load(weights_path, map_location="cpu")
# Checkpoints saved from nn.DataParallel carry a "module." prefix on every key
if list(weights.keys())[0].startswith("module."):
    weights = {k.replace("module.", "", 1): v for k, v in weights.items()}
model.load_state_dict(weights)
model.eval()
```
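
Generation utilities are not shown above. A minimal sampling loop, assuming the model returns logits of shape `(batch, seq_len, vocab_size)` when called on a tensor of token ids (verify against the repository's `GPT.forward`), might look like:

```python
import torch

prompt = "Once upon a time"
ids = torch.tensor([tokenizer.encode(prompt).ids], dtype=torch.long)  # (1, seq_len)

max_new_tokens = 100
temperature = 0.8
context_length = 512  # keep inputs within the model's context window

with torch.no_grad():
    for _ in range(max_new_tokens):
        logits = model(ids[:, -context_length:])        # assumed: (1, seq_len, vocab_size)
        next_logits = logits[:, -1, :] / temperature    # logits for the last position
        probs = torch.softmax(next_logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)

print(tokenizer.decode(ids[0].tolist()))
```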

## Sample Output

> *Once upon a time, there was a little boy named Timmy. Timmy loved to play with his toys and go on adventures. One day, he decided to explore the forest near his house...*

## License

MIT