JaydeepR committed
Commit 2539abc · verified · 1 Parent(s): c6b052f

Create README.md

Files changed (1)
  1. README.md +75 -0
README.md ADDED
@@ -0,0 +1,75 @@
+
+ # Diffusion LM — TinyStories
+
+ A masked-diffusion language model trained from scratch on the
+ [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset.
+
+ ## Demo
+
+ ![Diffusion inference](inference.gif)
+
+ ## Architecture
+
+ | Hyperparameter | Value |
+ |---|---|
+ | Parameters | ~45M |
+ | Hidden dim | 512 |
+ | Layers | 10 |
+ | Heads | 8 |
+ | FFN dim | 2048 |
+ | Diffusion steps T | 128 |
+ | Sequence length | 256 |
+ | Vocab size | 26,000 |
+
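As a sanity check on the ~45M figure, the table's dimensions can be plugged into a back-of-the-envelope weight count. This is only a sketch: it assumes a standard Transformer stack with tied input/output embeddings and ignores biases, layer norms and position embeddings, none of which the README confirms. The function name `estimate_params` is hypothetical.

```python
def estimate_params(vocab=26_000, d_model=512, n_layers=10, d_ff=2048):
    """Rough weight count for a standard Transformer stack.

    Assumptions (not confirmed by the README): tied input/output embeddings;
    biases, layer norms and position embeddings are ignored.
    """
    embedding = vocab * d_model          # token embedding table
    attention = 4 * d_model * d_model    # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff    # up- and down-projection
    return embedding + n_layers * (attention + feed_forward)


print(f"{estimate_params() / 1e6:.1f}M")  # prints 44.8M
```

Under these assumptions the estimate lands at roughly 44.8M, which is consistent with the stated ~45M.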
+ ## How it works
+
+ This is a **masked diffusion** language model. Instead of generating
+ tokens left-to-right like a standard autoregressive LM, it starts with a
+ fully masked sequence and progressively unmasks tokens over T diffusion steps.
+
+ At each step the model predicts all masked tokens simultaneously, keeps
+ the most confident predictions, and re-masks the rest. Repeating this
+ gradually refines the output until the sequence is fully unmasked.
+
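The unmasking loop described above can be sketched in plain Python. This is an illustrative sketch, not the repository's sampler: the linear unmasking schedule, the `MASK` id, and the `predict` interface (a callable returning a `(token_id, confidence)` pair for every position) are all assumptions, and `toy_predict` is a stand-in for the real network.

```python
MASK = -1  # hypothetical mask token id


def unmask_schedule(seq_len, num_steps):
    """Cumulative number of unmasked tokens after each step (linear, assumed)."""
    return [(seq_len * (t + 1) + num_steps - 1) // num_steps
            for t in range(num_steps)]


def sample(predict, seq_len, num_steps):
    """Start fully masked; at each step commit only the most confident predictions."""
    tokens = [MASK] * seq_len
    for target in unmask_schedule(seq_len, num_steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        preds = predict(tokens)  # (token_id, confidence) for every position
        n_commit = target - (seq_len - len(masked))
        # Keep the n_commit most confident masked positions; the rest stay masked.
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_commit]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


def toy_predict(tokens):
    """Stand-in model: predicts token id == position, confidence falls with position."""
    n = len(tokens)
    return [(i, 1.0 - i / n) for i in range(n)]


output = sample(toy_predict, seq_len=256, num_steps=128)
```

With a sequence length of 256 and T = 128, this linear schedule commits two tokens per step. A real sampler may use a different schedule or allow already-committed tokens to be revised.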
+ ## Training
+
+ - Dataset: 1M TinyStories examples
+ - Training steps: 60,000
+ - Effective batch size: 64 (batch 32 × grad accum 2)
+ - Optimizer: AdamW
+ - Learning rate: 2e-4 with cosine schedule and 1,000 warmup steps
+ - Weight decay: 0.1
+ - Mixed precision: bf16
+ - Hardware: NVIDIA RTX 3090 (24 GB)
+
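The learning-rate settings listed above could correspond to a schedule like the following. This is a sketch under assumptions: the README does not say whether warmup is linear or what the final learning rate is, so linear warmup and decay to zero are assumed here, and `lr_at` is a hypothetical helper name.

```python
import math


def lr_at(step, peak_lr=2e-4, warmup=1_000, total=60_000, min_lr=0.0):
    """Learning rate at a given optimizer step.

    Assumed shape: linear warmup from 0 to peak_lr, then cosine decay
    to min_lr at the final step.
    """
    if step < warmup:
        return peak_lr * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1_000, 30_500, 60_000):
    print(step, f"{lr_at(step):.2e}")
```

In this sketch the rate peaks at 2e-4 at step 1,000 and is halved at step 30,500, the midpoint of the cosine phase. Each optimizer step consumes 32 × 2 = 64 sequences, matching the effective batch size above.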
+ ## Evaluation
+
+ Validation loss (cross-entropy on masked tokens, 20 batches of held-out TinyStories):
+
+ | Step | Val Loss |
+ |------|----------|
+ | 5,000 | 6.0313 |
+ | 10,000 | 5.9045 |
+ | 15,000 | 5.6092 |
+ | 20,000 | 4.4481 |
+ | 25,000 | 3.8447 |
+ | 30,000 | 3.6634 |
+ | 35,000 | 3.5419 |
+ | 40,000 | 3.3554 |
+ | 45,000 | 3.2779 |
+ | 50,000 | 3.1767 |
+ | 55,000 | 3.1012 |
+ | 60,000 | 3.1067 |
+
+ The sharp drop between steps 15,000 and 25,000 (5.6092 to 3.8447) likely
+ reflects the model learning basic language structure. The loss plateaus at
+ about 3.10 from step 55,000 onward (3.1012 at 55,000, 3.1067 at 60,000).
+
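To make the metric concrete, here is a minimal sketch of cross-entropy averaged over masked positions only. It is written in plain Python for readability and is not the repository's evaluation code; the function name and the argument layout (one list of logits per position, plus a boolean mask) are assumptions.

```python
import math


def masked_cross_entropy(logits, targets, is_masked):
    """Mean negative log-likelihood over masked positions only.

    logits:    one list of per-token scores for each position
    targets:   the true token id at each position
    is_masked: True where the input token was masked (and is therefore scored)
    """
    total, count = 0.0, 0
    for row, target, masked in zip(logits, targets, is_masked):
        if not masked:
            continue  # unmasked positions do not contribute to the loss
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


# Two positions, vocabulary of 4; only the first position is masked.
loss = masked_cross_entropy(
    logits=[[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]],
    targets=[2, 1],
    is_masked=[True, False],
)
```

For scale, a uniform guess over the 26,000-token vocabulary would score ln(26,000) ≈ 10.17 under this metric.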
+ ## Files
+
+ | File | Description |
+ |---|---|
+ | `model.pt` | Model weights (PyTorch state dict) |
+ | `config.json` | Architecture hyperparameters |
+ | `tokenizer/` | Byte-level BPE tokenizer |
+ | `val_loss_history.json` | Validation loss curve |
+ | `inference.gif` | Visualisation of progressive unmasking |