Harley-ml committed on
Commit b0dee8c · verified · 1 Parent(s): 22f8594

Update README.md

Files changed (1)
  1. README.md +82 -4
README.md CHANGED
@@ -14,9 +14,9 @@ tags:
  - pytorch
  ---
 
- # Dillion
 
- ## Summary
 
  ```
  Task: Text-Generation
@@ -28,7 +28,85 @@ Framework: PyTorch, transformers
  Author: Paul Courneya (Harley-ml)
  ```
 
- ## Description
 
  Dillion is a 1.2M parameter language model trained on ~9B tokens of FineWeb-edu. Our goal was to make one of the best sub-1.5M parameter LMs through depth (12 layers!) and huge overtraining (~8900 tokens per parameter.)
- Dillion beats or ties with models much larger than itself such as SupraMini-v4-2M and Tenete-8M.
 
@@ -14,9 +14,9 @@ tags:
  - pytorch
  ---
 
+ # **Dillion**
 
+ ## **Summary**
 
  ```
  Task: Text-Generation
@@ -28,7 +28,85 @@ Framework: PyTorch, transformers
  Author: Paul Courneya (Harley-ml)
  ```
 
+ ## **Description**
 
  Dillion is a 1.2M parameter language model trained on ~9B tokens of FineWeb-edu. Our goal was to make one of the best sub-1.5M parameter LMs through depth (12 layers!) and huge overtraining (~8900 tokens per parameter.)
+ Dillion beats or ties with models much larger than itself such as SupraMini-v4-2M and Tenete-8M.
+
+ ## Architecture
+
+ Dillion-1.2M uses the Qwen3.5 architecture.
+
+ | Parameter | Value |
+ | ------------------------- | ---------------- |
+ | `NUM_HIDDEN_LAYERS` | `12` |
+ | `MAX_WINDOW_LAYERS` | `12` |
+ | `HIDDEN_SIZE` | `72` |
+ | `NUM_ATTENTION_HEADS` | `3` |
+ | `NUM_KEY_VALUE_HEADS` | `3` |
+ | `VOCAB_SIZE` | `3076` |
+ | `INTERMEDIATE_SIZE` | `288` |
+ | `ROPE_THETA` | `10000.0` |
+ | `MAX_POSITION_EMBEDDINGS` | `384` |
+ | `LAYER_TYPES` | `full_attention` |
+
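
The architecture table can be cross-checked against the parameter count the README reports for Dillion in its Benchmarks section. The sketch below is only a back-of-the-envelope tally, and several details are assumptions that the README does not state: no bias terms, tied input and output embeddings, a head dimension of `HIDDEN_SIZE // NUM_ATTENTION_HEADS`, weight-only RMSNorm layers, per-head query/key norms, and a query projection that is twice as wide because it also produces an attention output gate. The function name `count_parameters` is hypothetical.

```python
# Hypothetical parameter tally for the configuration table in this README.
# Every structural choice below is an assumption, not something the README confirms.

HIDDEN_SIZE = 72
NUM_HIDDEN_LAYERS = 12
NUM_ATTENTION_HEADS = 3
NUM_KEY_VALUE_HEADS = 3
INTERMEDIATE_SIZE = 288
VOCAB_SIZE = 3076


def count_parameters() -> dict:
    """Return a rough parameter breakdown under the stated assumptions."""
    head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS  # assumed: 72 // 3 = 24
    q_width = NUM_ATTENTION_HEADS * head_dim
    kv_width = NUM_KEY_VALUE_HEADS * head_dim

    # Assumed tied embeddings: the output head reuses this matrix.
    embedding = VOCAB_SIZE * HIDDEN_SIZE

    attention = (
        HIDDEN_SIZE * 2 * q_width      # q_proj, assumed to emit query + gate
        + 2 * HIDDEN_SIZE * kv_width   # k_proj and v_proj
        + q_width * HIDDEN_SIZE        # o_proj
        + 2 * head_dim                 # assumed per-head q_norm and k_norm
    )
    mlp = 3 * HIDDEN_SIZE * INTERMEDIATE_SIZE  # gate, up, and down projections
    norms = 2 * HIDDEN_SIZE                    # two RMSNorm weights per layer

    per_layer = attention + mlp + norms
    total = embedding + NUM_HIDDEN_LAYERS * per_layer + HIDDEN_SIZE  # + final norm
    return {"embedding": embedding, "per_layer": per_layer, "total": total}


if __name__ == "__main__":
    for name, value in count_parameters().items():
        print(f"{name}: {value:,}")
```

Under these assumptions the total comes to 1,281,384, which equals the figure listed for Dillion in the Benchmarks table. That agreement is suggestive, but it does not prove that each assumption holds.
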
+ ## Training
+
+ ### Hardware
+
+ We trained Dillion for 0.71 epochs over 14B tokens of FineWeb-edu (so the model saw only ~9B tokens) on an RTX 2060 (6 GB) with a batch size of 72 and 4 gradient accumulation steps.
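
To make the batch settings concrete, here is a small arithmetic sketch. It assumes that the batch size of 72 is the per-step micro-batch and that every sequence is packed to the full `MAX_POSITION_EMBEDDINGS` of 384 tokens; the README states neither, and the helper names are hypothetical.

```python
# Arithmetic sketch of the effective batch implied by the training setup.
# SEQ_LEN is an assumption: the README gives only the maximum position count.

BATCH_SIZE = 72
GRAD_ACCUM_STEPS = 4
SEQ_LEN = 384  # assumed equal to MAX_POSITION_EMBEDDINGS


def effective_batch(batch_size: int, grad_accum_steps: int, seq_len: int) -> tuple:
    """Return (sequences, tokens) consumed per optimizer step."""
    sequences = batch_size * grad_accum_steps
    return sequences, sequences * seq_len


def optimizer_steps(total_tokens: int, tokens_per_step: int) -> int:
    """Number of full optimizer steps needed to consume total_tokens."""
    return total_tokens // tokens_per_step


if __name__ == "__main__":
    sequences, tokens = effective_batch(BATCH_SIZE, GRAD_ACCUM_STEPS, SEQ_LEN)
    print(f"sequences per optimizer step: {sequences}")
    print(f"tokens per optimizer step: {tokens:,}")
    print(f"steps for ~9B tokens: {optimizer_steps(9_000_000_000, tokens):,}")
```

With these assumptions each optimizer step covers 288 sequences, or 110,592 tokens, so ~9B tokens corresponds to roughly 81 thousand optimizer steps.
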
+
+ ### Training Results
+
+
+ | epoch | train_loss | train_ppl | train_bpb | eval_loss | eval_ppl | eval_bpb |
+ | ------- | ---------: | --------: | --------: | --------: | -------: | -------: |
+ | 0.02368 | 4.553 | 94.917 | 1.875 | 4.492 | 89.300 | 1.850 |
+ | 0.04736 | 3.958 | 52.353 | 1.630 | 3.943 | 51.573 | 1.624 |
+ | 0.07104 | 3.763 | 43.077 | 1.550 | 3.758 | 42.863 | 1.548 |
+ | 0.09472 | 3.672 | 39.330 | 1.512 | 3.670 | 39.252 | 1.511 |
+ | 0.11840 | 3.620 | 37.338 | 1.491 | 3.620 | 37.338 | 1.491 |
+ | 0.14210 | 3.584 | 36.017 | 1.476 | 3.586 | 36.089 | 1.477 |
+ | 0.16580 | 3.557 | 35.058 | 1.465 | 3.558 | 35.093 | 1.465 |
+ | 0.18940 | 3.538 | 34.398 | 1.457 | 3.536 | 34.329 | 1.456 |
+ | 0.21310 | 3.520 | 33.784 | 1.450 | 3.520 | 33.784 | 1.450 |
+ | 0.23680 | 3.504 | 33.248 | 1.443 | 3.507 | 33.348 | 1.444 |
+ | 0.26050 | 3.494 | 32.917 | 1.439 | 3.494 | 32.917 | 1.439 |
+ | 0.28420 | 3.483 | 32.557 | 1.434 | 3.484 | 32.590 | 1.435 |
+ | 0.30780 | 3.475 | 32.298 | 1.431 | 3.475 | 32.298 | 1.431 |
+ | 0.33150 | 3.465 | 31.976 | 1.427 | 3.468 | 32.073 | 1.428 |
+ | 0.35520 | 3.459 | 31.785 | 1.425 | 3.459 | 31.785 | 1.425 |
+ | 0.37890 | 3.452 | 31.563 | 1.422 | 3.454 | 31.627 | 1.423 |
+ | 0.40260 | 3.445 | 31.343 | 1.419 | 3.447 | 31.406 | 1.420 |
+ | 0.42620 | 3.441 | 31.218 | 1.417 | 3.441 | 31.218 | 1.417 |
+ | 0.44990 | 3.437 | 31.094 | 1.416 | 3.437 | 31.094 | 1.416 |
+ | 0.47360 | 3.431 | 30.908 | 1.413 | 3.433 | 30.969 | 1.414 |
+ | 0.49730 | 3.426 | 30.753 | 1.411 | 3.428 | 30.815 | 1.412 |
+ | 0.52100 | 3.423 | 30.661 | 1.410 | 3.424 | 30.692 | 1.410 |
+ | 0.54460 | 3.419 | 30.539 | 1.408 | 3.420 | 30.569 | 1.409 |
+ | 0.56830 | 3.417 | 30.478 | 1.407 | 3.416 | 30.447 | 1.407 |
+ | 0.59200 | 3.413 | 30.356 | 1.406 | 3.413 | 30.356 | 1.406 |
+ | 0.61570 | 3.409 | 30.235 | 1.404 | 3.410 | 30.265 | 1.404 |
+ | 0.63940 | 3.404 | 30.084 | 1.402 | 3.407 | 30.175 | 1.403 |
+ | 0.66300 | 3.403 | 30.054 | 1.402 | 3.403 | 30.054 | 1.402 |
+ | 0.68670 | 3.397 | 29.874 | 1.399 | 3.401 | 29.994 | 1.401 |
+
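
The `*_ppl` and `*_bpb` columns of the results table follow from the loss columns. Perplexity is the exponential of the loss, taking the loss to be mean cross-entropy in nats per token. Bits per byte also needs the average number of UTF-8 bytes per token, which the README does not report. The value of about 3.503 used below is an assumption, inferred only because it reproduces the table's `bpb` column; the function names are hypothetical.

```python
import math

# Assumption: average bytes per token, inferred from the table rather than
# reported by the author. A different tokenizer or dataset would change it.
BYTES_PER_TOKEN = 3.503


def perplexity(loss_nats: float) -> float:
    """Per-token perplexity from mean cross-entropy loss in nats."""
    return math.exp(loss_nats)


def bits_per_byte(loss_nats: float, bytes_per_token: float = BYTES_PER_TOKEN) -> float:
    """Convert nats per token to bits per byte of text."""
    bits_per_token = loss_nats / math.log(2)
    return bits_per_token / bytes_per_token


if __name__ == "__main__":
    # First and last train_loss values from the table.
    for loss in (4.553, 3.397):
        print(f"loss={loss:.3f}  ppl={perplexity(loss):.3f}  bpb={bits_per_byte(loss):.3f}")
```

Running it on the first and last `train_loss` values gives perplexities of about 94.92 and 29.87 and bits per byte of about 1.875 and 1.399, in line with the corresponding rows.
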
+ ## Benchmarks
+
+ | Model | Parameters |
+ | --------------- | ---------- |
+ | Dillion | 1,281,384 |
+ | SupraMini-v4-2M | 8,293,888 |
+ | Tenete-8M | 2,623,104 |
+
+ | Task | Metric | Dillion | SupraMini-v4-2M | Tenete-8M |
+ | -------- | --------------- | -------: | --------------: | ---------: |
+ | ARC Easy | acc | 0.3144 | 0.3152 | n/a |
+ | ARC Easy | acc_norm | 0.3136 | n/a | 0.3194 |
+ | BLiMP | acc | 0.6294 | 0.6070 | n/a |
+ | PiQA | acc | 0.5446 | n/a | n/a |
+ | PiQA | acc_norm | 0.5310 | n/a | 0.5571 |
+ | SWAG | acc | 0.2851 | n/a | n/a |
+ | SWAG | acc_norm | 0.3036 | n/a | 0.3297 |
+ | WikiText | bits_per_byte | 1.6161 | n/a | n/a |
+ | WikiText | byte_perplexity | 3.0655 | 3.1652 | n/a |