@@ -14,9 +14,9 @@ tags:
 - pytorch
 ---
-# Dillion
-## Summary
 ```
 Task: Text-Generation
@@ -28,7 +28,85 @@ Framework: PyTorch, transformers
 Author: Paul Courneya (Harley-ml)
 ```
-## Description
 Dillion is a 1.2M-parameter language model trained on ~9B tokens of FineWeb-edu. Our goal was to make one of the best sub-1.5M-parameter LMs through depth (12 layers!) and huge overtraining (~8900 tokens per parameter).
-Dillion beats or ties with models much larger than itself such as SupraMini-v4-2M and Tenete-8M.

 - pytorch
 ---
+# **Dillion**
+## **Summary**
 ```
 Task: Text-Generation
 Author: Paul Courneya (Harley-ml)
 ```
+## **Description**
 Dillion is a 1.2M-parameter language model trained on ~9B tokens of FineWeb-edu. Our goal was to make one of the best sub-1.5M-parameter LMs through depth (12 layers!) and huge overtraining (~8900 tokens per parameter).
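+Dillion is competitive with much larger models: it beats SupraMini-v4-2M on BLiMP and WikiText byte perplexity and roughly ties it on ARC Easy, while trailing Tenete-8M by under three points on ARC Easy, PiQA, and SWAG.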
+## Architecture
+Dillion-1.2M uses the Qwen3.5 architecture.
+| Parameter                 | Value            |
+| ------------------------- | ---------------- |
+| `NUM_HIDDEN_LAYERS`       | `12`             |
+| `MAX_WINDOW_LAYERS`       | `12`             |
+| `HIDDEN_SIZE`             | `72`             |
+| `NUM_ATTENTION_HEADS`     | `3`              |
+| `NUM_KEY_VALUE_HEADS`     | `3`              |
+| `VOCAB_SIZE`              | `3076`           |
+| `INTERMEDIATE_SIZE`       | `288`            |
+| `ROPE_THETA`              | `10000.0`        |
+| `MAX_POSITION_EMBEDDINGS` | `384`            |
+| `LAYER_TYPES`             | `full_attention` |
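The table above is enough to estimate the parameter count by hand. The following is a minimal sketch, not the author's code. It assumes several details the card does not state: a head dimension of `HIDDEN_SIZE / NUM_ATTENTION_HEADS`, no bias terms, tied input and output embeddings, per-head query/key norms, and a query projection that also emits an output gate. Under those assumptions the total comes to 1,281,384, which agrees with the parameter count listed under Benchmarks; that agreement supports the assumed layout but does not prove it.

```python
# Rough parameter count for the configuration in the architecture table.
# Everything marked "assumed" is NOT stated in the model card.

HIDDEN_SIZE = 72
NUM_HIDDEN_LAYERS = 12
NUM_ATTENTION_HEADS = 3
NUM_KEY_VALUE_HEADS = 3
VOCAB_SIZE = 3076
INTERMEDIATE_SIZE = 288

# Assumed: head_dim is hidden_size / num_heads (24 here).
HEAD_DIM = HIDDEN_SIZE // NUM_ATTENTION_HEADS


def count_parameters() -> int:
    """Sum the weights of a bias-free decoder with tied embeddings (assumed)."""
    q_dim = NUM_ATTENTION_HEADS * HEAD_DIM
    kv_dim = NUM_KEY_VALUE_HEADS * HEAD_DIM

    attention = (
        HIDDEN_SIZE * q_dim * 2  # q_proj: query plus output gate (assumed)
        + HIDDEN_SIZE * kv_dim   # k_proj
        + HIDDEN_SIZE * kv_dim   # v_proj
        + q_dim * HIDDEN_SIZE    # o_proj
        + 2 * HEAD_DIM           # q_norm and k_norm weights (assumed)
    )
    mlp = 3 * HIDDEN_SIZE * INTERMEDIATE_SIZE  # gate, up, and down projections
    norms = 2 * HIDDEN_SIZE                    # two RMSNorm weights per layer
    per_layer = attention + mlp + norms

    embeddings = VOCAB_SIZE * HIDDEN_SIZE      # shared with the LM head (assumed)
    final_norm = HIDDEN_SIZE
    return embeddings + NUM_HIDDEN_LAYERS * per_layer + final_norm


total = count_parameters()
print(f"{total:,}")  # 1,281,384
```

Roughly 17% of the weights sit in the embedding matrix under this breakdown, so the small 3076-token vocabulary is what leaves room for 12 layers at this size.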
+## Training
+### Hardware
+We trained Dillion for 0.71 epochs over a 14B-token FineWeb-edu set, so the model saw only ~9B tokens. Training ran on an RTX 2060 (6 GB) with a batch size of 72 and 4 gradient accumulation steps.
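The batch settings above translate into an effective batch size and a rough step count. This is a back-of-the-envelope sketch only. `SEQ_LEN` is an assumption: the card does not give the training sequence length, so the value of `MAX_POSITION_EMBEDDINGS` is used as a stand-in. The product of 0.71 epochs and 14B tokens also lands slightly above the card's rounded ~9B figure.

```python
# Effective batch and step arithmetic from the stated training setup.

BATCH_SIZE = 72          # from the card
GRAD_ACCUM_STEPS = 4     # from the card
DATASET_TOKENS = 14e9    # from the card
EPOCHS = 0.71            # from the card
SEQ_LEN = 384            # assumed: equal to MAX_POSITION_EMBEDDINGS

# Sequences that contribute to one optimizer update.
sequences_per_step = BATCH_SIZE * GRAD_ACCUM_STEPS

# Tokens per optimizer update, if every sequence is full length (assumed).
tokens_per_step = sequences_per_step * SEQ_LEN

# Tokens actually seen and the implied number of optimizer updates.
tokens_seen = EPOCHS * DATASET_TOKENS
optimizer_steps = tokens_seen / tokens_per_step

print(f"sequences per step: {sequences_per_step}")
print(f"tokens per step:    {tokens_per_step:,}")
print(f"tokens seen:        {tokens_seen / 1e9:.2f}B")
print(f"optimizer steps:    ~{optimizer_steps:,.0f}")
```

If the real sequence length differs from 384, only `tokens_per_step` and `optimizer_steps` change; the effective batch of 288 sequences follows directly from the card.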
+### Training Results
+| epoch   | train_loss | train_ppl | train_bpb | eval_loss | eval_ppl | eval_bpb |
+| ------- | ---------: | --------: | --------: | --------: | -------: | -------: |
+| 0.02368 |      4.553 |    94.917 |     1.875 |     4.492 |   89.300 |    1.850 |
+| 0.04736 |      3.958 |    52.353 |     1.630 |     3.943 |   51.573 |    1.624 |
+| 0.07104 |      3.763 |    43.077 |     1.550 |     3.758 |   42.863 |    1.548 |
+| 0.09472 |      3.672 |    39.330 |     1.512 |     3.670 |   39.252 |    1.511 |
+| 0.11840 |      3.620 |    37.338 |     1.491 |     3.620 |   37.338 |    1.491 |
+| 0.14210 |      3.584 |    36.017 |     1.476 |     3.586 |   36.089 |    1.477 |
+| 0.16580 |      3.557 |    35.058 |     1.465 |     3.558 |   35.093 |    1.465 |
+| 0.18940 |      3.538 |    34.398 |     1.457 |     3.536 |   34.329 |    1.456 |
+| 0.21310 |      3.520 |    33.784 |     1.450 |     3.520 |   33.784 |    1.450 |
+| 0.23680 |      3.504 |    33.248 |     1.443 |     3.507 |   33.348 |    1.444 |
+| 0.26050 |      3.494 |    32.917 |     1.439 |     3.494 |   32.917 |    1.439 |
+| 0.28420 |      3.483 |    32.557 |     1.434 |     3.484 |   32.590 |    1.435 |
+| 0.30780 |      3.475 |    32.298 |     1.431 |     3.475 |   32.298 |    1.431 |
+| 0.33150 |      3.465 |    31.976 |     1.427 |     3.468 |   32.073 |    1.428 |
+| 0.35520 |      3.459 |    31.785 |     1.425 |     3.459 |   31.785 |    1.425 |
+| 0.37890 |      3.452 |    31.563 |     1.422 |     3.454 |   31.627 |    1.423 |
+| 0.40260 |      3.445 |    31.343 |     1.419 |     3.447 |   31.406 |    1.420 |
+| 0.42620 |      3.441 |    31.218 |     1.417 |     3.441 |   31.218 |    1.417 |
+| 0.44990 |      3.437 |    31.094 |     1.416 |     3.437 |   31.094 |    1.416 |
+| 0.47360 |      3.431 |    30.908 |     1.413 |     3.433 |   30.969 |    1.414 |
+| 0.49730 |      3.426 |    30.753 |     1.411 |     3.428 |   30.815 |    1.412 |
+| 0.52100 |      3.423 |    30.661 |     1.410 |     3.424 |   30.692 |    1.410 |
+| 0.54460 |      3.419 |    30.539 |     1.408 |     3.420 |   30.569 |    1.409 |
+| 0.56830 |      3.417 |    30.478 |     1.407 |     3.416 |   30.447 |    1.407 |
+| 0.59200 |      3.413 |    30.356 |     1.406 |     3.413 |   30.356 |    1.406 |
+| 0.61570 |      3.409 |    30.235 |     1.404 |     3.410 |   30.265 |    1.404 |
+| 0.63940 |      3.404 |    30.084 |     1.402 |     3.407 |   30.175 |    1.403 |
+| 0.66300 |      3.403 |    30.054 |     1.402 |     3.403 |   30.054 |    1.402 |
+| 0.68670 |      3.397 |    29.874 |     1.399 |     3.401 |   29.994 |    1.401 |
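The three metrics in each row are different views of the same cross-entropy loss. The sketch below assumes the loss is in nats per token, that perplexity is `exp(loss)`, and that bits per byte divides the loss in bits by an average number of bytes per token. The card does not report that average, so it is inferred here from the first row of the table (about 3.5 bytes per token) and then checked against the last row, where it should land on the listed 29.874 and 1.399.

```python
import math


def perplexity(loss: float) -> float:
    """Token-level perplexity from a cross-entropy loss in nats."""
    return math.exp(loss)


def bits_per_byte(loss: float, bytes_per_token: float) -> float:
    """Convert nats per token to bits per byte."""
    return loss / (math.log(2) * bytes_per_token)


# Infer the average bytes per token from the first table row (assumption:
# the same tokenizer statistics apply to every row).
first_loss, first_bpb = 4.553, 1.875
bytes_per_token = first_loss / (math.log(2) * first_bpb)

# Check the inferred ratio against the last row (train_loss = 3.397).
last_loss = 3.397
last_ppl = perplexity(last_loss)
last_bpb = bits_per_byte(last_loss, bytes_per_token)

print(f"bytes per token: {bytes_per_token:.3f}")
print(f"train_ppl:       {last_ppl:.3f}")
print(f"train_bpb:       {last_bpb:.3f}")
```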
+## Benchmarks
+| Model           | Parameters |
+| --------------- | ---------- |
+| Dillion         | 1,281,384  |
+| SupraMini-v4-2M | 2,623,104  |
+| Tenete-8M       | 8,293,888  |
+
+| Task     | Metric          |  Dillion | SupraMini-v4-2M |  Tenete-8M |
+| -------- | --------------- | -------: | --------------: | ---------: |
+| ARC Easy | acc             |   0.3144 |          0.3152 |          — |
+| ARC Easy | acc_norm        |   0.3136 |               — |     0.3194 |
+| BLiMP    | acc             |   0.6294 |          0.6070 |          — |
+| PiQA     | acc             |   0.5446 |               — |          — |
+| PiQA     | acc_norm        |   0.5310 |               — |     0.5571 |
+| SWAG     | acc             |   0.2851 |               — |          — |
+| SWAG     | acc_norm        |   0.3036 |               — |     0.3297 |
+| WikiText | bits_per_byte   |   1.6161 |               — |          — |
+| WikiText | byte_perplexity |   3.0655 |          3.1652 |          — |
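The two WikiText rows are related by `byte_perplexity = 2 ** bits_per_byte`, so either one can be recovered from the other. The sketch below assumes that standard definition. It checks Dillion's two reported values against each other and converts SupraMini-v4-2M's byte perplexity to bits per byte so the two models can be compared on one scale.

```python
import math


def byte_perplexity_from_bpb(bpb: float) -> float:
    """byte_perplexity = 2 ** bits_per_byte."""
    return 2.0 ** bpb


def bpb_from_byte_perplexity(byte_ppl: float) -> float:
    """Inverse of the above: bits_per_byte = log2(byte_perplexity)."""
    return math.log2(byte_ppl)


# Values taken from the benchmark table.
dillion_bpb = 1.6161
dillion_byte_ppl = 3.0655
supramini_byte_ppl = 3.1652

# Consistency check on Dillion's two reported numbers.
recomputed = byte_perplexity_from_bpb(dillion_bpb)

# SupraMini-v4-2M on the bits-per-byte scale (not reported in the table).
supramini_bpb = bpb_from_byte_perplexity(supramini_byte_ppl)

print(f"Dillion byte_perplexity (recomputed): {recomputed:.4f}")
print(f"SupraMini-v4-2M bits_per_byte:        {supramini_bpb:.4f}")
print(f"Dillion advantage in bits per byte:   {supramini_bpb - dillion_bpb:.4f}")
```

Lower is better on both scales, so a positive difference on the last line means Dillion compresses WikiText slightly better than SupraMini-v4-2M.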