Update README.md
README.md
CHANGED
@@ -14,9 +14,9 @@ tags:
 - pytorch
 ---

-# Dillion
+# **Dillion**

-## Summary
+## **Summary**

 ```
 Task: Text-Generation
@@ -28,7 +28,85 @@ Framework: PyTorch, transformers
 Author: Paul Courneya (Harley-ml)
 ```

-## Description
+## **Description**

 Dillion is a 1.2M parameter language model trained on ~9B tokens of FineWeb-edu. Our goal was to make one of the best sub-1.5M parameter LMs through depth (12 layers!) and huge overtraining (~8900 tokens per parameter.)
-Dillion beats or ties with models much larger than itself such as SupraMini-v4-2M and Tenete-8M.
+Dillion beats or ties with models much larger than itself such as SupraMini-v4-2M and Tenete-8M.

## Architecture

Dillion-1.2M uses the Qwen3.5 architecture.

| Parameter                 | Value            |
| ------------------------- | ---------------- |
| `NUM_HIDDEN_LAYERS`       | `12`             |
| `MAX_WINDOW_LAYERS`       | `12`             |
| `HIDDEN_SIZE`             | `72`             |
| `NUM_ATTENTION_HEADS`     | `3`              |
| `NUM_KEY_VALUE_HEADS`     | `3`              |
| `VOCAB_SIZE`              | `3076`           |
| `INTERMEDIATE_SIZE`       | `288`            |
| `ROPE_THETA`              | `10000.0`        |
| `MAX_POSITION_EMBEDDINGS` | `384`            |
| `LAYER_TYPES`             | `full_attention` |
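As a rough cross-check of the table above, the parameter count can be tallied by hand. The sketch below is not taken from the model's code, and the layer layout is an assumption: a head dimension of 24 (`HIDDEN_SIZE / NUM_ATTENTION_HEADS`), bias-free linear layers, a query projection that is twice as wide because it also produces an output gate, RMSNorm weights on queries and keys, a three-projection gated MLP, and input and output embeddings that share weights. Under those assumptions the total comes to 1,281,384, the same figure listed for Dillion in the Benchmarks parameter table; the actual implementation may arrive at that number differently.

```python
# Hand tally of Dillion's parameters from the architecture table.
# The layer layout below is an assumption, not read from the checkpoint.
HIDDEN_SIZE = 72
NUM_HIDDEN_LAYERS = 12
NUM_ATTENTION_HEADS = 3
NUM_KEY_VALUE_HEADS = 3
VOCAB_SIZE = 3076
INTERMEDIATE_SIZE = 288
HEAD_DIM = HIDDEN_SIZE // NUM_ATTENTION_HEADS  # assumed: 24


def attention_params() -> int:
    """Weights in one attention block under the assumed layout."""
    q_width = NUM_ATTENTION_HEADS * HEAD_DIM
    kv_width = NUM_KEY_VALUE_HEADS * HEAD_DIM
    q_proj = HIDDEN_SIZE * (2 * q_width)  # assumed: query plus output gate
    k_proj = HIDDEN_SIZE * kv_width
    v_proj = HIDDEN_SIZE * kv_width
    o_proj = q_width * HIDDEN_SIZE
    qk_norm = 2 * HEAD_DIM  # assumed: RMSNorm weights on queries and keys
    return q_proj + k_proj + v_proj + o_proj + qk_norm


def mlp_params() -> int:
    """Gate, up, and down projections, all without biases (assumed)."""
    return 3 * HIDDEN_SIZE * INTERMEDIATE_SIZE


def layer_params() -> int:
    """One decoder layer: attention, MLP, and two RMSNorm weight vectors."""
    norms = 2 * HIDDEN_SIZE
    return attention_params() + mlp_params() + norms


def total_params() -> int:
    """Whole model, with the embedding assumed tied to the output head."""
    embedding = VOCAB_SIZE * HIDDEN_SIZE
    final_norm = HIDDEN_SIZE
    return embedding + NUM_HIDDEN_LAYERS * layer_params() + final_norm


if __name__ == "__main__":
    print(f"per layer: {layer_params():,}")
    print(f"total:     {total_params():,}")
```

Under this accounting the embedding table alone holds 221,472 weights, about 17% of the model.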

## Training

### Hardware

We trained Dillion for 0.71 epochs over a 14B-token sample of FineWeb-edu, so the model saw only ~9B tokens. Training ran on an RTX 2060 (6 GB) with a batch size of 72 and 4 gradient accumulation steps.
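The batch settings above imply an effective batch size per optimizer step. The following is a quick sketch of that arithmetic. It assumes, and the source does not state this, that 72 is the micro-batch size and that every sequence is packed to the full `MAX_POSITION_EMBEDDINGS` length of 384 tokens; the step count is therefore only a ballpark figure.

```python
# Effective batch arithmetic for the reported training settings.
BATCH_SIZE = 72         # from the text
GRAD_ACCUM_STEPS = 4    # from the text
SEQ_LEN = 384           # assumption: full-length packed sequences
TOKENS_SEEN = 9e9       # "~9B" from the text, approximate

# Sequences contributing to one optimizer update.
sequences_per_step = BATCH_SIZE * GRAD_ACCUM_STEPS

# Tokens per optimizer update if every sequence is full length.
tokens_per_step = sequences_per_step * SEQ_LEN

# Rough number of optimizer updates needed to consume the tokens seen.
approx_optimizer_steps = TOKENS_SEEN / tokens_per_step

print(f"sequences per optimizer step: {sequences_per_step}")
print(f"tokens per optimizer step:    {tokens_per_step:,}")
print(f"approx. optimizer steps:      {approx_optimizer_steps:,.0f}")
```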

### Training Results

| epoch   | train_loss | train_ppl | train_bpb | eval_loss | eval_ppl | eval_bpb |
| ------- | ---------: | --------: | --------: | --------: | -------: | -------: |
| 0.02368 |      4.553 |    94.917 |     1.875 |     4.492 |   89.300 |    1.850 |
| 0.04736 |      3.958 |    52.353 |     1.630 |     3.943 |   51.573 |    1.624 |
| 0.07104 |      3.763 |    43.077 |     1.550 |     3.758 |   42.863 |    1.548 |
| 0.09472 |      3.672 |    39.330 |     1.512 |     3.670 |   39.252 |    1.511 |
| 0.11840 |      3.620 |    37.338 |     1.491 |     3.620 |   37.338 |    1.491 |
| 0.14210 |      3.584 |    36.017 |     1.476 |     3.586 |   36.089 |    1.477 |
| 0.16580 |      3.557 |    35.058 |     1.465 |     3.558 |   35.093 |    1.465 |
| 0.18940 |      3.538 |    34.398 |     1.457 |     3.536 |   34.329 |    1.456 |
| 0.21310 |      3.520 |    33.784 |     1.450 |     3.520 |   33.784 |    1.450 |
| 0.23680 |      3.504 |    33.248 |     1.443 |     3.507 |   33.348 |    1.444 |
| 0.26050 |      3.494 |    32.917 |     1.439 |     3.494 |   32.917 |    1.439 |
| 0.28420 |      3.483 |    32.557 |     1.434 |     3.484 |   32.590 |    1.435 |
| 0.30780 |      3.475 |    32.298 |     1.431 |     3.475 |   32.298 |    1.431 |
| 0.33150 |      3.465 |    31.976 |     1.427 |     3.468 |   32.073 |    1.428 |
| 0.35520 |      3.459 |    31.785 |     1.425 |     3.459 |   31.785 |    1.425 |
| 0.37890 |      3.452 |    31.563 |     1.422 |     3.454 |   31.627 |    1.423 |
| 0.40260 |      3.445 |    31.343 |     1.419 |     3.447 |   31.406 |    1.420 |
| 0.42620 |      3.441 |    31.218 |     1.417 |     3.441 |   31.218 |    1.417 |
| 0.44990 |      3.437 |    31.094 |     1.416 |     3.437 |   31.094 |    1.416 |
| 0.47360 |      3.431 |    30.908 |     1.413 |     3.433 |   30.969 |    1.414 |
| 0.49730 |      3.426 |    30.753 |     1.411 |     3.428 |   30.815 |    1.412 |
| 0.52100 |      3.423 |    30.661 |     1.410 |     3.424 |   30.692 |    1.410 |
| 0.54460 |      3.419 |    30.539 |     1.408 |     3.420 |   30.569 |    1.409 |
| 0.56830 |      3.417 |    30.478 |     1.407 |     3.416 |   30.447 |    1.407 |
| 0.59200 |      3.413 |    30.356 |     1.406 |     3.413 |   30.356 |    1.406 |
| 0.61570 |      3.409 |    30.235 |     1.404 |     3.410 |   30.265 |    1.404 |
| 0.63940 |      3.404 |    30.084 |     1.402 |     3.407 |   30.175 |    1.403 |
| 0.66300 |      3.403 |    30.054 |     1.402 |     3.403 |   30.054 |    1.402 |
| 0.68670 |      3.397 |    29.874 |     1.399 |     3.401 |   29.994 |    1.401 |
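The three metrics in each row are different views of the same cross-entropy loss. Perplexity is `exp(loss)`, and bits per byte is the loss converted to bits and divided by the average number of bytes per token. The sketch below checks this against the first and last training rows. The bytes-per-token ratio is not given in the source; it is inferred here from the table itself (about 3.5), so treat it as an assumption, and the function names are illustrative.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Per-token perplexity from a cross-entropy loss in nats."""
    return math.exp(loss_nats)


def implied_bytes_per_token(loss_nats: float, bits_per_byte: float) -> float:
    """Average bytes per token implied by a (loss, bits-per-byte) pair."""
    bits_per_token = loss_nats / math.log(2)
    return bits_per_token / bits_per_byte


def bits_per_byte(loss_nats: float, bytes_per_token: float) -> float:
    """Bits per byte from a loss in nats and a bytes-per-token ratio."""
    return loss_nats / (math.log(2) * bytes_per_token)


# (train_loss, train_ppl, train_bpb) copied from the first and last table rows.
first_row = (4.553, 94.917, 1.875)
last_row = (3.397, 29.874, 1.399)

for loss, ppl, bpb in (first_row, last_row):
    print(
        f"loss={loss}: ppl={perplexity(loss):.3f} (table {ppl}), "
        f"implied bytes/token={implied_bytes_per_token(loss, bpb):.3f}"
    )
```

Both rows imply nearly the same bytes-per-token ratio, which is what one would expect if the tokenizer and the data distribution stayed fixed across the run.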

## Benchmarks

| Model           | Parameters |
| --------------- | ---------- |
| Dillion         | 1,281,384  |
| SupraMini-v4-2M | 8,293,888  |
| Tenete-8M       | 2,623,104  |

| Task     | Metric          | Dillion | SupraMini-v4-2M | Tenete-8M |
| -------- | --------------- | ------: | --------------: | --------: |
| ARC Easy | acc             |  0.3144 |          0.3152 |         - |
| ARC Easy | acc_norm        |  0.3136 |               - |    0.3194 |
| BLiMP    | acc             |  0.6294 |          0.6070 |         - |
| PiQA     | acc             |  0.5446 |               - |         - |
| PiQA     | acc_norm        |  0.5310 |               - |    0.5571 |
| SWAG     | acc             |  0.2851 |               - |         - |
| SWAG     | acc_norm        |  0.3036 |               - |    0.3297 |
| WikiText | bits_per_byte   |  1.6161 |               - |         - |
| WikiText | byte_perplexity |  3.0655 |          3.1652 |         - |
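For the WikiText rows, byte perplexity and bits per byte are two encodings of the same quantity: `byte_perplexity = 2 ** bits_per_byte`. Dillion's two WikiText values are consistent with that relation, and it also allows a comparison with SupraMini-v4-2M on the bits-per-byte scale even though only byte perplexity is listed for that model. A small sketch of the conversion follows; the function names are illustrative and not part of any evaluation library.

```python
import math


def byte_perplexity_from_bpb(bits_per_byte: float) -> float:
    """Convert bits per byte to byte-level perplexity."""
    return 2.0 ** bits_per_byte


def bpb_from_byte_perplexity(byte_perplexity: float) -> float:
    """Convert byte-level perplexity back to bits per byte."""
    return math.log2(byte_perplexity)


dillion_bpb = 1.6161          # from the table
supramini_byte_ppl = 3.1652   # from the table

print(f"Dillion byte perplexity: {byte_perplexity_from_bpb(dillion_bpb):.4f}")
print(
    "SupraMini-v4-2M bits per byte: "
    f"{bpb_from_byte_perplexity(supramini_byte_ppl):.4f}"
)
```

Lower is better on both scales, so the ordering of the two models is the same whichever form is used.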