Upload README.md with huggingface_hub
README.md CHANGED
@@ -26,10 +26,16 @@ Unlike autoregressive models that generate text left-to-right, this model genera
 | Base model | ModernBERT-base |
 | Parameters | ~150M |
 | Architecture | Masked Language Model (diffusion objective) |
-| Pretrain data | Project Gutenberg (
 | SFT data | Open-Orca (~4.2M Q&A pairs) |
 | Pretrain steps | 30,000 |
 | SFT steps | 10,000 |

 ---

@@ -38,14 +44,16 @@ Unlike autoregressive models that generate text left-to-right, this model genera
 ### Pretraining
 The model is pretrained using a **flow-matching diffusion objective**: at each step, a random fraction `t` of tokens is masked, and the model learns to predict the original tokens. The loss is scaled by `1/t` to account for the difficulty of predicting heavily masked sequences.

-- Dataset: Project Gutenberg (
--

 ### SFT (Supervised Fine-Tuning)
 Fine-tuned on Open-Orca instruction-response pairs. Loss is computed only on the response tokens (not the instruction), using a query mask to identify answer boundaries.

-- Dataset: Open-Orca
--

 ---

@@ -26,10 +26,16 @@ Unlike autoregressive models that generate text left-to-right, this model genera
 | Base model | ModernBERT-base |
 | Parameters | ~150M |
 | Architecture | Masked Language Model (diffusion objective) |
+| Pretrain data | Project Gutenberg (6,400,553 train chunks, seq_len=1024) |
 | SFT data | Open-Orca (~4.2M Q&A pairs) |
 | Pretrain steps | 30,000 |
 | SFT steps | 10,000 |
+| Effective batch size | 128 |
+| Pretrain LR | 5e-5 (cosine, 1500 warmup steps) |
+| SFT LR | 1e-5 (cosine, 300 warmup steps) |
+| Hardware | RTX 4090 24GB |
+| Pretrain time | ~20 hours |
+| SFT time | ~4.3 hours |

 ---

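The learning-rate rows added in this hunk give only a peak value, a schedule name, and a warmup length. A minimal sketch of what such a schedule could look like follows, assuming linear warmup from zero and cosine decay to zero (neither detail is stated in the README). `lr_at` and `fraction_of_data_seen` are hypothetical helper names. The second helper does the arithmetic implied by the step count, effective batch size, and dataset size, assuming every optimizer step consumes exactly one effective batch of distinct examples.

```python
import math


def lr_at(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay.

    Assumptions (not stated in the README): warmup rises linearly from 0,
    and the cosine decays all the way to 0 at `total_steps`.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def fraction_of_data_seen(steps: int, batch_size: int, num_examples: int) -> float:
    """Share of the dataset consumed, assuming one effective batch per step."""
    return steps * batch_size / num_examples


if __name__ == "__main__":
    # Values taken from the table rows in the diff.
    for step in (0, 1500, 15750, 30000):
        print("pretrain", step, lr_at(step, 5e-5, 1500, 30_000))
    for step in (0, 300, 10_000):
        print("sft", step, lr_at(step, 1e-5, 300, 10_000))
    print("pretrain coverage", fraction_of_data_seen(30_000, 128, 6_400_553))
```

Under these assumptions, 30,000 steps at an effective batch size of 128 cover 3,840,000 chunks, roughly 0.6 of the 6,400,553 pretraining chunks, so pretraining would stop short of one full pass over the data.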
@@ -38,14 +44,16 @@ Unlike autoregressive models that generate text left-to-right, this model genera
 ### Pretraining
 The model is pretrained using a **flow-matching diffusion objective**: at each step, a random fraction `t` of tokens is masked, and the model learns to predict the original tokens. The loss is scaled by `1/t` to account for the difficulty of predicting heavily masked sequences.

+- Dataset: Project Gutenberg (6,400,553 train chunks, 34,287 test chunks)
+- Initial train loss: 3.887 | Initial val loss: 3.922
+- Final train loss: 2.917 | Final val loss: 2.962

 ### SFT (Supervised Fine-Tuning)
 Fine-tuned on Open-Orca instruction-response pairs. Loss is computed only on the response tokens (not the instruction), using a query mask to identify answer boundaries.

+- Dataset: Open-Orca (~4.2M Q&A pairs)
+- Initial train loss: 1.559 | Initial val loss: 1.333
+- Final train loss: 0.837 | Final val loss: 0.967

 ---

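The two objectives described in this hunk differ mainly in which tokens contribute to the loss. The sketch below illustrates that under stated assumptions: each token is masked independently with probability `t`, the summed negative log-likelihood over masked tokens is divided by `t`, and the result is normalized by the number of tokens eligible for the loss. The README confirms only the `1/t` scaling and the response-only loss for SFT. The normalization, the independent masking, and the names `sample_mask`, `masked_diffusion_loss`, and `in_loss` are illustrative, not the repository's code.

```python
import random
from typing import List, Optional


def sample_mask(num_tokens: int, t: float, rng: random.Random) -> List[bool]:
    """Mask each position independently with probability t (an assumption)."""
    return [rng.random() < t for _ in range(num_tokens)]


def masked_diffusion_loss(
    token_nll: List[float],
    is_masked: List[bool],
    t: float,
    in_loss: Optional[List[bool]] = None,
) -> float:
    """1/t-weighted loss over masked tokens.

    token_nll: per-token negative log-likelihood of the original token.
    is_masked: True where the input token was replaced by the mask token.
    t:         masking fraction in (0, 1].
    in_loss:   optional flags for tokens eligible for the loss, e.g. True
               only on response tokens during SFT. Defaults to all tokens.
    """
    if not 0.0 < t <= 1.0:
        raise ValueError("t must be in (0, 1]")
    if in_loss is None:
        in_loss = [True] * len(token_nll)
    # Only tokens that are both masked and eligible contribute.
    total = sum(
        nll for nll, masked, keep in zip(token_nll, is_masked, in_loss)
        if masked and keep
    )
    eligible = sum(in_loss)
    if eligible == 0:
        return 0.0
    # Assumed normalization: divide by the number of eligible tokens.
    return total / (t * eligible)


if __name__ == "__main__":
    nll = [2.0, 1.0, 4.0, 3.0]
    masked = [True, False, True, False]
    # Pretraining-style: every token is eligible.
    print(masked_diffusion_loss(nll, masked, t=0.5))  # 3.0
    # SFT-style: only the last two tokens (the "response") are eligible.
    print(masked_diffusion_loss(nll, masked, t=0.5,
                                in_loss=[False, False, True, True]))  # 4.0
```

Passing `in_loss` is the only change between the two calls, which mirrors the description above: SFT keeps the same masked-prediction objective but restricts the loss to response tokens.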