JaydeepR/ldm-modernbert-base-sft

@@ -26,10 +26,16 @@ Unlike autoregressive models that generate text left-to-right, this model genera
 | Base model | ModernBERT-base |
 | Parameters | ~150M |
 | Architecture | Masked Language Model (diffusion objective) |
-| Pretrain data | Project Gutenberg (~6.4M chunks, seq_len=1024) |
 | SFT data | Open-Orca (~4.2M Q&A pairs) |
 | Pretrain steps | 30,000 |
 | SFT steps | 10,000 |
 ---
@@ -38,14 +44,16 @@ Unlike autoregressive models that generate text left-to-right, this model genera
 ### Pretraining
 The model is pretrained using a **flow-matching diffusion objective**: at each step, a random fraction `t` of tokens is masked, and the model learns to predict the original tokens. The loss is scaled by `1/t` to account for the difficulty of predicting heavily masked sequences.
-- Dataset: Project Gutenberg (multilingual books)
-- Final train loss: 2.92 | Final val loss: 2.96
 ### SFT (Supervised Fine-Tuning)
 Fine-tuned on Open-Orca instruction-response pairs. Loss is computed only on the response tokens (not the instruction), using a query mask to identify answer boundaries.
-- Dataset: Open-Orca
-- Final train loss: 0.84 | Final val loss: 0.97
 ---

 | Base model | ModernBERT-base |
 | Parameters | ~150M |
 | Architecture | Masked Language Model (diffusion objective) |
+| Pretrain data | Project Gutenberg (6,400,553 train chunks, seq_len=1024) |
 | SFT data | Open-Orca (~4.2M Q&A pairs) |
 | Pretrain steps | 30,000 |
 | SFT steps | 10,000 |
+| Effective batch size | 128 |
+| Pretrain LR | 5e-5 (cosine, 1500 warmup steps) |
+| SFT LR | 1e-5 (cosine, 300 warmup steps) |
+| Hardware | RTX 4090 24GB |
+| Pretrain time | ~20 hours |
+| SFT time | ~4.3 hours |
 ---
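
The two learning-rate rows describe a cosine schedule with warmup. The sketch below shows roughly what such a schedule could look like. It assumes linear warmup from zero and cosine decay to zero by the final step; the model card states neither, and `lr_at` is a hypothetical helper name, not something from the training code.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup, then cosine decay.

    Assumptions (not confirmed by the model card): warmup starts at 0
    and the cosine phase decays to 0 at `total_steps`.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With the values in the table, the pretraining schedule would be `lr_at(step, 5e-5, 1500, 30_000)` and the SFT schedule `lr_at(step, 1e-5, 300, 10_000)`.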
 ### Pretraining
 The model is pretrained using a **flow-matching diffusion objective**: at each step, a random fraction `t` of tokens is masked, and the model learns to predict the original tokens. The loss is scaled by `1/t` to account for the difficulty of predicting heavily masked sequences.
+- Dataset: Project Gutenberg (6,400,553 train chunks, 34,287 test chunks)
+- Initial train loss: 3.887 | Initial val loss: 3.922
+- Final train loss: 2.917 | Final val loss: 2.962
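
The masking objective described above can be illustrated with a small pure-Python sketch. It is not the author's training code. It assumes that `t` is drawn uniformly per sequence, that each token is masked independently with probability `t`, and that the loss is normalized by sequence length. `MASK_ID`, `mask_tokens`, `diffusion_loss`, and the `predict_proba` callable are all hypothetical names introduced for illustration.

```python
import math
import random

MASK_ID = -1  # hypothetical mask token id for this sketch


def mask_tokens(tokens, t, rng):
    """Replace each token with MASK_ID independently with probability t."""
    masked, is_masked = [], []
    for tok in tokens:
        hide = rng.random() < t
        masked.append(MASK_ID if hide else tok)
        is_masked.append(hide)
    return masked, is_masked


def diffusion_loss(tokens, predict_proba, rng, t_min=1e-3):
    """1/t-weighted cross-entropy over the masked positions of one sequence.

    predict_proba(masked_tokens) must return one {token: probability}
    dict per position (a stand-in for the model's output distribution).
    """
    t = rng.uniform(t_min, 1.0)  # masking ratio for this sequence
    masked, is_masked = mask_tokens(tokens, t, rng)
    probs = predict_proba(masked)
    nll = 0.0
    for tok, hidden, dist in zip(tokens, is_masked, probs):
        if hidden:  # only masked positions contribute to the loss
            nll -= math.log(dist[tok])
    # Scale by 1/t; dividing by sequence length is an assumption.
    return nll / (t * len(tokens))
```

Because roughly `t * len(tokens)` positions are masked, the `1/t` factor keeps the expected loss magnitude comparable across masking ratios: a sequence with few masked tokens is not drowned out by one that is almost fully masked.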
 ### SFT (Supervised Fine-Tuning)
 Fine-tuned on Open-Orca instruction-response pairs. Loss is computed only on the response tokens (not the instruction), using a query mask to identify answer boundaries.
+- Dataset: Open-Orca (~4.2M Q&A pairs)
+- Initial train loss: 1.559 | Initial val loss: 1.333
+- Final train loss: 0.837 | Final val loss: 0.967
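
The response-only loss described above can be sketched as follows. This is a minimal illustration assuming per-position losses are already available and that the query mask is a list of booleans marking instruction tokens. `response_only_loss` is a hypothetical name, and averaging over response tokens is an assumption the model card does not spell out.

```python
def response_only_loss(token_nll, query_mask):
    """Average per-token loss over response positions only.

    token_nll:  per-position negative log-likelihoods
    query_mask: True where the position belongs to the instruction
    """
    # Drop every position flagged as part of the instruction (query).
    kept = [nll for nll, is_query in zip(token_nll, query_mask) if not is_query]
    if not kept:
        return 0.0  # no response tokens in this example
    return sum(kept) / len(kept)
```

For example, `response_only_loss([5.0, 5.0, 1.0, 3.0], [True, True, False, False])` ignores the two instruction positions and averages only the last two values.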
 ---