JaydeepR committed
Commit ef9776e · verified · 1 Parent(s): 038a1c1

Upload README.md with huggingface_hub

Files changed (1): README.md (+13 -5)
@@ -26,10 +26,16 @@ Unlike autoregressive models that generate text left-to-right, this model genera
  | Base model | ModernBERT-base |
  | Parameters | ~150M |
  | Architecture | Masked Language Model (diffusion objective) |
- | Pretrain data | Project Gutenberg (~6.4M chunks, seq_len=1024) |
+ | Pretrain data | Project Gutenberg (6,400,553 train chunks, seq_len=1024) |
  | SFT data | Open-Orca (~4.2M Q&A pairs) |
  | Pretrain steps | 30,000 |
  | SFT steps | 10,000 |
+ | Effective batch size | 128 |
+ | Pretrain LR | 5e-5 (cosine, 1500 warmup steps) |
+ | SFT LR | 1e-5 (cosine, 300 warmup steps) |
+ | Hardware | RTX 4090 24GB |
+ | Pretrain time | ~20 hours |
+ | SFT time | ~4.3 hours |
 
  ---
 
@@ -38,14 +44,16 @@ Unlike autoregressive models that generate text left-to-right, this model genera
  ### Pretraining
  The model is pretrained using a **flow-matching diffusion objective**: at each step, a random fraction `t` of tokens is masked, and the model learns to predict the original tokens. The loss is scaled by `1/t` to account for the difficulty of predicting heavily masked sequences.
 
- - Dataset: Project Gutenberg (multilingual books)
- - Final train loss: 2.92 | Final val loss: 2.96
+ - Dataset: Project Gutenberg (6,400,553 train chunks, 34,287 test chunks)
+ - Initial train loss: 3.887 | Initial val loss: 3.922
+ - Final train loss: 2.917 | Final val loss: 2.962
 
  ### SFT (Supervised Fine-Tuning)
  Fine-tuned on Open-Orca instruction-response pairs. Loss is computed only on the response tokens (not the instruction), using a query mask to identify answer boundaries.
 
- - Dataset: Open-Orca
- - Final train loss: 0.84 | Final val loss: 0.97
+ - Dataset: Open-Orca (~4.2M Q&A pairs)
+ - Initial train loss: 1.559 | Initial val loss: 1.333
+ - Final train loss: 0.837 | Final val loss: 0.967
 
  ---
 
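Note on the Pretraining paragraph in this diff: the objective it describes (mask a random fraction `t` of tokens, predict the originals, scale the loss by `1/t`) can be sketched in plain Python as below. This is an illustration only, not the repository's training code. `MASK_ID`, `mask_tokens`, and `diffusion_loss` are hypothetical names, and both the independent per-token masking and the normalization by sequence length are assumptions that the README does not confirm.

```python
import math
import random

MASK_ID = -1  # hypothetical mask token id, chosen only for this sketch


def mask_tokens(tokens, t, rng):
    """Replace each token with MASK_ID independently with probability t."""
    masked, is_masked = [], []
    for tok in tokens:
        hit = rng.random() < t
        masked.append(MASK_ID if hit else tok)
        is_masked.append(hit)
    return masked, is_masked


def diffusion_loss(log_probs, tokens, is_masked, t):
    """Negative log-likelihood over masked positions only, scaled by 1/t.

    log_probs[i] maps token id -> predicted log-probability at position i.
    Dividing by the sequence length is an assumption of this sketch.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    nll = sum(
        -log_probs[i][tok]
        for i, (tok, masked) in enumerate(zip(tokens, is_masked))
        if masked
    )
    return nll / (t * len(tokens))
```

Under these assumptions, about `t * len(tokens)` positions are masked on average, so dividing by `t` keeps the loss on a comparable scale whether few or many tokens are hidden.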
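Note on the SFT paragraph in this diff: computing the loss only on response tokens amounts to zeroing out the instruction positions before averaging. The sketch below assumes the instruction is a prefix of known length. `response_loss_mask` and `masked_mean` are hypothetical helpers, and the README's actual query-mask format is not shown in this commit.

```python
def response_loss_mask(query_len, total_len):
    """Return 0 for instruction (query) positions and 1 for response positions."""
    if not 0 <= query_len <= total_len:
        raise ValueError("query_len must be between 0 and total_len")
    return [0] * query_len + [1] * (total_len - query_len)


def masked_mean(per_token_loss, loss_mask):
    """Average the per-token loss over positions where the mask is 1."""
    kept = [loss for loss, keep in zip(per_token_loss, loss_mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0
```

For example, with a two-token instruction followed by a three-token response, only the last three per-token losses contribute to the average, however large the loss on the instruction tokens is.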