Commit 6e9c6d2 (verified) by JaydeepR · 1 parent: 5fb98f3

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +97 -0
README.md ADDED
@@ -0,0 +1,97 @@
---
language: en
tags:
- diffusion
- language-model
- masked-language-model
- modernbert
- text-generation
license: apache-2.0
---

# LDM-ModernBERT: Language Diffusion Model

A language diffusion model built on [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), pretrained on Project Gutenberg and fine-tuned on Open-Orca for instruction following.

Unlike autoregressive models, which generate text left to right, this model generates text through iterative **denoising**: it starts from a fully masked sequence and progressively unmasks tokens until a coherent output emerges.
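
The denoising loop described above can be illustrated with a toy sketch in plain Python. It is not the code from this repository: the names `masked_counts` and `denoise`, the linear unmasking schedule, and the stand-in `predict` callable (which simply returns a known target sentence in place of a real model) are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"


def masked_counts(seq_len, num_steps):
    """Tokens still masked after each step, under an assumed linear schedule."""
    return [seq_len * (num_steps - step) // num_steps
            for step in range(1, num_steps + 1)]


def denoise(predict, seq_len, num_steps, rng):
    """Start fully masked and reveal tokens step by step; returns every frame."""
    seq = [MASK] * seq_len
    frames = [list(seq)]
    for keep_masked in masked_counts(seq_len, num_steps):
        guess = predict(seq)  # one predicted token per position
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        # Toy choice: reveal a random subset of the masked positions.
        # A real sampler would choose them with a remasking strategy.
        for i in rng.sample(masked, len(masked) - keep_masked):
            seq[i] = guess[i]
        frames.append(list(seq))
    return frames


target = "the cat sat on the mat".split()
# Hypothetical stand-in for the model: always predicts the target sentence.
frames = denoise(lambda seq: target, len(target), num_steps=3,
                 rng=random.Random(0))
for frame in frames:
    print(" ".join(frame))
```

Each printed frame has fewer `[MASK]` tokens than the previous one, and tokens appear in no particular order rather than left to right.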

![Inference GIF](inference.gif)

---

## Model Details

| Property | Value |
|---|---|
| Base model | ModernBERT-base |
| Parameters | ~150M |
| Architecture | Masked Language Model (diffusion objective) |
| Pretrain data | Project Gutenberg (~6.4M chunks, seq_len=1024) |
| SFT data | Open-Orca (~4.2M Q&A pairs) |
| Pretrain steps | 30,000 |
| SFT steps | 10,000 |

---

## Training

### Pretraining
The model is pretrained using a **flow-matching diffusion objective**: at each step, a random fraction `t` of tokens is masked, and the model learns to predict the original tokens. The loss is scaled by `1/t`, so sequences with few masked tokens (small `t`) are weighted more heavily; this is the weighting that arises from the masked-diffusion likelihood bound.

- Dataset: Project Gutenberg (multilingual books)
- Final train loss: 2.92 | Final val loss: 2.96
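
The pretraining objective can be sketched without any framework, so that the arithmetic is visible. This is an illustration rather than the training code used here: the function names `mask_tokens` and `weighted_masked_loss`, the independent per-token masking, the `mask_id` value, and the normalization by sequence length are all assumptions.

```python
import math
import random


def mask_tokens(token_ids, t, mask_id, rng):
    """Mask each token independently with probability t (assumed scheme)."""
    is_masked = [rng.random() < t for _ in token_ids]
    noisy = [mask_id if m else tok for tok, m in zip(token_ids, is_masked)]
    return noisy, is_masked


def weighted_masked_loss(target_log_probs, is_masked, t):
    """Cross-entropy over masked positions only, scaled by 1/t.

    target_log_probs[i] is the log-probability the model assigns to the
    original token at position i.
    """
    nll = -sum(lp for lp, m in zip(target_log_probs, is_masked) if m)
    return nll / (t * len(is_masked))


rng = random.Random(0)
tokens = [11, 12, 13, 14, 15, 16, 17, 18]
t = rng.uniform(0.1, 1.0)  # masking ratio, sampled once per example
noisy, is_masked = mask_tokens(tokens, t, mask_id=0, rng=rng)

# Stand-in for a model: pretend every original token gets probability 0.5.
target_log_probs = [math.log(0.5)] * len(tokens)
loss = weighted_masked_loss(target_log_probs, is_masked, t)
print(f"t={t:.2f} masked={sum(is_masked)} loss={loss:.3f}")
```

Under this formulation roughly `t * len(tokens)` positions are masked on average, so dividing by `t` keeps the loss on a comparable scale across masking ratios.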

### SFT (Supervised Fine-Tuning)
The model is fine-tuned on Open-Orca instruction-response pairs. The loss is computed only on the response tokens, not the instruction; a query mask marks where the instruction ends and the answer begins.

- Dataset: Open-Orca
- Final train loss: 0.84 | Final val loss: 0.97
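
One way a query mask can restrict the loss to the response is sketched below in plain Python. The function name `mask_response_only`, the choice to leave instruction tokens unmasked, and the token ids are assumptions for illustration; the actual training script may build its masks differently.

```python
import random


def mask_response_only(prompt_ids, response_ids, t, mask_id, rng):
    """Mask only response tokens; instruction tokens stay visible.

    Returns the noisy input, the query mask (True on instruction tokens),
    and the loss mask (True where the loss is computed).
    """
    input_ids = list(prompt_ids) + list(response_ids)
    query_mask = [True] * len(prompt_ids) + [False] * len(response_ids)
    noisy, loss_mask = [], []
    for tok, is_query in zip(input_ids, query_mask):
        masked = (not is_query) and rng.random() < t
        noisy.append(mask_id if masked else tok)
        loss_mask.append(masked)
    return noisy, query_mask, loss_mask


prompt = [101, 102, 103]        # hypothetical instruction token ids
response = [201, 202, 203, 204]  # hypothetical answer token ids
noisy, query_mask, loss_mask = mask_response_only(
    prompt, response, t=0.5, mask_id=0, rng=random.Random(0)
)
print(noisy)
print(loss_mask)
```

Because `loss_mask` is never true on an instruction position, the instruction conditions the prediction but contributes nothing to the loss.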

---

## Inference

The model supports two remasking strategies during generation:

- **`random`**: a random subset of the predicted tokens is re-masked at each step
- **`low_confidence`**: the predictions with the lowest confidence are re-masked at each step, which tends to produce more coherent outputs
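
The difference between the two strategies comes down to how the positions to re-mask are chosen. The sketch below is a minimal illustration in plain Python; the function name `select_remask`, its signature, and the use of the predicted token's probability as the confidence score are assumptions, not the interface of the provided scripts.

```python
import random


def select_remask(confidences, num_to_remask, strategy, rng=None):
    """Return the sorted indices of predicted positions to mask again.

    confidences[i] is the model's probability for the token it predicted
    at position i (an assumed definition of confidence).
    """
    positions = list(range(len(confidences)))
    if strategy == "random":
        rng = rng or random.Random()
        return sorted(rng.sample(positions, num_to_remask))
    if strategy == "low_confidence":
        # Stable sort: ties are broken in favor of the earlier position.
        ranked = sorted(positions, key=lambda i: confidences[i])
        return sorted(ranked[:num_to_remask])
    raise ValueError(f"unknown strategy: {strategy!r}")


confidences = [0.9, 0.2, 0.7, 0.1, 0.8]
low = select_remask(confidences, 2, "low_confidence")
rand = select_remask(confidences, 2, "random", rng=random.Random(0))
print("low_confidence re-masks:", low)
print("random re-masks:", rand)
```

With `low_confidence`, the two least certain predictions (positions 1 and 3 here) are sent back for another denoising pass, while `random` ignores the scores entirely.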
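
### Quickstart

```python
import torch
from safetensors.torch import load_file
from transformers import AutoModelForMaskedLM

# Build the ModernBERT-base architecture, then overwrite its weights
# with this repository's checkpoint.
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
state_dict = load_file("model.safetensors")  # path to the downloaded checkpoint

# strict=False tolerates key mismatches, so report them rather than
# letting a partially loaded model go unnoticed.
result = model.load_state_dict(state_dict, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)

model.eval()  # disable dropout for inference
```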

Or use the provided inference scripts:

```bash
# Interactive inference
bash inference.sh

# Generate GIF
bash create_gif.sh
```

---

## Limitations

- Trained on a relatively small dataset (Project Gutenberg) for a limited number of steps, so output quality is lower than that of production-scale models
- SFT data was truncated to 1024 tokens; very long responses may be cut off
- No RLHF or safety fine-tuning was applied

---

## Citation

Built following the approach from:
- [Masked Diffusion Language Models](https://arxiv.org/abs/2406.07524)
- [PyTorch-Adventures: Language Diffusion Model](https://github.com/priyammaz/PyTorch-Adventures/tree/main/PyTorch%20for%20NLP/Language%20Diffusion%20Model) by [@priyammaz](https://github.com/priyammaz)