JaydeepR committed on
Commit
0c90906
·
verified ·
1 Parent(s): 648a5c8

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +80 -0
README.md ADDED
@@ -0,0 +1,80 @@
1
+ ---
2
+ language: en
3
+ tags:
4
+ - diffusion
5
+ - language-model
6
+ - masked-language-model
7
+ - modernbert
8
+ - text-generation
9
+ license: apache-2.0
10
+ ---
11
+
12
+ # LDM-ModernBERT — Pretrained Language Diffusion Model
13
+
14
+ A language diffusion model built on [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), pretrained on Project Gutenberg using a masked diffusion objective.
15
+
16
+ This is the **base pretrained checkpoint**, before supervised fine-tuning (SFT) for instruction following. For the instruction-tuned model, see [JaydeepR/ldm-modernbert-base-sft](https://huggingface.co/JaydeepR/ldm-modernbert-base-sft).
17
+
18
+ ![Inference GIF](inference.gif)
19
+
20
+ ---
21
+
22
+ ## Model Details
23
+
24
+ | Property | Value |
25
+ |---|---|
26
+ | Base model | ModernBERT-base |
27
+ | Parameters | ~150M |
28
+ | Architecture | Masked Language Model (diffusion objective) |
29
+ | Pretrain data | Project Gutenberg (~6.4M chunks, seq_len=1024) |
30
+ | Pretrain steps | 30,000 |
31
+ | Final train loss | 2.92 |
32
+ | Final val loss | 2.96 |
33
+
34
+ ---
35
+
36
+ ## Training
37
+
38
+ The model is pretrained using a **flow-matching diffusion objective**: at each step, a mask ratio `t` is sampled at random, that fraction of tokens is replaced by the mask token, and the model learns to predict the original tokens. The loss is scaled by `1/t`, which down-weights heavily masked sequences (large `t`) relative to lightly masked ones.
39
+
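The objective described above can be illustrated with a small framework-free sketch. It is not the training code from the repository: the independent per-token masking, the normalization by sequence length, and the names `corrupt` and `weighted_masked_loss` are all assumptions made for illustration.

```python
import math
import random

MASK = "[MASK]"  # placeholder mask symbol for this toy example


def corrupt(tokens, t, rng):
    """Replace each token with MASK independently with probability t (assumed scheme)."""
    masked = [rng.random() < t for _ in tokens]
    noisy = [MASK if m else tok for tok, m in zip(tokens, masked)]
    return noisy, masked


def weighted_masked_loss(token_log_probs, masked, t):
    """Negative log-likelihood over masked positions, scaled by 1/t.

    token_log_probs[i] is the model's log-probability of the original token
    at position i. Dividing by the sequence length is an assumption here.
    """
    nll = -sum(lp for lp, m in zip(token_log_probs, masked) if m)
    return nll / (t * len(masked))


# Toy usage: sample a mask ratio, corrupt a sentence, and score a dummy model
# that assigns probability 0.5 to every original token.
rng = random.Random(0)
tokens = ["the", "quick", "brown", "fox"]
t = rng.uniform(0.05, 1.0)  # lower bound avoids dividing by a tiny t
noisy, masked = corrupt(tokens, t, rng)
loss = weighted_masked_loss([math.log(0.5)] * len(tokens), masked, t)
```

Under these assumptions the `1/t` factor roughly cancels the expected number of masked tokens (`t` times the sequence length), so the loss stays on a comparable scale across mask ratios.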
40
+ ---
41
+
42
+ ## Inference
43
+
44
+ ```python
45
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
46
+ from safetensors.torch import load_file
47
+ import torch
48
+
49
+ # Assumption: the checkpoint reuses the base ModernBERT tokenizer
50
+ tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
51
+ model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
52
+ model.load_state_dict(load_file("model.safetensors"), strict=False)  # weights file from this repo
53
+ model.eval()  # disable dropout for inference
54
+ # Unconditional generation: start from all masked tokens
55
+ seq_len = 128
56
+ input_tokens = torch.full((1, seq_len), tokenizer.mask_token_id, dtype=torch.long)
57
+ ```
58
+
59
+ Or use the provided scripts from the [GitHub repo](https://github.com/jaydeepraijada/Diffusion):
60
+
61
+ ```bash
62
+ # Generate GIF (unconditional)
63
+ bash create_gif.sh
64
+ ```
65
+
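The Python snippet above stops at the all-mask input. As a rough illustration of how generation could proceed from there, the sketch below repeatedly predicts every position and commits the most confident predictions until no masks remain. This is one common confidence-based scheme, not necessarily what the repository's scripts do. The names `iterative_unmask` and `predict_fn` are hypothetical, and the sketch assumes `predict_fn` never proposes the mask token itself.

```python
import math


def iterative_unmask(predict_fn, seq_len, mask_id, num_steps):
    """Start fully masked and reveal the most confident predictions each step.

    predict_fn(tokens) must return one (token_id, confidence) pair per position.
    """
    tokens = [mask_id] * seq_len
    for step in range(num_steps):
        masked_positions = [i for i, tok in enumerate(tokens) if tok == mask_id]
        if not masked_positions:
            break
        predictions = predict_fn(tokens)
        # Reveal an equal share of the remaining masks on each remaining step,
        # so the final step always fills whatever is left.
        steps_left = num_steps - step
        num_to_reveal = math.ceil(len(masked_positions) / steps_left)
        # Rank the still-masked positions by confidence, highest first.
        ranked = sorted(masked_positions, key=lambda i: predictions[i][1], reverse=True)
        for i in ranked[:num_to_reveal]:
            tokens[i] = predictions[i][0]
    return tokens


# Toy usage with a dummy predictor: position i always proposes token 100 + i,
# with confidence decreasing from left to right.
def dummy_predict(tokens):
    return [(100 + i, 1.0 / (i + 1)) for i in range(len(tokens))]


generated = iterative_unmask(dummy_predict, seq_len=8, mask_id=-1, num_steps=4)
```

With the model loaded above, `predict_fn` would wrap a forward pass: run the model on the current token ids, apply a softmax to the output logits, and return the top token and its probability for each position.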
66
+ ---
67
+
68
+ ## Limitations
69
+
70
+ - Trained on a relatively small dataset (Project Gutenberg) for a limited number of steps (30,000)
71
+ - No instruction tuning — use the SFT checkpoint for Q&A tasks
72
+ - Output has a literary/formal style reflecting Gutenberg training data
73
+
74
+ ---
75
+
76
+ ## Citation
77
+
78
+ Built following the approach from:
79
+ - [Simple and Effective Masked Diffusion Language Models](https://arxiv.org/abs/2406.07524)
80
+ - [PyTorch-Adventures — Language Diffusion Model](https://github.com/priyammaz/PyTorch-Adventures/tree/main/PyTorch%20for%20NLP/Language%20Diffusion%20Model) by [@priyammaz](https://github.com/priyammaz)