JaydeepR/ldm-modernbert-base-pretrain

+---
+language: en
+tags:
+  - diffusion
+  - language-model
+  - masked-language-model
+  - modernbert
+  - text-generation
+license: apache-2.0
+---
+# LDM-ModernBERT — Pretrained Language Diffusion Model
+A language diffusion model built on [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), pretrained on Project Gutenberg using a masked diffusion objective.
+This is the **base pretrained checkpoint**, before supervised fine-tuning (SFT) for instruction following. For the instruction-tuned model, see [JaydeepR/ldm-modernbert-base-sft](https://huggingface.co/JaydeepR/ldm-modernbert-base-sft).
+![Inference GIF](inference.gif)
+---
+## Model Details
+| Property | Value |
+|---|---|
+| Base model | ModernBERT-base |
+| Parameters | ~150M |
+| Architecture | Masked Language Model (diffusion objective) |
+| Pretrain data | Project Gutenberg (~6.4M chunks, seq_len=1024) |
+| Pretrain steps | 30,000 |
+| Final train loss | 2.92 |
+| Final val loss | 2.96 |
+---
+## Training
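+The model is pretrained with a **masked diffusion objective**: at each step, a masking ratio `t` is sampled at random, roughly that fraction of tokens is replaced with the mask token, and the model learns to predict the original tokens at the masked positions. The loss is scaled by `1/t`; since the number of masked tokens grows with `t`, this keeps lightly and heavily masked sequences contributing on a comparable scale.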
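The objective described above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the training code used for this checkpoint: the helper names `mask_tokens` and `masked_diffusion_loss`, the lower bound `eps` on `t`, and the normalization by total token count are all hypothetical choices, and random logits stand in for the model output.

```python
import torch
import torch.nn.functional as F


def mask_tokens(input_ids, mask_token_id, eps=1e-3, generator=None):
    """Sample a masking ratio t per sequence and mask each token with probability t."""
    batch, seq_len = input_ids.shape
    # t ~ U(eps, 1); eps (an assumed value) keeps the 1/t weight finite
    t = (1 - eps) * torch.rand(batch, generator=generator) + eps
    masked = torch.rand(batch, seq_len, generator=generator) < t[:, None]
    noisy = torch.where(masked, torch.full_like(input_ids, mask_token_id), input_ids)
    return noisy, masked, t


def masked_diffusion_loss(logits, input_ids, masked, t):
    """Cross-entropy on masked positions only, weighted by 1/t, averaged over all tokens."""
    batch, seq_len, vocab = logits.shape
    ce = F.cross_entropy(
        logits.reshape(-1, vocab), input_ids.reshape(-1), reduction="none"
    ).reshape(batch, seq_len)
    weighted = ce * masked.float() / t[:, None]
    return weighted.sum() / (batch * seq_len)


# Toy usage: random logits stand in for model(noisy).logits
g = torch.Generator().manual_seed(0)
vocab_size, mask_id = 100, 99
ids = torch.randint(0, 99, (4, 16), generator=g)
noisy, masked, t = mask_tokens(ids, mask_id, generator=g)
logits = torch.randn(4, 16, vocab_size, generator=g)
loss = masked_diffusion_loss(logits, ids, masked, t)
```

In an actual training step, `logits` would come from the masked language model applied to `noisy`, and `loss.backward()` would follow.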
+---
+## Inference
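+```python
+import torch
+from safetensors.torch import load_file
+from transformers import AutoModelForMaskedLM, AutoTokenizer
+# Tokenizer and architecture come from the base model
+tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
+model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
+# Load the pretrained diffusion weights (model.safetensors, downloaded locally)
+state_dict = load_file("model.safetensors")
+# strict=False tolerates key mismatches between the checkpoint and the base model
+model.load_state_dict(state_dict, strict=False)
+model.eval()
+# Unconditional generation: start from all masked tokens
+seq_len = 128
+input_tokens = torch.full((1, seq_len), tokenizer.mask_token_id, dtype=torch.long)
+```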
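Starting from the all-masked sequence, generation proceeds by repeatedly predicting the masked positions and committing some of them. The sketch below shows one common schedule, confidence-based unmasking, in which each step keeps the most confident predictions and leaves the rest masked. It is an assumption for illustration and may differ from what the repository scripts do; `iterative_unmask`, `logits_fn`, and `num_steps` are hypothetical names, and a random-logits function stands in for the model so the block runs on its own.

```python
import torch


@torch.no_grad()
def iterative_unmask(logits_fn, tokens, mask_token_id, num_steps=16):
    """Confidence-based unmasking: commit the most confident predictions each step."""
    tokens = tokens.clone()
    for step in range(num_steps):
        masked = tokens == mask_token_id
        if not bool(masked.any()):
            break
        logits = logits_fn(tokens).clone()            # (batch, seq_len, vocab)
        logits[..., mask_token_id] = float("-inf")    # never predict the mask token itself
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~masked, -1.0)        # only masked slots compete
        for b in range(tokens.shape[0]):
            n_masked = int(masked[b].sum())
            if n_masked == 0:
                continue
            # Reveal an equal share of the remaining masks; the last step reveals all
            k = max(1, n_masked // (num_steps - step))
            idx = conf[b].topk(k).indices
            tokens[b, idx] = pred[b, idx]
    return tokens


# Toy usage: random logits stand in for the real model
vocab_size, mask_id = 50, 49
toy_logits_fn = lambda x: torch.randn(*x.shape, vocab_size)
start = torch.full((1, 12), mask_id, dtype=torch.long)
out = iterative_unmask(toy_logits_fn, start, mask_id, num_steps=4)
```

With the loaded checkpoint, `logits_fn` could be `lambda x: model(input_ids=x).logits`, called with `input_tokens` and `tokenizer.mask_token_id`, and the result decoded with `tokenizer.decode(out[0])`.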
+Or use the provided scripts from the [GitHub repo](https://github.com/jaydeepraijada/Diffusion):
+```bash
+# Generate GIF (unconditional)
+bash create_gif.sh
+```
+---
+## Limitations
+- Trained on a single, relatively small corpus (Project Gutenberg, ~6.4M chunks) for only 30,000 steps
+- No instruction tuning; use the SFT checkpoint for Q&A tasks
+- Output has a literary/formal style reflecting Gutenberg training data
+---
+## Citation
+Built following the approach from:
+- [Simple and Effective Masked Diffusion Language Models](https://arxiv.org/abs/2406.07524)
+- [PyTorch-Adventures — Language Diffusion Model](https://github.com/priyammaz/PyTorch-Adventures/tree/main/PyTorch%20for%20NLP/Language%20Diffusion%20Model) by [@priyammaz](https://github.com/priyammaz)