---
language: en
tags:
- diffusion
- language-model
- masked-language-model
- modernbert
- text-generation
license: apache-2.0
---

# LDM-ModernBERT: Pretrained Language Diffusion Model

A language diffusion model built on [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), pretrained on Project Gutenberg using a masked diffusion objective.

This is the **base pretrained checkpoint** before SFT instruction tuning. For instruction following, see [JaydeepR/ldm-modernbert-base-sft](https://huggingface.co/JaydeepR/ldm-modernbert-base-sft).

![Inference GIF](inference.gif)

---

## Model Details

| Property | Value |
|---|---|
| Base model | ModernBERT-base |
| Parameters | ~150M |
| Architecture | Masked Language Model (diffusion objective) |
| Pretrain data | Project Gutenberg (6,400,553 train chunks, seq_len=1024) |
| Pretrain steps | 30,000 |
| Effective batch size | 128 |
| Learning rate | 5e-5 (cosine, 1500 warmup steps) |
| Hardware | RTX 4090 24GB |
| Training time | ~20 hours |
| Initial train loss | 3.887 |
| Initial val loss | 3.922 |
| Final train loss | 2.917 |
| Final val loss | 2.962 |

---

## Training

The model is pretrained using a **flow-matching diffusion objective**: at each step, a random fraction `t` of tokens is masked, and the model learns to predict the original tokens. The loss is scaled by `1/t` to account for the difficulty of predicting heavily masked sequences.
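
The objective described above can be illustrated with a dependency-free sketch in plain Python. It is not the training code for this checkpoint, which presumably operates on PyTorch tensors. Several details are assumptions made for illustration: each token is masked independently with probability `t`, the loss is summed over masked positions only and averaged over sequence length, and the names `mask_tokens`, `diffusion_loss`, and `MASK_ID` are hypothetical.

```python
import math
import random

MASK_ID = 4  # hypothetical mask token id for a toy 4-token vocabulary


def mask_tokens(token_ids, t, mask_id, rng):
    """Replace each token with mask_id independently with probability t."""
    masked, is_masked = [], []
    for tok in token_ids:
        hide = rng.random() < t
        masked.append(mask_id if hide else tok)
        is_masked.append(hide)
    return masked, is_masked


def diffusion_loss(log_probs, token_ids, is_masked, t):
    """Negative log-likelihood of the original tokens at masked positions,
    scaled by 1/t and averaged over sequence length (assumed normalization).

    log_probs[i][v] is the model's log-probability of token v at position i.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    nll = sum(
        -log_probs[i][tok]
        for i, (tok, hidden) in enumerate(zip(token_ids, is_masked))
        if hidden
    )
    return nll / (t * len(token_ids))


# One illustrative "training step" with a stand-in model that predicts
# a uniform distribution over the 4 real tokens at every position.
rng = random.Random(0)
tokens = [0, 1, 2, 3, 0, 1, 2, 3]
t = rng.uniform(0.05, 1.0)  # sampled masking level for this step
masked, is_masked = mask_tokens(tokens, t, MASK_ID, rng)
uniform_log_probs = [[math.log(1 / 4)] * 4 for _ in tokens]
loss = diffusion_loss(uniform_log_probs, tokens, is_masked, t)
print(masked, round(loss, 3))
```

Under these assumptions, roughly `t * len(tokens)` positions contribute to the sum, so dividing by `t` keeps the expected loss on a comparable scale across masking levels.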

---

## Inference

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
from safetensors.torch import load_file
import torch

# The tokenizer and architecture come from the base model
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

# Load this checkpoint's weights (assumes model.safetensors is in the working directory)
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict, strict=False)
model.eval()

# Unconditional generation: start from all masked tokens
seq_len = 128
input_tokens = torch.full((1, seq_len), tokenizer.mask_token_id, dtype=torch.long)
```

Or use the provided scripts from the [GitHub repo](https://github.com/jaydeepraijada/Diffusion):

```bash
# Generate GIF (unconditional)
bash create_gif.sh
```

---

## Limitations

- Trained on a relatively small dataset (Project Gutenberg) for a limited number of steps (30,000)
- No instruction tuning; use the SFT checkpoint for Q&A tasks
- Output has a literary/formal style reflecting the Gutenberg training data

---

## Citation

Built following the approach from:

- [Masked Diffusion Language Models](https://arxiv.org/abs/2406.07524)
- [PyTorch-Adventures: Language Diffusion Model](https://github.com/priyammaz/PyTorch-Adventures/tree/main/PyTorch%20for%20NLP/Language%20Diffusion%20Model) by [@priyammaz](https://github.com/priyammaz)