# ModernBERT-Diffusion-Pretrained-20260119
A ModernBERT-large model pretrained as a diffusion language model on high-quality web text.
## Model Description
This model extends ModernBERT with diffusion-style training:
- Variable masking ratio: 15-80% of tokens masked per sample
- Parallel prediction: All masked tokens predicted simultaneously
- Iterative refinement: Generate text by progressively unmasking
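The first two points can be sketched as a masking step: draw a ratio uniformly from [0.15, 0.80] per sample, mask that fraction of positions, and make all of them prediction targets at once. This is a minimal illustration, not the training code; `MASK_ID` and the function name are hypothetical.

```python
import random

MASK_ID = 50284  # hypothetical [MASK] token id, for illustration only

def mask_for_diffusion(token_ids, rng=random, lo=0.15, hi=0.80):
    """Mask a variable fraction of tokens, diffusion-style.

    A masking ratio is drawn uniformly from [lo, hi] for each sample,
    then that fraction of positions is replaced with MASK_ID. Every
    masked position becomes a prediction target simultaneously.
    """
    ratio = rng.uniform(lo, hi)
    n_mask = max(1, round(ratio * len(token_ids)))
    positions = rng.sample(range(len(token_ids)), n_mask)
    masked = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = position ignored by the loss
    for p in positions:
        labels[p] = masked[p]  # original token is the target
        masked[p] = MASK_ID
    return masked, labels

masked, labels = mask_for_diffusion(list(range(100)))
```

With a 100-token sample, between 15 and 80 positions end up masked, and the labels mark exactly those positions.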
## Training Details
| Parameter | Value |
|---|---|
| Base model | answerdotai/ModernBERT-large |
| Training data | Dolma3 Common Crawl (2M high-quality samples) |
| Training steps | 5000 |
| Batch size | 16 (effective) |
| Max sequence length | 8,192 tokens |
| Masking ratio | 15-80% (variable) |
| Hardware | H100 80GB |
## Evaluation
- Masked-token perplexity: 2.89 (measured at a 15% masking ratio)
- For reference, a well-trained MLM typically achieves a perplexity of 10-50
## Usage
### As a Masked Language Model
```python
from transformers import pipeline

model_id = "Ayushnangia/ModernBERT-Diffusion-Pretrained-20260119"
fill_mask = pipeline("fill-mask", model=model_id)

# Single-mask prediction
result = fill_mask("The capital of France is [MASK].")
print(result[0]["token_str"])  # Paris
```
### For Diffusion-Style Generation
For text generation via iterative unmasking, fine-tune on instruction data first.
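The decoding loop itself is simple to sketch: start from a fully masked sequence and, over several steps, commit the positions the model is most confident about while re-masking the rest. The checkpoint does not ship a generation loop, so everything below is a hypothetical illustration; `predict` stands in for a real forward pass and the schedule is one common choice, not the only one.

```python
import math

MASK = "[MASK]"

def iterative_unmask(tokens, predict, steps=4):
    """Sketch of diffusion-style decoding via progressive unmasking.

    `predict(tokens)` is a stand-in for a model forward pass: it must
    return {position: (token, confidence)} for each masked position.
    At each step, roughly 1/(steps - step) of the remaining masked
    positions are filled, most confident first.
    """
    tokens = list(tokens)
    for step in range(steps):
        masked_pos = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked_pos:
            break
        preds = predict(tokens)
        k = max(1, math.ceil(len(masked_pos) / (steps - step)))
        best = sorted(masked_pos, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = preds[i][0]  # commit the confident prediction
    return tokens

# Toy predictor: proposes the position index as the token, with
# confidence equal to the index (so later positions unmask first).
toy = lambda toks: {i: (str(i), i) for i, t in enumerate(toks) if t == MASK}
out = iterative_unmask([MASK] * 8, toy, steps=3)
```

With a real model, `predict` would run the fill-mask head and return the argmax token and its probability at each masked position.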
## Intended Use
This is a pretrained checkpoint intended as a foundation for:
- Instruction fine-tuning (SFT)
- Domain adaptation
- Research into diffusion language models
## Limitations
- Pretrained only; not optimized for instruction following
- Best results require fine-tuning on downstream tasks
- Context limited to 8,192 tokens
## Citation
```bibtex
@misc{modernbert-diffusion,
  author    = {Ayush Nangia},
  title     = {ModernBERT Diffusion Language Model},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Ayushnangia/ModernBERT-Diffusion-Pretrained-20260119}
}
```