# bd3lms-mdlm-lm1b-dep-stanza
MDLM (masked diffusion language model, `block_size=1`) checkpoints trained with the bd3lms repo on LM1B, using dependency-tree-based token reweighting. Dependency parses were produced with Stanza.
These are MDLM models (not BD3-LM). They correspond to runs configured with `algo=mdlm` plus the `masking.use_dep_depth=True` pathway added on top of the bd3lms codebase, which reweights the per-token masked-diffusion loss by a dependency-tree depth signal.
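To make the reweighting idea concrete, here is a minimal, hypothetical sketch of how a depth signal could be mixed into per-token loss weights. The function name, the softmax-style normalization, and the direction of `reverse` are assumptions for illustration, not the repo's actual implementation; only `mix_alpha`, `depth_temp`, and the normalize/reverse knobs come from the run configs described here.

```python
import numpy as np

def depth_loss_weights(depths, mix_alpha, reverse=True, temp=1.0):
    """Hypothetical sketch of depth-based loss reweighting.

    depths:    per-token dependency-tree depths (root = 0).
    mix_alpha: interpolates between uniform weights (alpha=0) and the
               depth-derived weights (alpha=1).
    reverse:   flips the signal so shallow/deep tokens swap emphasis
               (the exact direction used in the runs is a guess here).
    temp:      temperature, analogous to masking.depth_temp.
    """
    d = np.asarray(depths, dtype=np.float64)
    if reverse:
        d = d.max() - d
    # Softmax-style normalization, then rescale to mean 1 so the
    # overall loss magnitude is preserved (cf. masking.normalize_depth=True).
    w = np.exp(d / temp)
    w = w / w.sum() * len(d)
    uniform = np.ones_like(w)
    return (1 - mix_alpha) * uniform + mix_alpha * w

# alpha=0 reduces to uniform weights, matching the a0.0_* control runs
print(depth_loss_weights([0, 1, 1, 2], mix_alpha=0.0))  # all ones
```

Note that under this scheme the weights always average to 1, so varying `mix_alpha` changes only how loss mass is distributed across tokens, not its total.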
## What's in this repo
Each subdirectory contains one `best.ckpt` (Lightning `.ckpt`, ~2.2 GB). The directory name encodes the run config: `a{mix_alpha}_normalized_{depth_mode}_reverse{reverse}[_v2]`.
| Folder | mix_alpha | depth_mode | reverse | Notes |
|---|---|---|---|---|
| `a0.0_normalized_dependency_reverseTrue/` | 0.0 | dependency | True | α=0 (depth signal has no effect) |
| `a0.0_normalized_dependency_reverseTrue_v2/` | 0.0 | dependency | True | independent rerun (`_v2`) |
| `a0.125_normalized_dependency_reverseTrue/` | 0.125 | dependency | True | |
| `a0.125_normalized_dependency_reverseTrue_v2/` | 0.125 | dependency | True | independent rerun (`_v2`) |
| `a0.125_normalized_descendant_count_reverseFalse/` | 0.125 | descendant_count | False | alternate depth signal |
| `a0.1875_normalized_dependency_reverseTrue/` | 0.1875 | dependency | True | |
| `a0.25_normalized_dependency_reverseTrue/` | 0.25 | dependency | True | |
All runs share `masking.normalize_depth=True` and `masking.depth_temp=1.0`.
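Since the naming convention above is regular, the run config can be recovered from a folder name mechanically. A small sketch (the parser itself is not part of this repo, just an illustration of the convention):

```python
import re

def parse_run_name(name: str) -> dict:
    """Parse a folder name of the form
    a{mix_alpha}_normalized_{depth_mode}_reverse{reverse}[_v2]."""
    m = re.fullmatch(
        r"a(?P<alpha>[\d.]+)_normalized_(?P<mode>\w+?)"
        r"_reverse(?P<rev>True|False)(?P<v2>_v2)?",
        name.rstrip("/"),
    )
    if m is None:
        raise ValueError(f"unrecognized run name: {name}")
    return {
        "mix_alpha": float(m.group("alpha")),
        "depth_mode": m.group("mode"),
        "reverse": m.group("rev") == "True",
        "rerun": m.group("v2") is not None,
    }

print(parse_run_name("a0.125_normalized_dependency_reverseTrue_v2/"))
# → {'mix_alpha': 0.125, 'depth_mode': 'dependency', 'reverse': True, 'rerun': True}
```

The lazy `\w+?` for `depth_mode` lets the pattern handle both `dependency` and the underscore-containing `descendant_count`.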
## Training config (shared across all runs)
- Algorithm: `algo=mdlm` with `algo.mdlm_loss_scale=True`
- Model: `model=small` (bd3lms DiT, sequence length 128)
- Data: `lm1b-wrap`, wrapped to `model.length=128`
- Batching: `loader.global_batch_size=512`, per-device `batch_size=128`
- Hardware: 4× GPU, single-node DDP
- Steps: 1,000,000 (all seven runs completed)
- Dependency trees: precomputed per-token depth features from Stanza dependency parses over LM1B
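Back-of-the-envelope, the batching and step settings above imply the following token budget per run (simple arithmetic from the listed config, not a logged figure):

```python
# Token budget implied by the shared training config
global_batch_size = 512      # loader.global_batch_size
seq_len = 128                # model.length
steps = 1_000_000

tokens_per_step = global_batch_size * seq_len
total_tokens = tokens_per_step * steps
print(f"{tokens_per_step:,} tokens/step, {total_tokens:,} tokens total")
# → 65,536 tokens/step, 65,536,000,000 tokens total
```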
See the training script `scripts/train/train_lm1b_mdlm_dep_0p125_v2.sh` in the training repo for the exact CLI invocation (one script per α).
## Loading
These are Lightning checkpoints produced by `main.py` in the bd3lms repo. Load them with the matching config, e.g.:

```python
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(
    repo_id="JianYu03/bd3lms-mdlm-lm1b-dep-stanza",
    filename="a0.125_normalized_dependency_reverseTrue/best.ckpt",
)
# Then, inside the bd3lms repo, pass it to main.py as:
#   checkpointing.resume_ckpt_path=<ckpt>   (to resume training)
# or load it directly:
#   diffusion.Diffusion.load_from_checkpoint(ckpt, config=cfg)
```
## Citation

If you use these checkpoints, please cite the upstream BD3-LMs paper:

```bibtex
@inproceedings{arriola2025block,
  title     = {Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models},
  author    = {Arriola, Marianne and Gokaslan, Aaron and Chiu, Justin T. and Yang, Zhihan and Qi, Zhixuan and Han, Jiaqi and Sahoo, Subham S. and Kuleshov, Volodymyr},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025}
}
```