# bd3lms-mdlm-lm1b-dep-stanza
MDLM (masked diffusion language model, `block_size=1`) checkpoints trained with the bd3lms repo on LM1B, using dependency-tree-based token reweighting. Dependency parses were produced with Stanza.
These are MDLM models (not BD3-LM). They correspond to runs configured with `algo=mdlm` plus the `masking.use_dep_depth=True` pathway added on top of the bd3lms codebase, which reweights the per-token masked-diffusion loss by a dependency-tree depth signal.
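To make the reweighting idea concrete, here is a minimal, hypothetical sketch of how a depth signal could be mixed into per-token loss weights. The function name, the softmax-style normalization, and the direction of `reverse` are assumptions for illustration, not the repo's actual implementation; only `mix_alpha`, `depth_temp`, and the normalize/reverse knobs come from the run configs described here.

```python
import numpy as np

def depth_loss_weights(depths, mix_alpha, reverse=True, temp=1.0):
    """Hypothetical sketch of depth-based loss reweighting.

    depths:    per-token dependency-tree depths (root = 0).
    mix_alpha: interpolates between uniform weights (alpha=0) and the
               depth-derived weights (alpha=1).
    reverse:   flips the signal so shallow/deep tokens swap emphasis
               (the exact direction used in the runs is a guess here).
    temp:      temperature, analogous to masking.depth_temp.
    """
    d = np.asarray(depths, dtype=np.float64)
    if reverse:
        d = d.max() - d
    # Softmax-style normalization, then rescale to mean 1 so the
    # overall loss magnitude is preserved (cf. masking.normalize_depth=True).
    w = np.exp(d / temp)
    w = w / w.sum() * len(d)
    uniform = np.ones_like(w)
    return (1 - mix_alpha) * uniform + mix_alpha * w

# alpha=0 reduces to uniform weights, matching the a0.0_* control runs
print(depth_loss_weights([0, 1, 1, 2], mix_alpha=0.0))  # all ones
```

Note that under this scheme the weights always average to 1, so varying `mix_alpha` changes only how loss mass is distributed across tokens, not its total.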
## What's in this repo
Each subdirectory contains one `best.ckpt` (Lightning `.ckpt`, ~2.2 GB). The directory name encodes the run config: `a{mix_alpha}_normalized_{depth_mode}_reverse{reverse}[_v2]`.
| Folder | mix_alpha | depth_mode | reverse | Notes |
|---|---|---|---|---|
| `a0.0_normalized_dependency_reverseTrue/` | 0.0 | dependency | True | α=0 (depth signal has no effect) |
| `a0.0_normalized_dependency_reverseTrue_v2/` | 0.0 | dependency | True | independent rerun (`_v2`) |
| `a0.125_normalized_dependency_reverseTrue/` | 0.125 | dependency | True | |
| `a0.125_normalized_dependency_reverseTrue_v2/` | 0.125 | dependency | True | independent rerun (`_v2`) |
| `a0.125_normalized_descendant_count_reverseFalse/` | 0.125 | descendant_count | False | alternate depth signal |
| `a0.1875_normalized_dependency_reverseTrue/` | 0.1875 | dependency | True | |
| `a0.25_normalized_dependency_reverseTrue/` | 0.25 | dependency | True | |
All runs share `masking.normalize_depth=True` and `masking.depth_temp=1.0`.
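Since the naming convention above is regular, the run config can be recovered from a folder name mechanically. A small sketch (the parser itself is not part of this repo, just an illustration of the convention):

```python
import re

def parse_run_name(name: str) -> dict:
    """Parse a folder name of the form
    a{mix_alpha}_normalized_{depth_mode}_reverse{reverse}[_v2]."""
    m = re.fullmatch(
        r"a(?P<alpha>[\d.]+)_normalized_(?P<mode>\w+?)"
        r"_reverse(?P<rev>True|False)(?P<v2>_v2)?",
        name.rstrip("/"),
    )
    if m is None:
        raise ValueError(f"unrecognized run name: {name}")
    return {
        "mix_alpha": float(m.group("alpha")),
        "depth_mode": m.group("mode"),
        "reverse": m.group("rev") == "True",
        "rerun": m.group("v2") is not None,
    }

print(parse_run_name("a0.125_normalized_dependency_reverseTrue_v2/"))
# → {'mix_alpha': 0.125, 'depth_mode': 'dependency', 'reverse': True, 'rerun': True}
```

The lazy `\w+?` for `depth_mode` lets the pattern handle both `dependency` and the underscore-containing `descendant_count`.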
## Training config (shared across all runs)
- Algorithm: `algo=mdlm` with `algo.mdlm_loss_scale=True`
- Model: `model=small` (bd3lms DiT, sequence length 128)
- Data: `lm1b-wrap`, wrapped to `model.length=128`
- Batching: `loader.global_batch_size=512`, per-device `batch_size=128`
- Hardware: 4× GPU, single-node DDP
- Steps: 1,000,000 (all seven runs completed)
- Dependency trees: precomputed per-token depth features from Stanza dependency parses over LM1B
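Back-of-the-envelope, the batching and step settings above imply the following token budget per run (simple arithmetic from the listed config, not a logged figure):

```python
# Token budget implied by the shared training config
global_batch_size = 512      # loader.global_batch_size
seq_len = 128                # model.length
steps = 1_000_000

tokens_per_step = global_batch_size * seq_len
total_tokens = tokens_per_step * steps
print(f"{tokens_per_step:,} tokens/step, {total_tokens:,} tokens total")
# → 65,536 tokens/step, 65,536,000,000 tokens total
```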
See the training script `scripts/train/train_lm1b_mdlm_dep_0p125_v2.sh` in the training repo for the exact CLI invocation (one script per α).
## Loading
These are Lightning checkpoints produced by `main.py` in the bd3lms repo. Load them with the matching config, e.g.:

```python
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(
    repo_id="JianYu03/bd3lms-mdlm-lm1b-dep-stanza",
    filename="a0.125_normalized_dependency_reverseTrue/best.ckpt",
)
# Then, inside the bd3lms repo, pass it to main.py as:
#   checkpointing.resume_ckpt_path=<ckpt>   (to resume training)
# or load it directly:
#   diffusion.Diffusion.load_from_checkpoint(ckpt, config=cfg)
```
## Citation

If you use these checkpoints, please cite the upstream BD3-LMs paper:

```bibtex
@inproceedings{arriola2025block,
  title     = {Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models},
  author    = {Arriola, Marianne and Gokaslan, Aaron and Chiu, Justin T. and Yang, Zhihan and Qi, Zhixuan and Han, Jiaqi and Sahoo, Subham S. and Kuleshov, Volodymyr},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025}
}
```