bd3lms-mdlm-lm1b-dep-stanza

MDLM (masked diffusion language model, block_size=1) checkpoints trained from the bd3lms repo on LM1B with dependency-tree-based token reweighting. Dependency parses were produced with Stanza.

These are MDLM models (not BD3-LM). They correspond to runs configured with algo=mdlm and the masking.use_dep_depth=True pathway added on top of the bd3lms codebase, which reweights the per-token masked-diffusion loss by a dependency-tree depth signal.

What's in this repo

Each subdirectory contains one best.ckpt (Lightning .ckpt, ~2.2 GB). The directory name encodes the run config: a{mix_alpha}_normalized_{depth_mode}_reverse{reverse}[_v2].
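The naming scheme can be parsed mechanically. A minimal sketch (the regex and field names below are illustrative, not part of the repo):

```python
import re

# Illustrative parser for the run-directory naming scheme described above.
# Assumes names like: a0.125_normalized_dependency_reverseTrue_v2
PATTERN = re.compile(
    r"a(?P<mix_alpha>[0-9.]+)_normalized_(?P<depth_mode>.+?)"
    r"_reverse(?P<reverse>True|False)(?P<v2>_v2)?/?$"
)

def parse_run_dir(name: str) -> dict:
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"unrecognized run directory: {name}")
    return {
        "mix_alpha": float(m.group("mix_alpha")),
        "depth_mode": m.group("depth_mode"),
        "reverse": m.group("reverse") == "True",
        "rerun": m.group("v2") is not None,
    }
```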

| Folder | mix_alpha | depth_mode | reverse | Notes |
|---|---|---|---|---|
| a0.0_normalized_dependency_reverseTrue/ | 0.0 | dependency | True | α=0 (depth signal has no effect) |
| a0.0_normalized_dependency_reverseTrue_v2/ | 0.0 | dependency | True | independent rerun (_v2) |
| a0.125_normalized_dependency_reverseTrue/ | 0.125 | dependency | True | |
| a0.125_normalized_dependency_reverseTrue_v2/ | 0.125 | dependency | True | independent rerun (_v2) |
| a0.125_normalized_descendant_count_reverseFalse/ | 0.125 | descendant_count | False | alternate depth signal |
| a0.1875_normalized_dependency_reverseTrue/ | 0.1875 | dependency | True | |
| a0.25_normalized_dependency_reverseTrue/ | 0.25 | dependency | True | |

All runs share masking.normalize_depth=True and masking.depth_temp=1.0.
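To illustrate how these knobs could combine into a per-token loss weight, here is an assumption-laden sketch (the actual masking.use_dep_depth formula lives in the training repo and may differ): depth is exponentiated with a temperature, optionally reversed, normalized to mean 1, and mixed with uniform weights by mix_alpha.

```python
import numpy as np

def depth_loss_weights(depths, mix_alpha=0.125, depth_temp=1.0, reverse=True):
    """Hypothetical per-token loss weights from dependency depths.

    Sketch only -- not the repo's implementation.
    """
    d = np.asarray(depths, dtype=np.float64)
    if reverse:                    # reverse=True: shallow (near-root) tokens weigh more
        d = d.max() - d
    w = np.exp(d / depth_temp)     # temperature-scaled exponential of the depth signal
    w = w / w.sum() * len(d)       # normalize to mean 1, preserving the overall loss scale
    return (1.0 - mix_alpha) + mix_alpha * w  # mix with uniform weights

# mix_alpha=0.0 recovers uniform weighting, which is why the a0.0 runs act as a control.
```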

Training config (shared across all runs)

  • Algorithm: algo=mdlm with algo.mdlm_loss_scale=True
  • Model: model=small (bd3lms DiT, sequence length 128)
  • Data: lm1b-wrap, wrapped to model.length=128
  • Batching: loader.global_batch_size=512, per-device batch_size=128
  • Hardware: 4× GPU single-node DDP
  • Steps: 1,000,000 (all seven runs completed)
  • Dependency trees: precomputed per-token depth features from Stanza dependency parses over LM1B
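The per-token depth feature can be derived from Stanza's head indices (Stanza's word.head is 1-based, with 0 marking the root). A minimal, self-contained sketch of that derivation, independent of the actual preprocessing code:

```python
def dep_depths(heads):
    """Depth of each token in a dependency tree.

    heads[i] is the 1-based index of token i's head (0 = root),
    the convention used by Stanza's word.head. Root tokens get depth 0.
    """
    depths = [None] * len(heads)

    def depth(i):
        if depths[i] is None:
            h = heads[i]
            depths[i] = 0 if h == 0 else depth(h - 1) + 1
        return depths[i]

    return [depth(i) for i in range(len(heads))]

def descendant_counts(heads):
    """Alternate signal (cf. depth_mode=descendant_count): descendants per token."""
    counts = [0] * len(heads)
    for i in range(len(heads)):
        h = heads[i]
        while h != 0:              # walk up to the root, crediting each ancestor
            counts[h - 1] += 1
            h = heads[h - 1]
    return counts
```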

See the training scripts in the training repo (e.g. scripts/train/train_lm1b_mdlm_dep_0p125_v2.sh) for the exact CLI invocations (one script per α).

Loading

These are Lightning checkpoints produced by main.py in the bd3lms repo. Load them with the matching config, e.g.:

```python
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(
    repo_id="JianYu03/bd3lms-mdlm-lm1b-dep-stanza",
    filename="a0.125_normalized_dependency_reverseTrue/best.ckpt",
)

# Then, inside the bd3lms repo, pass it to main.py as:
#   checkpointing.resume_ckpt_path=<ckpt>  (to resume training)
# or load it directly via:
#   diffusion.Diffusion.load_from_checkpoint(ckpt, config=cfg)
```

Citation

If you use these checkpoints, please cite the upstream BD3-LMs paper:

@inproceedings{arriola2025block,
  title     = {Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models},
  author    = {Arriola, Marianne and Gokaslan, Aaron and Chiu, Justin T. and Yang, Zhihan and Qi, Zhixuan and Han, Jiaqi and Sahoo, Subham S. and Kuleshov, Volodymyr},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025}
}