bd3lms-bd3lms-lm1b-dep-stanza

BD3-LM (Block Discrete Denoising Diffusion Language Model, block_size=16) checkpoints trained with the bd3lms repo on LM1B with dependency-tree-based token reweighting. Dependency parses were produced with Stanza.

These runs are initialized from the companion MDLM pretrain at JianYu03/bd3lms-mdlm-lm1b-dep-stanza (pretrain step 850k) and then finetuned as BD3-LM for an additional 150k steps.

What's in this repo

Each subdirectory contains one best.ckpt (Lightning .ckpt, ~2.08 GB). The directory name encodes the run config: bd3lm_block{block_size}_a{mix_alpha}_normalized_{depth_mode}_reverse{reverse}.

| Folder | block_size | mix_alpha | depth_mode | reverse | Notes |
|---|---|---|---|---|---|
| bd3lm_block16_a0.0_normalized_dependency_reverseTrue/ | 16 | 0.0 | dependency | True | α=0 baseline (depth signal inactive) |
| bd3lm_block16_a0.0625_normalized_dependency_reverseTrue/ | 16 | 0.0625 | dependency | True | |
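Since the directory name encodes the run config, it can be decoded mechanically. A minimal sketch (the regex simply mirrors the naming pattern above; it is not a utility from the training repo):

```python
import re

# Mirrors: bd3lm_block{block_size}_a{mix_alpha}_normalized_{depth_mode}_reverse{reverse}
PATTERN = re.compile(
    r"bd3lm_block(?P<block_size>\d+)"
    r"_a(?P<mix_alpha>[\d.]+)"
    r"_normalized_(?P<depth_mode>\w+?)"
    r"_reverse(?P<reverse>True|False)"
)

def parse_run_dir(name: str) -> dict:
    """Decode a checkpoint directory name into its run config fields."""
    m = PATTERN.fullmatch(name.rstrip("/"))
    if m is None:
        raise ValueError(f"unrecognized run directory: {name!r}")
    cfg = m.groupdict()
    cfg["block_size"] = int(cfg["block_size"])
    cfg["mix_alpha"] = float(cfg["mix_alpha"])
    cfg["reverse"] = cfg["reverse"] == "True"
    return cfg

cfg = parse_run_dir("bd3lm_block16_a0.0625_normalized_dependency_reverseTrue")
# cfg["mix_alpha"] → 0.0625
```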

All runs share masking.normalize_depth=True and masking.depth_temp=1.0.
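The exact interaction of these flags is defined in the training repo. Purely as an illustration, here is a hypothetical depth-to-weight mapping consistent with the flag names (normalize to [0, 1], reverse so shallower tokens weigh more, softmax with temperature); the actual formula may differ:

```python
import math

def depth_weights(depths, temp=1.0, normalize=True, reverse=True):
    """Hypothetical token reweighting from dependency-tree depths.

    Assumptions (NOT taken from the training repo): depths are scaled
    to [0, 1], optionally reversed so tokens nearer the root get larger
    weights, then passed through a softmax with temperature `temp`.
    """
    if normalize:
        d_max = max(depths) or 1
        depths = [d / d_max for d in depths]
    if reverse:
        depths = [1.0 - d for d in depths]
    exps = [math.exp(d / temp) for d in depths]
    z = sum(exps)
    return [e / z for e in exps]

# With reverse=True, the root token (depth 0) gets the largest weight.
w = depth_weights([0, 1, 2, 2, 3])
```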

Training config (shared)

  • Algorithm: algo=bd3lm, block_size=16, training.resample=True
  • Pretrain init: MDLM checkpoint at step 850k with matching α/depth_mode (from JianYu03/bd3lms-mdlm-lm1b-dep-stanza)
  • Model: model=small (bd3lms DiT, model.length=128, attn_backend=sdpa)
  • Batching: loader.global_batch_size=512, per-device batch_size=64
  • Hardware: 4× GPU single-node DDP
  • Steps: 150,000 finetune steps
  • Clip search: algo.clip_search_widths=[0.5, 0.6, 0.7, 0.8, 0.9]
  • Dependency trees: precomputed per-token depth features from Stanza dependency parses over LM1B
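The batching numbers above imply two gradient-accumulation micro-batches per optimizer step, assuming the usual Lightning-style relation global = per-device × GPUs × accumulation:

```python
# Values from the shared training config above.
global_batch_size = 512
per_device_batch_size = 64
num_gpus = 4

accumulate_grad_batches = global_batch_size // (per_device_batch_size * num_gpus)
print(accumulate_grad_batches)  # 2
```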

See scripts/train/train_lm1b_bd3lm_dep.sh in the training repo for the exact CLI invocation.

Loading

```python
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(
    repo_id="JianYu03/bd3lms-bd3lms-lm1b-dep-stanza",
    filename="bd3lm_block16_a0.0625_normalized_dependency_reverseTrue/best.ckpt",
)

# Then, inside the bd3lms repo, pass it to main.py as:
#   checkpointing.resume_ckpt_path=<ckpt>
# or load via diffusion.Diffusion.load_from_checkpoint(ckpt, config=cfg)
```

Citation

@inproceedings{arriola2025block,
  title     = {Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models},
  author    = {Arriola, Marianne and Gokaslan, Aaron and Chiu, Justin T. and Yang, Zhihan and Qi, Zhixuan and Han, Jiaqi and Sahoo, Subham S. and Kuleshov, Volodymyr},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025}
}