# bd3lms-bd3lms-lm1b-dep-stanza

BD3-LM (Block Discrete Denoising Diffusion Language Model, `block_size=16`) checkpoints trained with the bd3lms repo on LM1B with dependency-tree-based token reweighting. Dependency parses were produced with Stanza.
These runs are initialized from the companion MDLM pretrain at
JianYu03/bd3lms-mdlm-lm1b-dep-stanza
(pretrain step 850k) and then finetuned as BD3-LM for 150k additional steps.
## What's in this repo
Each subdirectory contains one `best.ckpt` (Lightning `.ckpt`, ~2.08 GB). The
directory name encodes the run config:
`bd3lm_block{block_size}_a{mix_alpha}_normalized_{depth_mode}_reverse{reverse}`.
| Folder | block_size | mix_alpha | depth_mode | reverse | Notes |
|---|---|---|---|---|---|
| `bd3lm_block16_a0.0_normalized_dependency_reverseTrue/` | 16 | 0.0 | dependency | True | α=0 baseline (depth signal inactive) |
| `bd3lm_block16_a0.0625_normalized_dependency_reverseTrue/` | 16 | 0.0625 | dependency | True | |
All runs share `masking.normalize_depth=True` and `masking.depth_temp=1.0`.
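Since the folder names encode the run config mechanically, they can be parsed back into fields. The sketch below is illustrative only: the helper `parse_run_name` and its regex are assumptions derived from the documented naming pattern, not utilities shipped with the repo.

```python
import re

# Regex mirroring the documented folder-name pattern:
# bd3lm_block{block_size}_a{mix_alpha}_normalized_{depth_mode}_reverse{reverse}
NAME_RE = re.compile(
    r"bd3lm_block(?P<block_size>\d+)"
    r"_a(?P<mix_alpha>[0-9.]+)"
    r"_normalized_(?P<depth_mode>.+)"
    r"_reverse(?P<reverse>True|False)"
)

def parse_run_name(name: str) -> dict:
    """Hypothetical helper: recover config fields from a checkpoint folder name."""
    m = NAME_RE.fullmatch(name.rstrip("/"))
    if m is None:
        raise ValueError(f"unrecognized run name: {name!r}")
    return {
        "block_size": int(m.group("block_size")),
        "mix_alpha": float(m.group("mix_alpha")),
        "depth_mode": m.group("depth_mode"),
        "reverse": m.group("reverse") == "True",
    }

cfg = parse_run_name("bd3lm_block16_a0.0625_normalized_dependency_reverseTrue/")
print(cfg)
# {'block_size': 16, 'mix_alpha': 0.0625, 'depth_mode': 'dependency', 'reverse': True}
```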
## Training config (shared)
- Algorithm: `algo=bd3lm`, `block_size=16`, `training.resample=True`
- Pretrain init: MDLM checkpoint at step 850k with matching α/depth_mode (from JianYu03/bd3lms-mdlm-lm1b-dep-stanza)
- Model: `model=small` (bd3lms DiT, `model.length=128`, `attn_backend=sdpa`)
- Batching: `loader.global_batch_size=512`, per-device `batch_size=64`
- Hardware: 4× GPU, single-node DDP
- Steps: 150,000 finetune steps
- Clip search: `algo.clip_search_widths=[0.5, 0.6, 0.7, 0.8, 0.9]`
- Dependency trees: precomputed per-token depth features from Stanza dependency parses over LM1B
See `scripts/train/train_lm1b_bd3lm_dep.sh` in the training repo for the exact CLI invocation.
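The shared config above corresponds roughly to a Hydra-style override invocation. The command below is a sketch assembled from the options listed here, not the authoritative launch: the entrypoint name (`main.py`), the `trainer.max_steps` key, and the placeholder checkpoint path are assumptions, so defer to the training script above for the real command.

```shell
# Sketch of a BD3-LM finetune launch built from the shared config listed above.
# Hypothetical: exact override keys and entrypoint may differ from the repo's
# scripts/train/train_lm1b_bd3lm_dep.sh.
python main.py \
  algo=bd3lm \
  block_size=16 \
  training.resample=True \
  model=small \
  model.length=128 \
  loader.global_batch_size=512 \
  masking.normalize_depth=True \
  masking.depth_temp=1.0 \
  "algo.clip_search_widths=[0.5,0.6,0.7,0.8,0.9]" \
  checkpointing.resume_ckpt_path=/path/to/mdlm_pretrain_850k.ckpt  # assumed key reuse for init
```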
## Loading
```python
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(
    repo_id="JianYu03/bd3lms-bd3lms-lm1b-dep-stanza",
    filename="bd3lm_block16_a0.0625_normalized_dependency_reverseTrue/best.ckpt",
)
# Then, inside the bd3lms repo, pass it to main.py as:
#   checkpointing.resume_ckpt_path=<ckpt>
# or load via diffusion.Diffusion.load_from_checkpoint(ckpt, config=cfg)
```
## Citation
```bibtex
@inproceedings{arriola2025block,
  title     = {Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models},
  author    = {Arriola, Marianne and Gokaslan, Aaron and Chiu, Justin T. and Yang, Zhihan and Qi, Zhixuan and Han, Jiaqi and Sahoo, Subham S. and Kuleshov, Volodymyr},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025}
}
```