I-DLM-32B

Introspective Diffusion Language Model (32B) — a diffusion language model converted from Qwen3-32B that matches AR quality while enabling parallel token generation.

[Project Page] [Paper] [Code]

Highlights

  • Matches Qwen3-32B quality across 15 benchmarks (knowledge, math, code, instruction following)
  • Introspective Strided Decoding (ISD): single-pass generation + verification with p/q acceptance criterion
  • AR-compatible serving via SGLang (paged KV cache, continuous batching, CUDA graphs)
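The exact p/q acceptance criterion is defined in the paper; as a rough illustration only (not the paper's precise rule), a speculative-decoding-style acceptance test compares the verifier probability p of a drafted token against its draft probability q:

```python
import random

def accept_token(p_verify, q_draft, rng=random):
    """Speculative-style acceptance (illustrative sketch, not ISD's exact rule):
    keep the drafted token outright when the verifier assigns it at least the
    draft probability; otherwise keep it with probability p/q."""
    if q_draft <= 0.0:
        return True  # degenerate draft probability; nothing to reject against
    return p_verify >= q_draft or rng.random() < p_verify / q_draft
```

In ISD the draft and verification happen in a single forward pass, so a criterion of this shape decides, per position, whether the drafted token is kept or regenerated.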

Results

Quality (I-DLM-32B vs baselines)

Benchmark          I-DLM-32B   Qwen3-32B (AR)   LLaDA-2.1-flash (100B)
ARC-C                   97.0             96.8                    91.0
MMLU                    85.8             86.0                    72.4
MMLU-Pro                79.3             79.8                       -
GPQA-D                  68.2             68.7                    49.5
GSM8K                   97.3             97.3                    94.5
MATH-500                96.8             96.6                    82.4
AIME-24                 85.4             85.0                    46.7
AIME-25                 72.7             72.0                       -
MathBench               93.3             93.5                       -
HumanEval               95.7             95.7                    81.1
MBPP                    93.7             93.7                       -
LiveCodeBench-v6        57.2             57.7                    39.3
IFEval                  87.1             87.1                    83.0

Usage

Note: This checkpoint is hosted on HuggingFace for weight distribution only. For inference, use our SGLang-based ISD pipeline, which implements the Introspective Strided Decoding algorithm described in the paper. Direct loading via transformers is not currently supported for reproducing paper results.

Inference via SGLang (Recommended)

# Install
git clone https://github.com/Introspective-Diffusion/I-DLM.git
cd I-DLM/inference && bash install.sh

# Launch server
python -m sglang.launch_server \
    --model-path yifanyu/I-DLM-32B \
    --trust-remote-code --tp-size 1 --dtype bfloat16 \
    --mem-fraction-static 0.85 --max-running-requests 32 \
    --attention-backend flashinfer --dllm-algorithm IDLMBlockN \
    --dllm-algorithm-config inference/configs/idlm_blockN4_config.yaml \
    --port 30000

# Generate
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Prove sqrt(2) is irrational."}],"max_tokens":4096}'

See the inference README for detailed setup, evaluation, and benchmarking.

Method

I-DLM recovers introspective consistency (AR models' inherent self-agreement) through:

  1. Strict causal masking across both masked and clean tokens
  2. Logit shift (Dream shift): hidden state at position i predicts token i+1
  3. All-masked training: CE loss on both noisy and clean token positions

Training loss: L = CE_noisy + alpha * CE_clean
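A minimal NumPy sketch of this weighted objective, assuming per-position logits for one sequence and a known split into noisy (masked) and clean positions (the helper names and shapes here are illustrative, not the training code):

```python
import numpy as np

def cross_entropy(logits, targets, positions):
    """Mean token-level cross-entropy over the selected sequence positions."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -np.mean([logp[i, targets[i]] for i in positions])

def idlm_loss(logits, targets, noisy_pos, clean_pos, alpha=1.0):
    """L = CE_noisy + alpha * CE_clean: loss on both noisy and clean positions."""
    return (cross_entropy(logits, targets, noisy_pos)
            + alpha * cross_entropy(logits, targets, clean_pos))
```

With alpha = 0 this reduces to the standard masked-diffusion loss on noisy positions only; the clean-position term is what trains the model to re-predict (and hence verify) tokens it can already see.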

Related Models

Model           HuggingFace                  Description
I-DLM-8B        yifanyu/I-DLM-8B             Converted from Qwen3-8B
I-DLM-32B       yifanyu/I-DLM-32B            Converted from Qwen3-32B
I-DLM-8B-LoRA   yifanyu/I-DLM-8B-lora-r128   Gated LoRA adapter (rank=128) for lossless R-ISD

Citation

@article{yu2026introspective,
  title={Introspective Diffusion Language Models},
  author={Yu, Yifan and Jian, Yuqing and Wang, Junxiong and Zhou, Zhongzhu
          and Zhuang, Donglin and Fang, Xinyu and Yanamandra, Sri
          and Wu, Xiaoxia and Wu, Qingyang and Song, Shuaiwen Leon
          and Dao, Tri and Athiwaratkun, Ben and Zou, James
          and Lai, Fan and Xu, Chenfeng},
  journal={arXiv preprint arXiv:2604.11035},
  year={2026}
}