I-DLM-8B

Introspective Diffusion Language Model (8B) — a diffusion language model converted from Qwen3-8B that matches AR quality while enabling parallel token generation.

[Project Page] [Paper] [Code]

Highlights

  • First DLM to match same-scale AR quality across 15 benchmarks
  • Introspective Strided Decoding (ISD): single-pass generation + verification with p/q acceptance criterion
  • AR-compatible serving via SGLang (paged KV cache, continuous batching, CUDA graphs)
  • 2.9–4.1× higher throughput than prior DLMs at high concurrency
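The p/q acceptance criterion used by ISD is defined in the paper; as a rough illustration of the idea, here is the standard speculative-sampling-style acceptance rule it resembles. All names below are illustrative, not from the I-DLM codebase:

```python
import random

def accept_token(p_token: float, q_token: float, rng: random.Random) -> bool:
    """Keep a drafted token with probability min(1, p/q), where p is the
    verifier's probability for the token and q is the drafter's probability.
    (Illustrative sketch only; the exact I-DLM criterion is in the paper.)"""
    if q_token <= 0.0:
        return False
    return rng.random() < min(1.0, p_token / q_token)

# A token the verifier weights at least as highly as the drafter (p >= q)
# is always accepted, since min(1, p/q) = 1.
rng = random.Random(0)
assert accept_token(0.9, 0.3, rng)
```

Tokens the verifier assigns lower probability than the drafter are accepted only stochastically, which is what lets a single forward pass both propose and verify a stride of tokens.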

Results

Quality (I-DLM-8B vs baselines)

| Benchmark | I-DLM-8B | Qwen3-8B (AR) | LLaDA-2.1-mini (16B) | SDAR (8B) |
|---|---|---|---|---|
| ARC-C | 95.8 | 95.8 | 90.2 | 91.9 |
| MMLU | 82.4 | 83.5 | 74.5 | 78.6 |
| MMLU-Pro | 73.1 | 75.1 | 64.8 | 56.9 |
| GPQA-D | 55.6 | 58.9 | 46.0 | 40.2 |
| GPQA | 54.9 | 55.4 | 53.3 | --- |
| GSM8K | 95.0 | 96.0 | 89.0 | 91.7 |
| MATH-500 | 96.8 | 95.8 | 85.0 | 78.6 |
| MathBench | 89.1 | 93.1 | 84.2 | 76.9 |
| AIME-24 | 69.6 | 73.1 | 43.3 | 10.0 |
| AIME-25 | 60.8 | 65.4 | 43.3 | 10.0 |
| HumanEval | 93.3 | 95.1 | 86.0 | 78.7 |
| MBPP | 92.2 | 93.4 | 82.1 | 72.0 |
| LiveCodeBench-v6 | 45.7 | 50.3 | 30.4 | 16.6 |
| IFEval | 84.7 | 84.7 | 83.2 | 61.4 |

Usage

Note: This checkpoint is hosted on HuggingFace for weight distribution only. For inference, use our SGLang-based ISD pipeline, which implements the Introspective Strided Decoding algorithm described in the paper. Loading the model directly via transformers is not currently supported and will not reproduce the paper results.

Inference via SGLang (Recommended)

```shell
# Install
git clone https://github.com/Introspective-Diffusion/I-DLM.git
cd I-DLM/inference && bash install.sh
```

```shell
# Launch server
python -m sglang.launch_server \
    --model-path yifanyu/I-DLM-8B \
    --trust-remote-code --tp-size 1 --dtype bfloat16 \
    --mem-fraction-static 0.85 --max-running-requests 32 \
    --attention-backend flashinfer --dllm-algorithm IDLMBlockN \
    --dllm-algorithm-config inference/configs/idlm_blockN4_config.yaml \
    --port 30000
```

```shell
# Generate
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Prove sqrt(2) is irrational."}],"max_tokens":4096}'
```

See the inference README for detailed setup, evaluation, and benchmarking.

Method

I-DLM recovers introspective consistency (AR models' inherent self-agreement) through:

  1. Strict causal masking across both masked and clean tokens
  2. Logit shift (Dream shift): hidden state at position i predicts token i+1
  3. All-masked training with auto-balanced loss: CE loss on both noisy and clean token positions, dynamically balanced
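The shift and the balanced loss can be sketched as follows. This is an illustrative NumPy stand-in, not the official training code: in particular, the simple equal-weight averaging of the two group losses below substitutes for the paper's dynamic balancing.

```python
import numpy as np

def shifted_balanced_ce(logits, targets, noisy_mask):
    """Sketch of the training loss (illustrative only):
    - logit shift: scores at position i predict token i+1;
    - CE is computed on both noisy (masked) and clean positions, and the
      two group losses are averaged with equal weight as a simple stand-in
      for the paper's auto-balancing.
    logits: (T, V) pre-softmax scores; targets: (T,) token ids;
    noisy_mask: (T,) bool, True where the input token was masked."""
    # Dream shift: drop the last position's logits and the first target.
    shifted_logits = logits[:-1]      # position i ...
    shifted_targets = targets[1:]     # ... predicts token i+1
    shifted_mask = noisy_mask[1:]

    # Per-position cross-entropy via a numerically stable log-softmax.
    z = shifted_logits - shifted_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(shifted_targets)), shifted_targets]

    # Average the noisy-position and clean-position losses separately,
    # then combine so neither group dominates.
    noisy_loss = nll[shifted_mask].mean() if shifted_mask.any() else 0.0
    clean_loss = nll[~shifted_mask].mean() if (~shifted_mask).any() else 0.0
    return 0.5 * (noisy_loss + clean_loss)
```

Training on clean positions as well as noisy ones is what lets the same forward pass score already-generated tokens, which ISD exploits at decode time.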

Related Models

| Model | HuggingFace | Description |
|---|---|---|
| I-DLM-8B | yifanyu/I-DLM-8B | Converted from Qwen3-8B |
| I-DLM-32B | yifanyu/I-DLM-32B | Converted from Qwen3-32B |
| I-DLM-8B-LoRA | yifanyu/I-DLM-8B-lora-r128 | Gated LoRA adapter (rank=128) for lossless R-ISD |

Citation

```bibtex
@article{yu2026introspective,
  title={Introspective Diffusion Language Models},
  author={Yu, Yifan and Jian, Yuqing and Wang, Junxiong and Zhou, Zhongzhu
          and Zhuang, Donglin and Fang, Xinyu and Yanamandra, Sri
          and Wu, Xiaoxia and Wu, Qingyang and Song, Shuaiwen Leon
          and Dao, Tri and Athiwaratkun, Ben and Zou, James
          and Lai, Fan and Xu, Chenfeng},
  journal={arXiv preprint arXiv:2604.11035},
  year={2026}
}
```