MingxuChai committed on
Commit 2f94a59 · verified · 1 Parent(s): df54c0f

Overwrite PA-BDM model files

README.md CHANGED
@@ -3,70 +3,82 @@ license: apache-2.0
3
  tags:
4
  - diffusion
5
  - vision-language
6
- - document-recognition
7
  - qwen2.5-vl
8
- - block-diffusion
9
  pipeline_tag: image-text-to-text
10
  library_name: transformers
11
  ---
12
 
 
13
  <div align="center">
14
 
15
- <h1>PA-BDM: Prefix-Adaptive Block Diffusion for Efficient Document Recognition</h1>
 
 
16
 
17
- **_Efficient Document Recognition with Prefix-Adaptive Block Diffusion_**
18
 
19
- Mingxu Chai, Ziyu Shen, Chenyu Liu, Kaidi Zhang, Jiazheng Zhang, Dingwei Zhu, Zhiheng Xi, Ruoyu Chen, Jun Long, Jihua Kang, Tao Gui, Qi Zhang
20
 
21
- [![arXiv](https://img.shields.io/badge/arXiv-PA--BDM-b31b1b.svg)](https://arxiv.org/pdf/2605.16861)
22
- <a href="https://github.com/SII-sc22mc/PA-BDM"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a>
23
- <a href="https://huggingface.co/MingxuChai/PA-BDM"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" alt="Hugging Face"></a>
24
 
25
  </div>
26
 
27
  ## 📰 News
28
 
29
- - **[2026.05]** 🎉 We release **PA-BDM**, a prefix-adaptive block diffusion framework for efficient document recognition.
 
 
 
 
 
 
 
 
30
 
31
  ## 📄 Introduction
32
 
33
- Document recognition aims to convert document images containing text, formulas, tables, and complex layouts into structured machine-readable formats. While autoregressive vision-language models have achieved strong recognition quality, their sequential decoding process can be inefficient for long structured outputs. Block diffusion models provide a promising alternative by enabling semi-parallel generation and KV-cache reuse, but existing block diffusion approaches often rely on a fixed block granularity, which limits decoding flexibility and may introduce instability for structure-sensitive recognition tasks.
34
 
35
- **PA-BDM** addresses these limitations with a prefix-adaptive block diffusion framework. Instead of treating the block size as a fixed generation unit, PA-BDM uses it as a maximum candidate generation range and dynamically commits reliable prefixes during decoding. This design enables adaptive generation lengths, timely KV-cache reuse, and more stable recognition of structured document outputs.
36
 
37
  ## ✨ Highlights
38
 
39
- - **Prefix-Adaptive Decoding:** Dynamically commits reliable prefixes within each candidate block, allowing the effective decoding length to adapt to local prediction confidence.
40
 
41
- - **Efficient KV-cache Reuse:** Enables timely cache updates without waiting for an entire fixed block to be fully resolved.
42
 
43
- - **Structure-sensitive Document Recognition:** Designed for document recognition tasks involving text, formulas, tables, and structured outputs.
44
 
45
- - **Improved Efficiency-Accuracy Trade-off:** Achieves faster inference while maintaining strong recognition performance across document recognition benchmarks.
 
 
 
46
 
47
- ## 🚀 Usage
48
 
49
- Please refer to the repository for installation and inference instructions:
 
 
 
 
50
 
51
- - GitHub: https://github.com/SII-sc22mc/PA-BDM
52
- - Model: https://huggingface.co/MingxuChai/PA-BDM
53
- - Paper: https://arxiv.org/pdf/2605.16861
54
 
55
  ## ❤️ Acknowledgements
56
 
57
- This project builds upon prior work and open-source resources including Qwen2.5-VL, DiffusionVL, BD3LMs, and related diffusion language modeling frameworks. We thank the authors for their valuable contributions to the community.
58
 
59
  ## 📝 Citation
60
-
61
  If you find our work useful, please cite our paper:
62
-
63
- ```bibtex
64
- @misc{chai2026prefixadaptiveblockdiffusionefficient,
65
- title={Prefix-Adaptive Block Diffusion for Efficient Document Recognition},
66
- author={Mingxu Chai and Ziyu Shen and Chenyu Liu and Kaidi Zhang and Jiazheng Zhang and Dingwei Zhu and Zhiheng Xi and Ruoyu Chen and Jun Long and Jihua Kang and Tao Gui and Qi Zhang},
67
- year={2026},
68
- eprint={2605.16861},
69
  archivePrefix={arXiv},
70
  primaryClass={cs.CV},
71
- url={https://arxiv.org/abs/2605.16861},
72
- }
 
 
3
  tags:
4
  - diffusion
5
  - vision-language
 
6
  - qwen2.5-vl
 
7
  pipeline_tag: image-text-to-text
8
  library_name: transformers
9
  ---
10
 
11
+
12
  <div align="center">
13
 
14
+ <h1>DiffusionVL: Translating Any Autoregressive Models into <br> Diffusion Vision Language Models</h1>
15
+
16
+ **_SOTA dVLM Performance with <5% Data & 2.0× Inference Speedup!_**
17
 
18
+ [Lunbin Zeng](https://github.com/xiazhi1)<sup>1,\*</sup>, [Jingfeng Yao](https://github.com/JingfengYao)<sup>1,\*</sup>, [Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>, [Hongyuan Tao](https://github.com/Hongyuan-Tao)<sup>1</sup>, [Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>, [Xinggang Wang](https://xwcv.github.io)<sup>1, ✉️</sup>
19
 
20
+ <sup>1</sup>Huazhong University of Science and Technology
21
 
22
+ <sup>*</sup>equal contribution, <sup>✉️</sup>corresponding author, xgwang@hust.edu.cn
23
+
24
+ [![arXiv](https://img.shields.io/badge/arXiv-DiffusionVL-b31b1b.svg)](https://arxiv.org/abs/2512.15713) [![Hugging Face Paper](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Paper-red)](https://huggingface.co/papers/2512.15713) <a href="https://github.com/hustvl/DiffusionVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a> <a href="https://huggingface.co/collections/hustvl/diffusionvl"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face"></a>
25
 
26
  </div>
27
 
28
  ## 📰 News
29
 
30
+ - **[2025.12.25]** 🎄 We have completed our release plan ahead of schedule. **DiffusionVL is now fully open-sourced.** Merry Christmas to the community!
31
+ - **[2025.12.18]** 🎉 Our paper **DiffusionVL** is released on arXiv! We also release the DiffusionVL models translated from Qwen2.5-VL on Hugging Face.
32
+
33
+ ## 🚀 Release Plan
34
+ - [x] Release paper
35
+ - [x] Release DiffusionVL model weights (translated from AR-VLMs)
36
+ - [x] Release DiffusionVL model weights (translated from AR-LMs)
37
+ - [x] Release evaluation code
38
+ - [x] Release training code
39
 
40
  ## 📄 Introduction
41
 
42
+ The diffusion paradigm has emerged as a promising alternative to autoregressive (AR) models, offering the potential for efficient parallel decoding. However, existing diffusion vision language models (dVLMs) largely lag behind mainstream autoregressive vision language models in performance, primarily due to the capability limitations of their base diffusion language models.
43
 
44
+ DiffusionVL bridges this gap by answering a fundamental question: ***Can we directly translate any existing autoregressive models into powerful diffusion vision language models?*** We propose a diffusion finetuning framework that "translates" any pretrained AR model into a diffusion vision language model through a simple paradigm shift and modality shift. Unlike prior dVLMs, which are restricted to fixed generation lengths, DiffusionVL introduces a block decoding strategy that allows arbitrary-length generation and KV-cache reuse. With this integrated design, and despite training on less than 5% of the data required by previous methods, DiffusionVL translated from AR-VLMs achieves state-of-the-art performance among existing dVLMs and delivers a 2.0× inference speedup.
45
 
46
  ## ✨ Highlights
47
 
48
+ - **Universal Translation Framework:** Translate any AR model into a dVLM with a simple yet effective approach.
49
 
50
+ - **Superior Performance:** Achieve SOTA dVLM performance using <5% training data (738K vs 16.5M samples).
51
 
52
+ - **2.0× Faster Inference:** Block decoding strategy enables KV-cache reuse and a 2.0× speedup over previous dVLMs.
53
 
54
+ <div align="center">
55
+ <img src="https://github.com/hustvl/DiffusionVL/raw/main/assets/benchmark.png" alt="Benchmark Image" width="800">
56
+ <img src="https://github.com/hustvl/DiffusionVL/raw/main/assets/framework.png" alt="Framework" width="800">
57
+ </div>
58
 
59
+ ## 🚀 Get Started
60
 
61
+ | Document | Description |
62
+ | :--- | :--- |
63
+ | [Installation](https://github.com/hustvl/DiffusionVL/raw/main/docs/INSTALLATION.md) | Environment setup, data and model preparation |
64
+ | [Training & Evaluation](https://github.com/hustvl/DiffusionVL/raw/main/docs/TRAINING_EVALUATION.md) | Train and evaluate DiffusionVL models |
65
+ | [Inference](https://github.com/hustvl/DiffusionVL/raw/main/docs/INFERENCE.md) | Quick inference with pre-trained models |
66
 
 
 
 
67
 
68
  ## ❤️ Acknowledgements
69
 
70
+ This repo is mainly built on [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL), [LLaDA-V](https://github.com/ML-GSAI/LLaDA-V), [BD3LMs](https://github.com/kuleshov-group/bd3lms), [SDAR](https://github.com/JetAstra/SDAR), and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval). We thank the authors for their open-source contributions.
71
 
72
  ## 📝 Citation
 
73
  If you find our work useful, please cite our paper:
74
+ ```
75
+ @misc{zeng2025diffusionvltranslatingautoregressivemodels,
76
+ title={DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models},
77
+ author={Lunbin Zeng and Jingfeng Yao and Bencheng Liao and Hongyuan Tao and Wenyu Liu and Xinggang Wang},
78
+ year={2025},
79
+ eprint={2512.15713},
 
80
  archivePrefix={arXiv},
81
  primaryClass={cs.CV},
82
+ url={https://arxiv.org/abs/2512.15713},
83
+ }
84
+ ```
model-00001-of-00002.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d5ca0d4d50f7cf1555e8d12c1fcd6b0adaebbf78805443efe1ef47b290f7b56a
3
  size 4957560304
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:90d09331c2cf6cc632a2677ab85a7b262b428eb230db75b913c04f9cd8e3178b
3
  size 4957560304
model-00002-of-00002.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:03d244f5b9ae688bfdf7c050adb59fc83e98b8a2a51ee4ecda2ed74e4f33fcad
3
  size 3174117400
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f5be4c3a6e9f8e42699b1d6d27bb13fffc65532d0d1c37e3852aba99d60ba991
3
  size 3174117400
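
Note on the pointer changes above: both `.safetensors` entries are Git LFS pointer files, so the diff shows only a new `oid sha256:…` with an unchanged `size`. In other words, the shard contents were replaced by files of exactly the same byte length. A minimal sketch of how a downloaded shard could be checked against its pointer is given below. It assumes the standard Git LFS pointer layout (`version`, `oid sha256:<hex>`, `size <bytes>`); the helper names (`parse_lfs_pointer`, `sha256_of_file`, `matches_pointer`) are hypothetical and are not part of this repository.

```python
import hashlib
import os

# Pointer text for the new model-00001-of-00002.safetensors, copied from the diff above.
NEW_SHARD_1_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:90d09331c2cf6cc632a2677ab85a7b262b428eb230db75b913c04f9cd8e3178b
size 4957560304
"""


def parse_lfs_pointer(pointer_text):
    """Parse the 'key value' lines of a Git LFS pointer (hypothetical helper)."""
    fields = {}
    for line in pointer_text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so multi-gigabyte shards are never fully in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_pointer(path, pointer_text):
    """Return True if the file's byte size and SHA-256 digest match the pointer."""
    pointer = parse_lfs_pointer(pointer_text)
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported oid algorithm: {pointer['algo']}")
    # Cheap size check first; skip hashing when the length already differs.
    if os.path.getsize(path) != pointer["size"]:
        return False
    return sha256_of_file(path) == pointer["digest"]
```

For example, `matches_pointer("model-00001-of-00002.safetensors", NEW_SHARD_1_POINTER)` would report whether a local copy (the path here is an assumption) corresponds to the post-commit shard rather than the pre-commit one.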