Tags: Image-Text-to-Text, Transformers, Safetensors, diffusionvl_qwenvl, text-generation, diffusion, vision-language, document-recognition, qwen2.5-vl, block-diffusion, conversational, custom_code
Instructions to use MingxuChai/PA-BDM with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use MingxuChai/PA-BDM with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="MingxuChai/PA-BDM", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("MingxuChai/PA-BDM", trust_remote_code=True, dtype="auto")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use MingxuChai/PA-BDM with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "MingxuChai/PA-BDM"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "MingxuChai/PA-BDM",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
```
Use Docker
docker model run hf.co/MingxuChai/PA-BDM
- SGLang
How to use MingxuChai/PA-BDM with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "MingxuChai/PA-BDM" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "MingxuChai/PA-BDM",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "MingxuChai/PA-BDM" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using the same curl request as in the pip-based SGLang block (port 30000).
```
- Docker Model Runner
How to use MingxuChai/PA-BDM with Docker Model Runner:
docker model run hf.co/MingxuChai/PA-BDM
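The curl requests above can also be issued from Python. The following is a minimal sketch using only the standard library; it assumes an OpenAI-compatible server is already listening on `http://localhost:8000` (the vLLM default used above; SGLang uses port 30000). The helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of any library, and whether this checkpoint can actually be served this way is not confirmed here.

```python
import json
import urllib.request

MODEL_ID = "MingxuChai/PA-BDM"


def build_chat_request(prompt, image_url, model=MODEL_ID):
    """Build the JSON body for POST /v1/chat/completions (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send_chat_request(body, base_url="http://localhost:8000"):
    """POST the body to a running server and return the parsed JSON reply.

    Assumes the server was started as shown above; raises URLError otherwise.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    body = build_chat_request(
        "Describe this image in one sentence.",
        "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
    )
    # Only print the payload here; call send_chat_request(body) once a server is up.
    print(json.dumps(body, indent=2))
```

Keeping payload construction separate from the network call makes the request body easy to inspect or reuse against either server by changing only `base_url`.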
Update README.md
README.md
@@ -11,15 +11,8 @@ library_name: transformers
 
 <div align="center">
 
-<
+<h1>Prefix-Adaptive Block Diffusion for Efficient Document Recognition</h1>
 
-**_SOTA dVLM Performance with <5% Data & 2.0× Inference Speedup!_**
-
-[Lunbin Zeng](https://github.com/xiazhi1)<sup>1,\*</sup>, [Jingfeng Yao](https://github.com/JingfengYao)<sup>1,\*</sup>, [Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>, [Hongyuan Tao](https://github.com/Hongyuan-Tao)<sup>1</sup>, [Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>, [Xinggang Wang](https://xwcv.github.io)<sup>1, ✉️</sup>
-
-<sup>1</sup>Huazhong University of Science and Technology
-
-<sup>*</sup>equal contribution, <sup>✉️</sup>corresponding author, xgwang@hust.edu.cn
 
 [](https://arxiv.org/abs/2512.15713) [](https://huggingface.co/papers/2512.15713) <a href="https://github.com/hustvl/DiffusionVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a> <a href="https://huggingface.co/collections/hustvl/diffusionvl"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face"></a>
 
@@ -27,34 +20,16 @@ library_name: transformers
 
 ## 📰 News
 
-- **[
-- **[
+- **[2026.05.16]** We have completed our release plan ahead of schedule.
+- **[2026.05.16]** Our paper **PA-BDM** is released on arXiv! We also release the PA-BDM models translated from DiffusionVL on Hugging Face.
 
 ## Release Plan
 - [x] Release paper
 - [x] Release DiffusionVL model weights (translated from AR-VLMs)
-- [x] Release DiffusionVL model weights (translated from AR-LMs)
 - [x] Release evaluation code
 - [x] Release training code
 
-## Introduction
-
-The diffusion paradigm has emerged as a promising alternative to autoregressive (AR) models, offering the potential for efficient parallel decoding. However, existing diffusion vision language models (dVLMs) largely lag behind mainstream autoregressive vision language models in performance, primarily due to the capability limitations of their base diffusion language models.
-
-DiffusionVL bridges this gap by answering a fundamental question: ***Can we directly translate any existing autoregressive models into powerful diffusion vision language models?*** We propose a diffusion finetuning framework that "translates" any pretrained AR model into a diffusion vision language model through a simple paradigm shift and modality shift. Unlike prior dVLMs restricted by fixed generation lengths, DiffusionVL introduces a novel block decoding strategy. This allows for arbitrary-length generation and KV-cache reuse. With this integrated design, despite training with less than 5% of the training data required by previous methods, DiffusionVL translated from AR-VLMs achieves a state-of-the-art performance among existing dVLMs and delivers a 2.0× inference speedup.
-
-## ✨ Highlights
 
-- **Universal Translation Framework:** Translate any AR models into dVLMs with a simple yet effective approach.
-
-- **Superior Performance:** Achieve SOTA dVLM performance using <5% training data (738K vs 16.5M samples).
-
-- **2.0× Faster Inference:** Block decoding strategy enables KV-cache reuse and 2.0× speedup over previous dVLMs.
-
-<div align="center">
-<img src="https://github.com/hustvl/DiffusionVL/raw/main/assets/benchmark.png" alt="Benchmark Image" width="800">
-<img src="https://github.com/hustvl/DiffusionVL/raw/main/assets/framework.png" alt="Framework" width="800">
-</div>
 
 ## Get Started
 
@@ -67,7 +42,7 @@ DiffusionVL bridges this gap by answering a fundamental question: ***Can we dire
 
 ## ❤️ Acknowledgements
 
-This repo is mainly built on [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL), [LLaDA-V](https://github.com/ML-GSAI/LLaDA-V), [BD3LMs](https://github.com/kuleshov-group/bd3lms) and [
+This repo is mainly built on [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL), [LLaDA-V](https://github.com/ML-GSAI/LLaDA-V), [BD3LMs](https://github.com/kuleshov-group/bd3lms) and [DiffusionVL](https://github.com/hustvl/DiffusionVL). We thank the authors for their open-source contributions.
 
 ## Citation
 If you find our work useful, please cite our paper: