Files changed (1)
  1. README.md +229 -0
README.md ADDED
@@ -0,0 +1,229 @@
+ ---
+ license: mit
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ language:
+ - en
+ - zh
+ tags:
+ - Bard-VL
+ - VLM
+ - vision-language
+ - multimodal
+ - discrete-diffusion
+ - masked-decoding
+ - custom_code
+ metrics:
+ - accuracy
+ ---
+
+ <h1 align="center">BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation</h1>
+
+ <p align="center">
+ <a href="https://github.com/cbyzju">Baoyou Chen</a><sup>1,3</sup> ·
+ <a href="https://github.com/1ring2rta">Hanchen Xia</a><sup>1</sup> ·
+ <a href="https://github.com/yhpengtu-rgb">Peng Tu</a><sup>1</sup> ·
+ <a href="https://github.com/Theseus-427">Haojun Shi</a><sup>1</sup> ·
+ <a href="https://github.com/AricGamma">Liwei Zhang</a><sup>1</sup> ·
+ <a href="https://github.com/weihaosky">Weihao Yuan</a><sup>4</sup> ·
+ <a href="https://sites.google.com/site/zhusiyucs/home">Siyu Zhu</a><sup>1,2,3,†</sup>
+ </p>
+
+ <p align="center">
+ <sup>1</sup>Shanghai Academy of AI for Science
+ &nbsp;&nbsp;·&nbsp;&nbsp;
+ <sup>2</sup>Shanghai Innovation Institute
+ &nbsp;&nbsp;·&nbsp;&nbsp;
+ <sup>3</sup>Fudan University
+ &nbsp;&nbsp;·&nbsp;&nbsp;
+ <sup>4</sup>Nanjing University
+ </p>
+
+ <p align="center">
+ 🤗 <a href="https://huggingface.co/fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct">Model</a>
+ &nbsp;&nbsp;|&nbsp;&nbsp;
+ 🏠 <a href="https://fudan-generative-vision.github.io/Bard-VL">Project Page</a>
+ &nbsp;&nbsp;|&nbsp;&nbsp;
+ 📑 <a href="https://huggingface.co/papers/2604.16514">Paper</a>
+ &nbsp;&nbsp;|&nbsp;&nbsp;
+ ✨ <a href="https://github.com/fudan-generative-vision/Bard-VL">Code</a>
+ </p>
+
+ # Bard-VL-B16-Mask-4B-Distil-Instruct
+
+ **Bard-VL-B16-Mask-4B-Distil-Instruct** is a 4B-class vision-language instruction model with **masked discrete-diffusion decoding**.
+
+ It is part of the **Bard-VL** family, which bridges autoregressive and diffusion-style vision-language models through **Progressive Block Merging (PBM)** and **Stage-Wise Distillation (SWD)**.
+
+ Compared with a standard autoregressive VLM, Bard-VL emphasizes:
+
+ - **parallel block-wise decoding instead of token-by-token generation**
+ - **controllable response generation through block-wise denoising**
+
+ ---
+
+ ## ✨ Highlights
+
+ - **Progressive Block Merging**: Bard-VL increases the decoding block size progressively instead of jumping directly from autoregressive decoding to large-block diffusion.
+ - **Stage-Wise dVLM Distillation**: Bard-VL distills from a small-block diffusion anchor in the same denoising regime, reducing autoregressive-to-diffusion transfer mismatch.
+ - **Packed Multimodal Attention Mask**: the packed attention layout reuses shared multimodal context across clean and noisy branches to reduce redundant computation.
+ - **Mixed-Noise Training**: Bard-VL combines masked-token and uniform-token corruption to support both token completion and visible-token revision.
+
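The mixed-noise idea in the last highlight can be illustrated with a toy corruption function. This is a minimal sketch under stated assumptions, not the Bard-VL training code: the helper name `corrupt_tokens`, the `MASK_ID` sentinel, the vocabulary size, and the two probabilities are all hypothetical and chosen only for illustration.

```python
import random

MASK_ID = -1       # hypothetical mask-token id (assumption, not Bard-VL's real id)
VOCAB_SIZE = 100   # hypothetical vocabulary size (assumption)


def corrupt_tokens(tokens, mask_prob, uniform_prob, rng):
    """Corrupt a token sequence with a mix of mask noise and uniform noise.

    Each token is independently replaced by MASK_ID with probability
    `mask_prob`, replaced by a uniformly random vocabulary id with
    probability `uniform_prob`, and kept unchanged otherwise.
    """
    if mask_prob < 0 or uniform_prob < 0 or mask_prob + uniform_prob > 1:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    noisy = []
    for tok in tokens:
        u = rng.random()  # uniform draw in [0, 1)
        if u < mask_prob:
            noisy.append(MASK_ID)                    # position to be completed
        elif u < mask_prob + uniform_prob:
            noisy.append(rng.randrange(VOCAB_SIZE))  # visible but possibly wrong token
        else:
            noisy.append(tok)                        # clean token
    return noisy


rng = random.Random(0)
print(corrupt_tokens([5, 17, 42, 8, 23, 61], mask_prob=0.4, uniform_prob=0.2, rng=rng))
```

In this toy setup, masked positions correspond to token completion, while uniformly replaced positions look like ordinary tokens and so correspond to visible-token revision.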
+ ---
+
+ ## 🧭 Method Structure
+
+ <p align="center">
+ <img src="./model.PNG" alt="Bard-VL method overview" width="100%">
+ </p>
+
+ <p align="center">
+ <em>Pipeline, block-wise attention mask, and mixed-noise scheduler used by Bard-VL.</em>
+ </p>
+
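A common way to express block-wise decoding is a block-causal attention mask, where a position may attend to every position in its own block and in all earlier blocks. The sketch below builds such a mask in plain Python. It illustrates the general technique only and is an assumption about the basic pattern: the packed multimodal layout with clean and noisy branches shown in the figure is more involved, and `block_causal_mask` is a hypothetical helper.

```python
def block_causal_mask(seq_len, block_size):
    """Build a seq_len x seq_len boolean mask.

    mask[q][k] is True when query position q may attend to key position k,
    i.e. when k lies in the same block as q or in an earlier block.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        [(k // block_size) <= (q // block_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

With `block_size=1` the mask reduces to the usual causal (lower-triangular) mask, which is one way to see block-wise decoding as a generalization of token-by-token decoding.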
+ ---
+
+ ## 📊 Evaluation Results
+
+ ### AutoRegressive Vision-Language Models
+
+ | Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
+ |---|---:|---:|---:|---:|---:|---:|---:|---:|
+ | Qwen3-VL | 4B | 47.9 | 35.0 | 2297 | 70.5 | 56.9 | 81.0 | 80.9 |
+ | Qwen3-VL | 8B | 53.0 | 36.0 | 2379 | 69.5 | 59.9 | 83.5 | 84.0 |
+ | InternVL3.5 | 4B | 57.4 | 38.2 | 2236 | 66.7 | 65.6 | 80.6 | 86.2 |
+ | InternVL3.5 | 8B | 57.2 | 41.0 | 2359 | 63.1 | 66.3 | 82.1 | 87.0 |
+
+ ### Diffusion Vision-Language Models
+
+ | Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
+ |---|---:|---:|---:|---:|---:|---:|---:|---:|
+ | LLaDA-V | 8B | 48.8 | 35.4 | 1998 | 63.4 | 60.4 | 77.8 | 78.2 |
+ | Dream-VL | 7B | 51.6 | 25.0 | 2179 | 67.7 | 59.9 | 80.4 | 86.2 |
+ | LaViDa | 8B | 44.2 | 28.6 | 1711 | 40.3 | 47.0 | 70.1 | 64.6 |
+ | SDAR-VL | 8B | 44.0 | 28.2 | 2142 | 66.1 | 53.3 | 79.6 | 82.4 |
+ | MMaDA | 8B | 30.2 | 21.5 | 1287 | 28.2 | 25.7 | 54.9 | 43.2 |
+ | Dimple-VL | 7B | 46.4 | 24.1 | 1924 | 51.9 | 47.7 | 74.2 | 58.4 |
+
+ ### Bard-VL Converted from Qwen3-VL
+
+ | Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
+ |---|---:|---:|---:|---:|---:|---:|---:|---:|
+ | Bard-VL (*B* = 32) | 2B | 42.0 | 27.9 | 2045 | 64.6 | 53.1 | 72.6 | 76.8 |
+ | Bard-VL (*B* = 32) | 4B | 53.0 | 34.2 | 2305 | 71.9 | 63.6 | 82.8 | 80.2 |
+ | Bard-VL (*B* = 32) | 8B | 54.6 | 37.6 | 2393 | 70.7 | 65.0 | 83.2 | 84.6 |
+
+ ---
+
+ ## 🛠️ Environment
+
+ Make sure your environment matches the repository's `requirements.txt`:
+
+ ```text
+ python>=3.10
+ torch==2.8.0
+ torchvision==0.23.0
+ transformers==4.57.3
+ diffusers==0.36.0
+ accelerate==1.12.0
+ deepspeed==0.17.0
+ ```
+
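A quick way to compare an installed environment against these pins is to query package metadata with the standard library. The sketch below is only a convenience check, not part of the repository: `check_pins` and the `PINNED` dictionary are hypothetical names, and the pins are copied from the list above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the requirements list above.
PINNED = {
    "torch": "2.8.0",
    "torchvision": "0.23.0",
    "transformers": "4.57.3",
    "diffusers": "0.36.0",
    "accelerate": "1.12.0",
    "deepspeed": "0.17.0",
}


def check_pins(pinned, get_version=version):
    """Return {package: (expected, found)} for each mismatched or missing package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed
        # Ignore local build tags such as "2.8.0+cu128" when comparing.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the expected version. The Python version itself can be checked separately with `sys.version_info >= (3, 10)`.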
+ Recommended runtime settings from the repository:
+
+ ```text
+ dtype = bfloat16
+ attn_implementation = sdpa
+ block_size = 16
+ denoising_steps = 16
+ ```
+
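To get a feel for what these settings imply, the arithmetic below counts decoding blocks and an upper bound on denoising passes. It assumes that each block takes at most `denoising_steps` passes and that each step costs one forward pass, and it borrows `max_new_tokens = 1024` from the inference example in this README. Both assumptions are illustrative, and `decoding_budget` is a hypothetical helper rather than repository code.

```python
import math


def decoding_budget(max_new_tokens, block_size, denoising_steps):
    """Return (number of blocks, upper bound on denoising passes).

    Assumes one forward pass per denoising step and at most
    `denoising_steps` steps per block (an illustrative assumption).
    """
    num_blocks = math.ceil(max_new_tokens / block_size)
    max_passes = num_blocks * denoising_steps
    return num_blocks, max_passes


blocks, passes = decoding_budget(max_new_tokens=1024, block_size=16, denoising_steps=16)
print(blocks, passes)  # 64 1024
```

Under these assumptions, `block_size == denoising_steps` means the worst case is one pass per generated token; the actual number of passes is lower whenever a block is completed in fewer steps.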
+ ---
+
+ ## 🚀 Inference Example
+
+ The official inference flow is implemented in the repository's `inference.py`. A minimal image-understanding example aligned with that script is shown below.
+
+ ```python
+ import torch
+ from transformers import AutoProcessor
+
+ from qwen_vl_utils import process_vision_info
+ from nemo_automodel.components.models.bard_vl import BardVLForConditionalGeneration
+
+ model_id = "fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct"
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # Load the model with the recommended dtype and attention implementation.
+ model = BardVLForConditionalGeneration.from_pretrained(
+     model_id,
+     dtype=torch.bfloat16,
+     _attn_implementation="sdpa",
+ ).to(device).eval()
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ messages = [
+     {
+         "role": "system",
+         "content": "You are a helpful assistant.",
+     },
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": "assets/puzzle.jpg", "min_pixels": 256 * 256, "max_pixels": 2048 * 2048},
+             {"type": "text", "text": "Please describe this image."},
+         ],
+     },
+ ]
+
+ # Render the chat template to a prompt string; tokenization happens in the processor call.
+ text = processor.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+
+ # Collect the image and video inputs referenced in the messages.
+ image_inputs, video_inputs, video_kwargs = process_vision_info(
+     messages,
+     return_video_kwargs=True,
+     return_video_metadata=False,
+     image_patch_size=processor.image_processor.patch_size,
+ )
+
+ batch = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=False,
+     return_tensors="pt",
+     **video_kwargs,
+ ).to(device)
+
+ # Generate with the recommended block size and number of denoising steps.
+ response_ids = model.generate(
+     batch,
+     max_new_tokens=1024,
+     block_size=16,
+     denoising_steps=16,
+     temperature=0.0,
+     top_k=0,
+     top_p=1.0,
+     remasking_strategy="low_confidence_dynamic",
+     confidence_threshold=0.5,
+     return_step_stats=False,
+ )
+
+ # Decode the generated token ids to text, dropping special tokens.
+ print(processor.tokenizer.batch_decode(response_ids, skip_special_tokens=True)[0].strip())
+ ```
+
+ For video understanding, replace the image message with the video example in `inference.py`.
+
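The `remasking_strategy="low_confidence_dynamic"` and `confidence_threshold=0.5` arguments in the image example suggest confidence-based unmasking. The toy step below shows one plausible reading: commit every masked position whose prediction confidence reaches the threshold, and keep the rest masked for the next step. This is an assumption about the general technique, not the repository's implementation; `denoise_step`, `MASK`, and the example predictions are hypothetical.

```python
MASK = None  # hypothetical placeholder for a still-masked position


def denoise_step(block, predictions, threshold):
    """Commit confident predictions for the masked positions of one block.

    `block` is a list of token ids with MASK at undecided positions.
    `predictions` maps position -> (token_id, confidence).
    Returns a new block; low-confidence positions stay masked.
    """
    masked = [i for i, tok in enumerate(block) if tok is MASK]
    if not masked:
        return list(block)  # nothing left to decode
    accepted = [i for i in masked if predictions[i][1] >= threshold]
    if not accepted:
        # Commit the single most confident position so every step makes progress.
        accepted = [max(masked, key=lambda i: predictions[i][1])]
    new_block = list(block)
    for i in accepted:
        new_block[i] = predictions[i][0]
    return new_block


block = [MASK, MASK, MASK, MASK]
preds = {0: (11, 0.9), 1: (12, 0.4), 2: (13, 0.7), 3: (14, 0.2)}
print(denoise_step(block, preds, threshold=0.5))  # [11, None, 13, None]
```

In this reading, a higher threshold commits fewer tokens per step and a lower threshold commits more, which trades decoding speed against the risk of committing uncertain tokens.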
+ ---
+
+ ## 📚 Citation
+
+ ```bibtex
+ @article{chen2026bard,
+   title={BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation},
+   author={Baoyou Chen and Hanchen Xia and Peng Tu and Haojun Shi and Liwei Zhang and Weihao Yuan and Siyu Zhu},
+   journal={arXiv preprint arXiv:2604.16514},
+   year={2026}
+ }
+ ```