BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
Baoyou Chen1,3 · Hanchen Xia1 · Peng Tu1 · Haojun Shi1 · Liwei Zhang1 · Weihao Yuan4 · Siyu Zhu1,2,3,†
1Shanghai Academy of AI for Science · 2Shanghai Innovation Institute · 3Fudan University · 4Nanjing University
🤗 Model | 🏠 Project Page | 📑 Paper | ✨ Code
Bard-VL-B4-Mask-8B-Instruct
Bard-VL-B4-Mask-8B-Instruct is an 8B-class vision-language instruction model with masked discrete-diffusion decoding.
It is part of the Bard-VL family and is designed to bridge autoregressive and diffusion-style vision-language models through Progressive Block Merging (PBM) and Stage-Wise Distillation (SWD).
Compared with standard autoregressive VLMs, Bard-VL emphasizes:
- parallel block-wise decoding instead of token-by-token generation
- controllable response generation through blockwise denoising
✨ Highlights
- Progressive Block Merging: Bard-VL increases the decoding block size progressively instead of jumping directly from autoregressive decoding to large-block diffusion.
- Stage-Wise dVLM Distillation: Bard-VL distills from a small-block diffusion anchor in the same denoising regime, reducing autoregressive-to-diffusion transfer mismatch.
- Packed Multimodal Attention Mask: the packed attention layout reuses shared multimodal context across clean and noisy branches to reduce redundant computation.
- Mixed-Noise Training: Bard-VL combines masked-token and uniform token corruption to support both token completion and visible-token revision.
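The mixed-noise idea in the last highlight can be illustrated with a small, self-contained sketch. It is not the training code from the repository: the function name `mixed_noise_corrupt` and the parameters `noise_level` and `uniform_fraction` are hypothetical, and the actual noise schedule used by Bard-VL is not described in this card. The sketch only shows how a single corruption pass might mix mask tokens with uniformly sampled replacement tokens.

```python
import random


def mixed_noise_corrupt(tokens, mask_id, vocab_size, noise_level,
                        uniform_fraction, rng=None):
    """Corrupt a token sequence with a mix of masking and uniform replacement.

    Hypothetical illustration only. Each token is corrupted with probability
    `noise_level`. A corrupted token becomes a random vocabulary token with
    probability `uniform_fraction`, otherwise it becomes `mask_id`.
    Returns (corrupted_tokens, was_corrupted_flags).
    """
    rng = rng or random.Random()
    corrupted, flags = [], []
    for tok in tokens:
        if rng.random() < noise_level:
            if rng.random() < uniform_fraction:
                # Uniform corruption: the position stays "visible" but may be
                # wrong, so the model has to learn to revise it. The sampled
                # token can coincide with the original one by chance.
                new_tok = rng.randrange(vocab_size)
                while new_tok == mask_id:
                    new_tok = rng.randrange(vocab_size)
            else:
                # Masked corruption: the model has to fill in the position.
                new_tok = mask_id
            corrupted.append(new_tok)
            flags.append(True)
        else:
            corrupted.append(tok)
            flags.append(False)
    return corrupted, flags


if __name__ == "__main__":
    noisy, flags = mixed_noise_corrupt(
        [11, 12, 13, 14, 15, 16], mask_id=0, vocab_size=100,
        noise_level=0.5, uniform_fraction=0.3, rng=random.Random(0),
    )
    print(noisy, flags)
```

Setting `uniform_fraction=0.0` reduces this to plain masked corruption, while `uniform_fraction=1.0` leaves no mask tokens at all, which is the regime that exercises visible-token revision.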
🧭 Method Structure
Pipeline, block-wise attention mask, and mixed-noise scheduler used by Bard-VL.
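As a rough illustration of a block-wise attention mask, the sketch below builds the generic block-causal pattern commonly used by block-diffusion decoders: positions attend bidirectionally inside their own block and causally to all earlier blocks. This is an assumption about the general shape of such a mask, not Bard-VL's packed layout, which additionally shares multimodal context between clean and noisy branches. The function names are hypothetical.

```python
def block_causal_mask(seq_len, block_size):
    """Return mask[i][j] == True when query i may attend to key j.

    Generic block-causal pattern (illustrative assumption): full attention
    inside a block, causal attention across blocks.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Draw the mask with 'x' for allowed and '.' for blocked attention."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    # Growing the block size relaxes the mask step by step, from a purely
    # causal (autoregressive) mask towards full bidirectional attention.
    for block_size in (1, 2, 4):
        print(f"block_size={block_size}")
        print(render(block_causal_mask(4, block_size)))
```

With `block_size=1` the mask is the ordinary lower-triangular causal mask, which is one way to see why increasing the block size gradually is a smaller change than jumping straight to large blocks.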
📊 Evaluation Results
AutoRegressive Vision-Language Models
| Model | Parameters | MMMU (val) | MMMU-Pro (standard) | MME (sum) | RealWorldQA | MMStar | AI2D | ChartQA |
|---|---|---|---|---|---|---|---|---|
| Qwen3-VL | 4B | 47.9 | 35.0 | 2297 | 70.5 | 56.9 | 81.0 | 80.9 |
| Qwen3-VL | 8B | 53.0 | 36.0 | 2379 | 69.5 | 59.9 | 83.5 | 84.0 |
| InternVL3.5 | 4B | 57.4 | 38.2 | 2236 | 66.7 | 65.6 | 80.6 | 86.2 |
| InternVL3.5 | 8B | 57.2 | 41.0 | 2359 | 63.1 | 66.3 | 82.1 | 87.0 |
Diffusion Vision-Language Models
| Model | Parameters | MMMU (val) | MMMU-Pro (standard) | MME (sum) | RealWorldQA | MMStar | AI2D | ChartQA |
|---|---|---|---|---|---|---|---|---|
| LLaDA-V | 8B | 48.8 | 35.4 | 1998 | 63.4 | 60.4 | 77.8 | 78.2 |
| Dream-VL | 7B | 51.6 | 25.0 | 2179 | 67.7 | 59.9 | 80.4 | 86.2 |
| LaviDa | 8B | 44.2 | 28.6 | 1711 | 40.3 | 47.0 | 70.1 | 64.6 |
| SDAR-VL | 8B | 44.0 | 28.2 | 2142 | 66.1 | 53.3 | 79.6 | 82.4 |
| MMaDA | 8B | 30.2 | 21.5 | 1287 | 28.2 | 25.7 | 54.9 | 43.2 |
| Dimple-VL | 7B | 46.4 | 24.1 | 1924 | 51.9 | 47.7 | 74.2 | 58.4 |
Bard-VL Converted from Qwen3-VL
| Model | Parameters | MMMU (val) | MMMU-Pro (standard) | MME (sum) | RealWorldQA | MMStar | AI2D | ChartQA |
|---|---|---|---|---|---|---|---|---|
| Bard-VL (B = 32) | 2B | 42.0 | 27.9 | 2045 | 64.6 | 53.1 | 72.6 | 76.8 |
| Bard-VL (B = 32) | 4B | 53.0 | 34.2 | 2305 | 71.9 | 63.6 | 82.8 | 80.2 |
| Bard-VL (B = 32) | 8B | 54.6 | 37.6 | 2393 | 70.7 | 65.0 | 83.2 | 84.6 |
🛠️ Environment
Make sure your environment is aligned with the repository requirements.txt:
python>=3.10
torch==2.8.0
torchvision==0.23.0
transformers==4.57.3
diffusers==0.36.0
accelerate==1.12.0
deepspeed==0.17.0
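One way to confirm that an environment matches these pins is to compare them against the installed package metadata. The helper below is a minimal sketch using only the standard library; `PINNED` copies the versions listed above, and `check_pins` is a hypothetical name, not part of the repository.

```python
from importlib import metadata

# Versions copied from the pinned list above.
PINNED = {
    "torch": "2.8.0",
    "torchvision": "0.23.0",
    "transformers": "4.57.3",
    "diffusers": "0.36.0",
    "accelerate": "1.12.0",
    "deepspeed": "0.17.0",
}


def check_pins(pins=PINNED, get_version=metadata.version):
    """Return {package: (expected, installed_or_None, matches)}."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Builds such as "2.8.0+cu128" carry a local suffix after "+";
        # compare only the public part of the version string.
        public = installed.split("+")[0] if installed else None
        report[name] = (expected, installed, public == expected)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in check_pins().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:14s} expected {expected:8s} installed {installed} [{status}]")
```

The Python version itself is not covered by this check; `sys.version_info >= (3, 10)` can be tested separately.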
Recommended runtime settings in the local repository:
dtype = bfloat16
attn_implementation = sdpa
block_size = 4
denoising_steps = 4
🚀 Inference Example
The official repository inference flow is implemented in inference.py. A minimal image understanding example aligned with that script is shown below.
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info
from nemo_automodel.components.models.bard_vl import BardVLForConditionalGeneration

model_id = "fudan-generative-ai/Bard-VL-B4-Mask-8B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model in bfloat16 with SDPA attention and switch to eval mode.
model = BardVLForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    _attn_implementation="sdpa",
).to(device).eval()
processor = AutoProcessor.from_pretrained(model_id)

# One system turn plus one user turn containing an image and a text prompt.
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/puzzle.jpg", "min_pixels": 256 * 256, "max_pixels": 2048 * 2048},
            {"type": "text", "text": "Please describe this image."},
        ],
    },
]

# Render the chat template to a prompt string; tokenization happens in the processor call.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Collect the image and video inputs referenced by the messages.
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages,
    return_video_kwargs=True,
    return_video_metadata=False,
    image_patch_size=processor.image_processor.patch_size,
)

# Build the model inputs and move them to the target device.
batch = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=False,
    return_tensors="pt",
    **video_kwargs,
).to(device)

# block_size and denoising_steps follow the recommended runtime settings.
response_ids = model.generate(
    batch,
    max_new_tokens=1024,
    block_size=4,
    denoising_steps=4,
    temperature=0.0,
    top_k=0,
    top_p=1.0,
    remasking_strategy="low_confidence_dynamic",
    confidence_threshold=0.5,
    return_step_stats=False,
)

# Decode the generated ids to text and print the first (only) response.
print(processor.tokenizer.batch_decode(response_ids, skip_special_tokens=True)[0].strip())
For video understanding, replace the image message with the video example in inference.py.
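To give some intuition for how `block_size`, `denoising_steps`, and `confidence_threshold` might interact during generation, here is a toy decoding loop in plain Python. It is an assumption about how confidence-based remasking typically works in block-diffusion decoders, not the logic behind `remasking_strategy="low_confidence_dynamic"` in the repository. `toy_predict` is a hypothetical stand-in for a model forward pass, and its confidences are fixed per position, whereas a real model would re-score the remaining masked positions at every step.

```python
MASK = -1  # hypothetical mask id used only in this toy example


def toy_predict(prefix, block, offset):
    """Hypothetical stand-in for one forward pass over the current block.

    Returns {index_in_block: (token, confidence)} for masked positions.
    The "token" is simply the absolute position, so the output is easy to check.
    """
    proposals = {}
    for i, tok in enumerate(block):
        if tok == MASK:
            pos = offset + i
            proposals[i] = (pos, ((pos * 37) % 10) / 10.0)
    return proposals


def blockwise_decode(num_tokens, block_size=4, denoising_steps=4,
                     confidence_threshold=0.5, predict=toy_predict):
    """Decode block by block, committing confident tokens at each step."""
    if block_size < 1 or denoising_steps < 1:
        raise ValueError("block_size and denoising_steps must be at least 1")
    output, forward_passes = [], 0
    for offset in range(0, num_tokens, block_size):
        block = [MASK] * min(block_size, num_tokens - offset)
        for step in range(denoising_steps):
            proposals = predict(output, block, offset)
            forward_passes += 1
            last_step = step == denoising_steps - 1
            # Commit confident positions; on the last step commit everything.
            accepted = [i for i, (_, conf) in proposals.items()
                        if last_step or conf >= confidence_threshold]
            if not accepted:
                # Guarantee progress: commit the single most confident position.
                accepted = [max(proposals, key=lambda i: proposals[i][1])]
            for i in accepted:
                block[i] = proposals[i][0]
            if MASK not in block:
                break  # block finished early, skip the remaining steps
        output.extend(block)
    return output, forward_passes


if __name__ == "__main__":
    tokens, passes = blockwise_decode(10)
    print(tokens, passes)
```

In this sketch the number of forward passes is bounded by the number of blocks times `denoising_steps`, and blocks whose positions clear the threshold early finish in fewer steps.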
📚 Citation
@article{chen2026bard,
title={BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation},
author={Baoyou Chen and Hanchen Xia and Peng Tu and Haojun Shi and Liwei Zhang and Weihao Yuan and Siyu Zhu},
journal={arXiv preprint arXiv:2604.16514},
year={2026}
}