---
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
language:
- en
- zh
tags:
- Bard-VL
- VLM
- vision-language
- multimodal
- discrete-diffusion
- masked-decoding
- custom_code
metrics:
- accuracy
---

<h1 align="center">BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation</h1>


<p align="center">
<a href="https://github.com/cbyzju">Baoyou Chen</a><sup>1,3</sup> ·
<a href="https://github.com/1ring2rta">Hanchen Xia</a><sup>1</sup> ·
<a href="https://github.com/yhpengtu-rgb">Peng Tu</a><sup>1</sup> ·
<a href="https://github.com/Theseus-427">Haojun Shi</a><sup>1</sup> ·
<a href="https://github.com/AricGamma">Liwei Zhang</a><sup>1</sup> ·
<a href="https://github.com/weihaosky">Weihao Yuan</a><sup>4</sup> ·
<a href="https://sites.google.com/site/zhusiyucs/home">Siyu Zhu</a><sup>1,2,3,†</sup>
</p>


<p align="center">
<sup>1</sup>Shanghai Academy of AI for Science
·
<sup>2</sup>Shanghai Innovation Institute
·
<sup>3</sup>Fudan University
·
<sup>4</sup>Nanjing University
</p>


<p align="center">
🤗 <a href="https://huggingface.co/fudan-generative-ai/Bard-VL-B4-Mask-4B-Instruct">Model</a>
|
🌐 <a href="https://fudan-generative-vision.github.io/Bard-VL">Project Page</a>
|
📄 <a href="https://huggingface.co/papers/2604.16514">Paper</a>
|
✨ <a href="https://github.com/fudan-generative-vision/Bard-VL">Code</a>
</p>


# Bard-VL-B4-Mask-4B-Instruct


**Bard-VL-B4-Mask-4B-Instruct** is a 4B-class vision-language instruction model with **masked discrete-diffusion decoding**.


It is part of the **Bard-VL** family and is designed to bridge autoregressive and diffusion-style vision-language models through **Progressive Block Merging (PBM)** and **Stage-Wise Distillation (SWD)**.
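
To make the merging idea concrete, here is a tiny, hypothetical sketch of what a progressive block-size schedule could look like (the doubling schedule and the `pbm_schedule` helper are illustrative assumptions, not the paper's exact recipe):

```python
def pbm_schedule(final_block_size: int = 4) -> list[int]:
    """Hypothetical PBM schedule: grow the decoding block size stage by
    stage (here by doubling) instead of jumping straight from block size 1
    (autoregressive-like) to the final diffusion block size."""
    sizes, b = [], 1
    while b <= final_block_size:
        sizes.append(b)  # one training/distillation stage per block size
        b *= 2
    return sizes

print(pbm_schedule())  # [1, 2, 4] for this B4 checkpoint
```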


Compared with a standard autoregressive VLM release, Bard-VL emphasizes:


- **parallel block-wise decoding instead of token-by-token generation**
- **controllable response generation through block-wise denoising** (a schematic decoding sketch follows this list)
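
The loop below is a minimal, self-contained sketch of that block-wise behavior, assuming a hypothetical `MASK_ID` and a `logits_fn` stand-in for the model forward pass; it is illustrative only, not the Bard-VL implementation:

```python
import torch

MASK_ID = 0  # hypothetical mask-token id

def decode_block(logits_fn, prefix: torch.Tensor, block_size: int = 4, steps: int = 4) -> torch.Tensor:
    """Denoise one block of `block_size` masked tokens in at most `steps`
    parallel passes, committing the most confident predictions each pass."""
    block = torch.full((block_size,), MASK_ID, dtype=torch.long)
    per_step = max(1, block_size // steps)  # tokens committed per denoising step
    pred = block.clone()
    for _ in range(steps):
        masked = block == MASK_ID
        if not masked.any():
            break
        # One forward pass predicts every position in the block at once.
        logits = logits_fn(torch.cat([prefix, block]))[-block_size:]
        conf, pred = logits.softmax(-1).max(-1)
        conf = conf.masked_fill(~masked, float("-inf"))  # only fill masked slots
        top = conf.topk(min(per_step, int(masked.sum()))).indices
        block[top] = pred[top]
    block[block == MASK_ID] = pred[block == MASK_ID]  # force-fill any leftovers
    return block

# Toy usage with random logits standing in for the model:
stub = lambda ids: torch.randn(ids.shape[0], 32)
print(decode_block(stub, prefix=torch.tensor([5, 7, 9])))
```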


---


## ✨ Highlights


- **Progressive Block Merging**: Bard-VL increases the decoding block size progressively instead of jumping directly from autoregressive decoding to large-block diffusion (see the schedule sketch above).
- **Stage-Wise dVLM Distillation**: Bard-VL distills from a small-block diffusion anchor in the same denoising regime, reducing autoregressive-to-diffusion transfer mismatch.
- **Packed Multimodal Attention Mask**: the packed attention layout reuses shared multimodal context across clean and noisy branches to reduce redundant computation.
- **Mixed-Noise Training**: Bard-VL combines masked-token and uniform-token corruption to support both token completion and visible-token revision (see the corruption sketch after this list).
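
As a rough illustration of that corruption mix, the sketch below combines the two noise types on a toy token sequence; `MASK_ID`, the noise rates, and the vocabulary size are assumptions for illustration, not the training configuration:

```python
import torch

MASK_ID = 0  # hypothetical mask-token id

def mixed_noise(tokens: torch.Tensor, vocab_size: int,
                mask_rate: float = 0.5, uniform_rate: float = 0.1) -> torch.Tensor:
    """Corrupt a clean response two ways: masking teaches token completion,
    uniform replacement teaches revision of visible tokens."""
    noisy = tokens.clone()
    u = torch.rand(tokens.shape)
    noisy[u < mask_rate] = MASK_ID                              # masked-token noise
    revise = (u >= mask_rate) & (u < mask_rate + uniform_rate)  # uniform-token noise
    noisy[revise] = torch.randint(vocab_size, (int(revise.sum()),))
    return noisy

print(mixed_noise(torch.tensor([4, 8, 15, 16, 23, 42]), vocab_size=1000))
```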


---


## 🧭 Method Structure


<p align="center">
<img src="./model.PNG" alt="Bard-VL method overview" width="100%">
</p>


<p align="center">
<em>Pipeline, block-wise attention mask, and mixed-noise scheduler used by Bard-VL.</em>
</p>


---


## 📊 Evaluation Results


### AutoRegressive Vision-Language Models


| Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Qwen3-VL | 4B | 47.9 | 35.0 | 2297 | 70.5 | 56.9 | 81.0 | 80.9 |
| Qwen3-VL | 8B | 53.0 | 36.0 | 2379 | 69.5 | 59.9 | 83.5 | 84.0 |
| InternVL3.5 | 4B | 57.4 | 38.2 | 2236 | 66.7 | 65.6 | 80.6 | 86.2 |
| InternVL3.5 | 8B | 57.2 | 41.0 | 2359 | 63.1 | 66.3 | 82.1 | 87.0 |


### Diffusion Vision-Language Models


| Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| LLaDA-V | 8B | 48.8 | 35.4 | 1998 | 63.4 | 60.4 | 77.8 | 78.2 |
| Dream-VL | 7B | 51.6 | 25.0 | 2179 | 67.7 | 59.9 | 80.4 | 86.2 |
| LaviDa | 8B | 44.2 | 28.6 | 1711 | 40.3 | 47.0 | 70.1 | 64.6 |
| SDAR-VL | 8B | 44.0 | 28.2 | 2142 | 66.1 | 53.3 | 79.6 | 82.4 |
| MMaDA | 8B | 30.2 | 21.5 | 1287 | 28.2 | 25.7 | 54.9 | 43.2 |
| Dimple-VL | 7B | 46.4 | 24.1 | 1924 | 51.9 | 47.7 | 74.2 | 58.4 |


### Bard-VL Converted from Qwen3-VL


| Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Bard-VL (*B* = 32) | 2B | 42.0 | 27.9 | 2045 | 64.6 | 53.1 | 72.6 | 76.8 |
| Bard-VL (*B* = 32) | 4B | 53.0 | 34.2 | 2305 | 71.9 | 63.6 | 82.8 | 80.2 |
| Bard-VL (*B* = 32) | 8B | 54.6 | 37.6 | 2393 | 70.7 | 65.0 | 83.2 | 84.6 |


---


## 🛠️ Environment


Make sure your environment is aligned with the repository `requirements.txt`:


```text
python>=3.10
torch==2.8.0
torchvision==0.23.0
transformers==4.57.3
diffusers==0.36.0
accelerate==1.12.0
deepspeed==0.17.0
```


Recommended runtime settings in the local repository (these map to the `from_pretrained` and `generate` arguments in the inference example below):


```text
dtype = bfloat16
attn_implementation = sdpa
block_size = 4
denoising_steps = 4
```


---


## 🚀 Inference Example


The official repository inference flow is implemented in `inference.py`. A minimal image-understanding example aligned with that script is shown below.


```python
import torch
from transformers import AutoProcessor

from qwen_vl_utils import process_vision_info
from nemo_automodel.components.models.bard_vl import BardVLForConditionalGeneration

model_id = "fudan-generative-ai/Bard-VL-B4-Mask-4B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load in bfloat16 with SDPA attention, matching the recommended runtime settings.
model = BardVLForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    _attn_implementation="sdpa",
).to(device).eval()
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/puzzle.jpg", "min_pixels": 256 * 256, "max_pixels": 2048 * 2048},
            {"type": "text", "text": "Please describe this image."},
        ],
    },
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages,
    return_video_kwargs=True,
    return_video_metadata=False,
    image_patch_size=processor.image_processor.patch_size,
)

batch = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=False,
    return_tensors="pt",
    **video_kwargs,
).to(device)

# Block-wise diffusion decoding: each 4-token block is denoised in 4 steps;
# low-confidence predictions are re-masked until they clear the 0.5 threshold.
response_ids = model.generate(
    batch,
    max_new_tokens=1024,
    block_size=4,
    denoising_steps=4,
    temperature=0.0,
    top_k=0,
    top_p=1.0,
    remasking_strategy="low_confidence_dynamic",
    confidence_threshold=0.5,
    return_step_stats=False,
)

print(processor.tokenizer.batch_decode(response_ids, skip_special_tokens=True)[0].strip())
```


For video understanding, replace the image message with the video example in `inference.py`.
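
The `remasking_strategy="low_confidence_dynamic"` and `confidence_threshold` arguments control which predictions survive each denoising step. A minimal sketch of one plausible reading of that rule, assuming a hypothetical `MASK_ID` and simple tensors (this is not the repository implementation):

```python
import torch

MASK_ID = 0  # hypothetical mask-token id

def remask_low_confidence(pred: torch.Tensor, probs: torch.Tensor,
                          threshold: float = 0.5) -> torch.Tensor:
    """Keep predicted tokens whose max probability clears `threshold`;
    re-mask the rest so the next denoising step can revise them."""
    conf = probs.max(dim=-1).values
    out = torch.where(conf >= threshold, pred, torch.full_like(pred, MASK_ID))
    # Guarantee progress: if nothing clears the threshold, keep the single
    # most confident prediction so decoding cannot stall.
    if (out == MASK_ID).all():
        out[conf.argmax()] = pred[conf.argmax()]
    return out

# Toy usage: 4 block positions over a 10-token vocabulary.
probs = torch.softmax(torch.randn(4, 10) * 3, dim=-1)
print(remask_low_confidence(probs.argmax(-1), probs))
```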


---


## 📖 Citation


```bibtex
@article{chen2026bard,
  title={BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation},
  author={Baoyou Chen and Hanchen Xia and Peng Tu and Haojun Shi and Liwei Zhang and Weihao Yuan and Siyu Zhu},
  journal={arXiv preprint arXiv:2604.16514},
  year={2026}
}
```
