---
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
language:
- en
- zh
tags:
- Bard-VL
- VLM
- vision-language
- multimodal
- discrete-diffusion
- masked-decoding
- custom_code
metrics:
- accuracy
---
Baoyou Chen<sup>1,3</sup> · Hanchen Xia<sup>1</sup> · Peng Tu<sup>1</sup> · Haojun Shi<sup>1</sup> · Liwei Zhang<sup>1</sup> · Weihao Yuan<sup>4</sup> · Siyu Zhu<sup>1,2,3,†</sup>

<sup>1</sup>Shanghai Academy of AI for Science · <sup>2</sup>Shanghai Innovation Institute · <sup>3</sup>Fudan University · <sup>4</sup>Nanjing University
🤗 Model | 🏠 Project Page | 📑 Paper | ✨ Code
# Bard-VL-B4-Mask-2B-Instruct

**Bard-VL-B4-Mask-2B-Instruct** is a 2B-class vision-language instruction model with **masked discrete-diffusion decoding**. It is part of the **Bard-VL** family and is designed to bridge autoregressive and diffusion-style vision-language models through **Progressive Block Merging (PBM)** and **Stage-Wise Distillation (SWD)**. Compared with a standard autoregressive VLM release, Bard-VL emphasizes:

- **parallel block-wise decoding instead of token-by-token generation**
- **controllable response generation through blockwise denoising**

---

## ✨ Highlights

- **Progressive Block Merging**: Bard-VL increases the decoding block size progressively instead of jumping directly from autoregressive decoding to large-block diffusion.
- **Stage-Wise dVLM Distillation**: Bard-VL distills from a small-block diffusion anchor in the same denoising regime, reducing the autoregressive-to-diffusion transfer mismatch.
- **Packed Multimodal Attention Mask**: the packed attention layout reuses shared multimodal context across clean and noisy branches to reduce redundant computation.
- **Mixed-Noise Training**: Bard-VL combines masked-token and uniform-token corruption to support both token completion and visible-token revision (see the sketch after the figure below).

---

## 🧭 Method Structure
*Figure: pipeline, block-wise attention mask, and mixed-noise scheduler used by Bard-VL.*
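As a rough illustration of the mixed-noise scheduler, the sketch below combines the two corruption channels per position. This is a hypothetical sketch, not the training code: `mixed_noise_corrupt`, `mask_token_id`, and the ratio arguments are assumptions made for the example.

```python
import torch

def mixed_noise_corrupt(input_ids, mask_token_id, vocab_size,
                        corrupt_ratio=0.3, mask_ratio=0.5):
    """Corrupt a (batch, seq_len) response: each corrupted position is either
    replaced by [MASK] (token completion) or by a uniformly random token
    (visible-token revision)."""
    shape, device = input_ids.shape, input_ids.device
    corrupt = torch.rand(shape, device=device) < corrupt_ratio   # which positions to noise
    use_mask = torch.rand(shape, device=device) < mask_ratio     # which noise channel
    random_ids = torch.randint_like(input_ids, vocab_size)

    noisy = input_ids.clone()
    noisy[corrupt & use_mask] = mask_token_id                    # masked-token corruption
    noisy[corrupt & ~use_mask] = random_ids[corrupt & ~use_mask] # uniform corruption
    return noisy, corrupt    # corrupted positions supply the denoising targets
```

Training on both channels is what lets decoding both fill masked slots and revise tokens that are already visible but wrong.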
---

## 📊 Evaluation Results

### Autoregressive Vision-Language Models

| Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Qwen3-VL | 4B | 47.9 | 35.0 | 2297 | 70.5 | 56.9 | 81.0 | 80.9 |
| Qwen3-VL | 8B | 53.0 | 36.0 | 2379 | 69.5 | 59.9 | 83.5 | 84.0 |
| InternVL3.5 | 4B | 57.4 | 38.2 | 2236 | 66.7 | 65.6 | 80.6 | 86.2 |
| InternVL3.5 | 8B | 57.2 | 41.0 | 2359 | 63.1 | 66.3 | 82.1 | 87.0 |

### Diffusion Vision-Language Models

| Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| LLaDA-V | 8B | 48.8 | 35.4 | 1998 | 63.4 | 60.4 | 77.8 | 78.2 |
| Dream-VL | 7B | 51.6 | 25.0 | 2179 | 67.7 | 59.9 | 80.4 | 86.2 |
| LaviDa | 8B | 44.2 | 28.6 | 1711 | 40.3 | 47.0 | 70.1 | 64.6 |
| SDAR-VL | 8B | 44.0 | 28.2 | 2142 | 66.1 | 53.3 | 79.6 | 82.4 |
| MMaDA | 8B | 30.2 | 21.5 | 1287 | 28.2 | 25.7 | 54.9 | 43.2 |
| Dimple-VL | 7B | 46.4 | 24.1 | 1924 | 51.9 | 47.7 | 74.2 | 58.4 |

### Bard-VL Converted from Qwen3-VL

| Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Bard-VL (*B* = 32) | 2B | 42.0 | 27.9 | 2045 | 64.6 | 53.1 | 72.6 | 76.8 |
| Bard-VL (*B* = 32) | 4B | 53.0 | 34.2 | 2305 | 71.9 | 63.6 | 82.8 | 80.2 |
| Bard-VL (*B* = 32) | 8B | 54.6 | 37.6 | 2393 | 70.7 | 65.0 | 83.2 | 84.6 |

---

## 🛠️ Environment

Make sure your environment is aligned with the repository `requirements.txt`:

```bash
python>=3.10
torch==2.8.0
torchvision==0.23.0
transformers==4.57.3
diffusers==0.36.0
accelerate==1.12.0
deepspeed==0.17.0
```

Recommended runtime settings in the local repository:

```bash
dtype = bfloat16
attn_implementation = sdpa
block_size = 4
denoising_steps = 4
```

---

## 🚀 Inference Example

The official repository inference flow is implemented in `inference.py`. A minimal image-understanding example aligned with that script follows the conceptual sketch below.
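To make the decoding arguments concrete first, here is a minimal conceptual sketch of one blockwise denoising step with low-confidence remasking. It is not the repository's implementation: `logits_fn`, `denoise_block`, `mask_id`, and the control flow are illustrative assumptions; only `block_size`, `denoising_steps`, and `confidence_threshold` correspond to actual `generate` arguments.

```python
import torch

@torch.no_grad()
def denoise_block(logits_fn, tokens, start, mask_id,
                  block_size=4, denoising_steps=4, confidence_threshold=0.5):
    """Decode one block of `block_size` tokens in `denoising_steps` passes.
    `logits_fn` maps the full (batch, seq_len) token tensor to
    (batch, seq_len, vocab) logits; `mask_id` is the [MASK] token id."""
    blk = slice(start, start + block_size)
    tokens[:, blk] = mask_id                       # the block starts fully masked
    for step in range(denoising_steps):
        probs = logits_fn(tokens)[:, blk].softmax(dim=-1)
        conf, pred = probs.max(dim=-1)             # greedy pick and its confidence
        still_masked = tokens[:, blk] == mask_id
        if step == denoising_steps - 1:
            commit = still_masked                  # final pass: fill everything left
        else:                                      # low-confidence remasking:
            commit = still_masked & (conf >= confidence_threshold)
        block = tokens[:, blk].clone()
        block[commit] = pred[commit]
        tokens[:, blk] = block                     # low-confidence slots stay [MASK]
    return tokens
```

Generation repeats this per block: each committed block becomes clean context for the next, so `block_size` and `denoising_steps` trade decoding parallelism against how settled the conditioning context is. The complete example: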
```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

from nemo_automodel.components.models.bard_vl import BardVLForConditionalGeneration

model_id = "fudan-generative-ai/Bard-VL-B4-Mask-2B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model in bfloat16 with SDPA attention, matching the recommended
# runtime settings above.
model = BardVLForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    _attn_implementation="sdpa",
).to(device).eval()
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "assets/puzzle.jpg",
                "min_pixels": 256 * 256,
                "max_pixels": 2048 * 2048,
            },
            {"type": "text", "text": "Please describe this image."},
        ],
    },
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages,
    return_video_kwargs=True,
    return_video_metadata=False,
    image_patch_size=processor.image_processor.patch_size,
)
batch = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=False,
    return_tensors="pt",
    **video_kwargs,
).to(device)

# Blockwise denoising decode: 4-token blocks, 4 denoising steps per block,
# greedy sampling with dynamic low-confidence remasking.
response_ids = model.generate(
    batch,
    max_new_tokens=1024,
    block_size=4,
    denoising_steps=4,
    temperature=0.0,
    top_k=0,
    top_p=1.0,
    remasking_strategy="low_confidence_dynamic",
    confidence_threshold=0.5,
    return_step_stats=False,
)
print(processor.tokenizer.batch_decode(response_ids, skip_special_tokens=True)[0].strip())
```

For video understanding, replace the image message with the video example in `inference.py`.

---

## 📚 Citation

```bibtex
@article{chen2026bard,
  title={BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation},
  author={Baoyou Chen and Hanchen Xia and Peng Tu and Haojun Shi and Liwei Zhang and Weihao Yuan and Siyu Zhu},
  journal={arXiv preprint arXiv:2604.16514},
  year={2026}
}
```