BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

Baoyou Chen1,3 · Hanchen Xia1 · Peng Tu1 · Haojun Shi1 · Liwei Zhang1 · Weihao Yuan4 · Siyu Zhu1,2,3,†

1Shanghai Academy of AI for Science   ·   2Shanghai Innovation Institute   ·   3Fudan University   ·   4Nanjing University

🤗 Model   |   🏠 Project Page   |   📑 Paper   |   ✨ Code

Bard-VL-B8-Mask-4B-Distil-Instruct

Bard-VL-B8-Mask-4B-Distil-Instruct is a 4B-class vision-language instruction model with masked discrete-diffusion decoding.

It is part of the Bard-VL family and is designed to bridge autoregressive and diffusion-style vision-language models through Progressive Block Merging (PBM) and Stage-Wise Distillation (SWD).

Compared with a standard autoregressive VLM, Bard-VL emphasizes:

  • parallel block-wise decoding instead of token-by-token generation
  • controllable response generation through blockwise denoising

✨ Highlights

  • Progressive Block Merging: Bard-VL increases the decoding block size progressively instead of jumping directly from autoregressive decoding to large-block diffusion.
  • Stage-Wise dVLM Distillation: Bard-VL distills from a small-block diffusion anchor in the same denoising regime, reducing autoregressive-to-diffusion transfer mismatch.
  • Packed Multimodal Attention Mask: the packed attention layout reuses shared multimodal context across clean and noisy branches to reduce redundant computation.
  • Mixed-Noise Training: Bard-VL combines masked-token and uniform-token corruption to support both token completion and visible-token revision.
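The mixed-noise idea can be illustrated with a small corruption routine. This is a sketch under stated assumptions, not the training code: the names `corrupt`, `noise_ratio`, `uniform_share`, and `MASK_ID` are hypothetical, and how Bard-VL actually schedules the two noise types is not specified here.

```python
import random

MASK_ID = -1  # hypothetical id for the mask token


def corrupt(token_ids, vocab_size, noise_ratio, uniform_share, seed=0):
    """Corrupt a fraction of tokens with masked or uniform noise.

    noise_ratio:   probability that a position is corrupted at all.
    uniform_share: among corrupted positions, probability of replacing the
                   token with a uniformly random id instead of MASK_ID.
    Returns (noisy_ids, kinds) where kinds[i] is "clean", "mask" or "uniform".
    """
    rng = random.Random(seed)
    noisy, kinds = [], []
    for token in token_ids:
        if rng.random() >= noise_ratio:
            noisy.append(token)
            kinds.append("clean")
        elif rng.random() < uniform_share:
            # The random id may coincide with the original token by chance.
            noisy.append(rng.randrange(vocab_size))
            kinds.append("uniform")
        else:
            noisy.append(MASK_ID)
            kinds.append("mask")
    return noisy, kinds
```

Masked positions are visibly missing, so they exercise token completion. Uniformly replaced positions look like ordinary tokens, so a model trained on them has to learn to spot and revise visible tokens.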

🧭 Method Structure

Figure: Bard-VL method overview. Pipeline, block-wise attention mask, and mixed-noise scheduler used by Bard-VL.
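One common way to realise a block-wise attention mask is a block-causal layout: every response position sees the full multimodal prompt, all earlier response blocks, and its own block bidirectionally. The sketch below builds such a mask. It is an assumption for illustration and does not reproduce the packed clean/noisy layout that Bard-VL uses; `block_causal_mask` is a hypothetical name.

```python
def block_causal_mask(prompt_len, response_len, block_size):
    """Return mask[q][k] = True if query position q may attend to key k.

    Assumed layout: prompt tokens are causal among themselves; response
    tokens see the whole prompt, all earlier blocks, and their own block.
    """
    total = prompt_len + response_len

    def block_of(pos):
        # Prompt positions get block -1; response blocks count from 0.
        if pos < prompt_len:
            return -1
        return (pos - prompt_len) // block_size

    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            if q < prompt_len:
                row.append(k <= q)  # causal within the prompt
            else:
                row.append(block_of(k) <= block_of(q))
        mask.append(row)
    return mask
```

With `prompt_len=2`, `response_len=4`, and `block_size=2`, the first response block attends to the prompt and to itself but not to the second block, while the second block attends to everything before it.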


📊 Evaluation Results

AutoRegressive Vision-Language Models

Model        Parameters  MMMU (val)  MMMU-Pro (standard)  MME (sum)  RealWorldQA  MMStar  AI2D  ChartQA
Qwen3-VL     4B          47.9        35.0                 2297       70.5         56.9    81.0  80.9
Qwen3-VL     8B          53.0        36.0                 2379       69.5         59.9    83.5  84.0
InternVL3.5  4B          57.4        38.2                 2236       66.7         65.6    80.6  86.2
InternVL3.5  8B          57.2        41.0                 2359       63.1         66.3    82.1  87.0

Diffusion Vision-Language Models

Model        Parameters  MMMU (val)  MMMU-Pro (standard)  MME (sum)  RealWorldQA  MMStar  AI2D  ChartQA
LLaDA-V      8B          48.8        35.4                 1998       63.4         60.4    77.8  78.2
Dream-VL     7B          51.6        25.0                 2179       67.7         59.9    80.4  86.2
LaviDa       8B          44.2        28.6                 1711       40.3         47.0    70.1  64.6
SDAR-VL      8B          44.0        28.2                 2142       66.1         53.3    79.6  82.4
MMaDA        8B          30.2        21.5                 1287       28.2         25.7    54.9  43.2
Dimple-VL    7B          46.4        24.1                 1924       51.9         47.7    74.2  58.4

Bard-VL Converted from Qwen3-VL

Model             Parameters  MMMU (val)  MMMU-Pro (standard)  MME (sum)  RealWorldQA  MMStar  AI2D  ChartQA
Bard-VL (B = 32)  2B          42.0        27.9                 2045       64.6         53.1    72.6  76.8
Bard-VL (B = 32)  4B          53.0        34.2                 2305       71.9         63.6    82.8  80.2
Bard-VL (B = 32)  8B          54.6        37.6                 2393       70.7         65.0    83.2  84.6
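
Since Bard-VL is converted from Qwen3-VL, the like-for-like comparison is against the Qwen3-VL row of the same size. The small helper below uses only numbers copied from the tables above to make the per-benchmark differences explicit. The helper name `deltas` is introduced here for illustration, and the Bard-VL rows are the B = 32 variants reported above.

```python
BENCHMARKS = ["MMMU", "MMMU-Pro", "MME", "RealWorldQA", "MMStar", "AI2D", "ChartQA"]

# Rows copied from the tables above.
QWEN3_VL = {
    "4B": [47.9, 35.0, 2297, 70.5, 56.9, 81.0, 80.9],
    "8B": [53.0, 36.0, 2379, 69.5, 59.9, 83.5, 84.0],
}
BARD_VL_B32 = {
    "4B": [53.0, 34.2, 2305, 71.9, 63.6, 82.8, 80.2],
    "8B": [54.6, 37.6, 2393, 70.7, 65.0, 83.2, 84.6],
}


def deltas(size):
    """Bard-VL (B = 32) score minus Qwen3-VL score, per benchmark."""
    return {
        name: round(bard - base, 1)
        for name, bard, base in zip(BENCHMARKS, BARD_VL_B32[size], QWEN3_VL[size])
    }
```

At 4B this gives +5.1 on MMMU and +6.7 on MMStar, with small drops on MMMU-Pro (-0.8) and ChartQA (-0.7).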

🛠️ Environment

Make sure your environment matches the repository's requirements.txt:

python>=3.10
torch==2.8.0
torchvision==0.23.0
transformers==4.57.3
diffusers==0.36.0
accelerate==1.12.0
deepspeed==0.17.0
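
A quick way to check a local environment against these pins is to compare installed versions using the standard-library `importlib.metadata` module. The helper names `installed_version` and `find_mismatches` are hypothetical, and ignoring local build tags such as `+cu128` is an assumption made for convenience.

```python
import sys
from importlib import metadata

# Pins copied from the list above.
PINNED = {
    "torch": "2.8.0",
    "torchvision": "0.23.0",
    "transformers": "4.57.3",
    "diffusers": "0.36.0",
    "accelerate": "1.12.0",
    "deepspeed": "0.17.0",
}


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to an (expected, found) pair."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        # Ignore local build tags such as "+cu128" when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    if sys.version_info < (3, 10):
        print("Python >= 3.10 is required")
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result from `find_mismatches(PINNED)` means every pinned package is installed at the expected version.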

Recommended runtime settings in the local repository:

dtype = bfloat16
attn_implementation = sdpa
block_size = 8
denoising_steps = 8
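
To see what these settings imply for decoding cost, the arithmetic below bounds the number of decoder passes. It assumes one forward pass per denoising step and ignores prompt prefill. Whether a block can finish in fewer than `denoising_steps` passes depends on the decoding strategy, so `tokens_per_pass` is an illustrative assumption. Both function names are hypothetical.

```python
import math


def max_forward_passes(max_new_tokens, block_size, denoising_steps):
    """Upper bound on decoder passes: every block uses every step."""
    num_blocks = math.ceil(max_new_tokens / block_size)
    return num_blocks * denoising_steps


def tokens_per_pass(block_size, steps_used):
    """Average tokens committed per pass when a block finishes early."""
    return block_size / steps_used
```

With `block_size=8` and `denoising_steps=8`, generating 1024 tokens takes at most 128 × 8 = 1024 passes, the same as token-by-token decoding. Under these assumptions, any speedup comes from blocks that finish in fewer steps: a block of 8 that completes in 2 steps commits 4 tokens per pass on average.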

🚀 Inference Example

The official inference flow is implemented in inference.py in the repository. A minimal image-understanding example that follows that script is shown below.

import torch
from transformers import AutoProcessor

from qwen_vl_utils import process_vision_info
from nemo_automodel.components.models.bard_vl import BardVLForConditionalGeneration

model_id = "fudan-generative-ai/Bard-VL-B8-Mask-4B-Distil-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the checkpoint in bfloat16 with SDPA attention and switch to eval mode.
model = BardVLForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    _attn_implementation="sdpa",
).to(device).eval()
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/puzzle.jpg", "min_pixels": 256 * 256, "max_pixels": 2048 * 2048},
            {"type": "text", "text": "Please describe this image."},
        ],
    },
]

# Render the chat messages into a prompt string; tokenization happens in the processor call.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Load and preprocess the image and video inputs referenced in the messages.
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages,
    return_video_kwargs=True,
    return_video_metadata=False,
    image_patch_size=processor.image_processor.patch_size,
)

# Tokenize the prompt, pack it with the vision inputs, and move the batch to the target device.
batch = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=False,
    return_tensors="pt",
    **video_kwargs,
).to(device)

# Generate with the recommended block size and number of denoising steps.
response_ids = model.generate(
    batch,
    max_new_tokens=1024,
    block_size=8,
    denoising_steps=8,
    temperature=0.0,
    top_k=0,
    top_p=1.0,
    remasking_strategy="low_confidence_dynamic",
    confidence_threshold=0.5,
    return_step_stats=False,
)

print(processor.tokenizer.batch_decode(response_ids, skip_special_tokens=True)[0].strip())
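
The `remasking_strategy` and `confidence_threshold` arguments in the example above suggest a confidence-gated unmasking rule. The sketch below shows one plausible reading of such a rule for a single denoising step. It is an assumption for illustration, not the logic in inference.py, and `select_to_commit` is a hypothetical name.

```python
def select_to_commit(confidences, still_masked, threshold):
    """Pick which masked positions to commit in one denoising step.

    confidences:  per-position confidence of the model's top prediction.
    still_masked: per-position flag, True while the slot is unfilled.
    Commits every masked position at or above `threshold`; if none qualifies,
    commits the single most confident one so that each step makes progress.
    """
    candidates = [i for i, masked in enumerate(still_masked) if masked]
    if not candidates:
        return []
    chosen = [i for i in candidates if confidences[i] >= threshold]
    if not chosen:
        chosen = [max(candidates, key=lambda i: confidences[i])]
    return chosen
```

Under this reading, a higher threshold commits fewer tokens per step and keeps low-confidence positions masked for another pass, while a lower threshold fills the block in fewer steps.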

For video understanding, replace the image message with the video example in inference.py.


📚 Citation

@article{chen2026bard,
  title={BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation},
  author={Baoyou Chen and Hanchen Xia and Peng Tu and Haojun Shi and Liwei Zhang and Weihao Yuan and Siyu Zhu},
  journal={arXiv preprint arXiv:2604.16514},
  year={2026}
}