Instructions for using fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct with libraries and local apps.

Libraries

How to use fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct", trust_remote_code=True, dtype="auto")
```

Local Apps

vLLM

How to use fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct with vLLM:

Install from pip and serve model

```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```
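The same chat-completions request can also be sent from Python. The sketch below rebuilds the JSON body from the curl call and posts it with the standard library only. `build_chat_payload` and `chat` are hypothetical helper names introduced here for illustration, and the default base URL assumes the vLLM server is listening on port 8000 as in the command above.

```python
import json
import urllib.request

MODEL_ID = "fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct"


def build_chat_payload(prompt: str, image_url: str, model: str = MODEL_ID) -> dict:
    """Build the request body used in the curl example (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the OpenAI-compatible endpoint and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `chat(build_chat_payload(prompt, image_url))["choices"][0]["message"]["content"]` should hold the generated text. For an SGLang server on port 30000, pass `base_url="http://localhost:30000"`.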

Use Docker

```bash
docker model run hf.co/fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct
```

SGLang

How to use fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct with SGLang:

Install from pip and serve model

```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct" \
        --host 0.0.0.0 \
        --port 30000
```
Call the server with the same curl request (port 30000) shown in the pip-based SGLang example above.

Docker Model Runner
How to use fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct with Docker Model Runner:
```
docker model run hf.co/fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct
```

Create README.md

by merryyuan - opened 15 days ago

base: refs/heads/main ← from: refs/pr/2

README.md +229 -0

	@@ -0,0 +1,229 @@

+---
+license: mit
+library_name: transformers
+pipeline_tag: image-text-to-text
+language:
+- en
+- zh
+tags:
+- Bard-VL
+- VLM
+- vision-language
+- multimodal
+- discrete-diffusion
+- masked-decoding
+- custom_code
+metrics:
+- accuracy
+---
+<h1 align="center">BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation</h1>
+<p align="center">
+  <a href="https://github.com/cbyzju">Baoyou Chen</a><sup>1,3</sup> ·
+  <a href="https://github.com/1ring2rta">Hanchen Xia</a><sup>1</sup> ·
+  <a href="https://github.com/yhpengtu-rgb">Peng Tu</a><sup>1</sup> ·
+  <a href="https://github.com/Theseus-427">Haojun Shi</a><sup>1</sup> ·
+  <a href="https://github.com/AricGamma">Liwei Zhang</a><sup>1</sup> ·
+  <a href="https://github.com/weihaosky">Weihao Yuan</a><sup>4</sup> ·
+  <a href="https://sites.google.com/site/zhusiyucs/home">Siyu Zhu</a><sup>1,2,3,†</sup>
+</p>
+<p align="center">
+  <sup>1</sup>Shanghai Academy of AI for Science
+  &nbsp;&nbsp;·&nbsp;&nbsp;
+  <sup>2</sup>Shanghai Innovation Institute
+  &nbsp;&nbsp;·&nbsp;&nbsp;
+  <sup>3</sup>Fudan University
+  &nbsp;&nbsp;·&nbsp;&nbsp;
+  <sup>4</sup>Nanjing University
+</p>
+<p align="center">
+  🤗 <a href="https://huggingface.co/fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct">Model</a>
+  &nbsp;&nbsp;|&nbsp;&nbsp;
+  🏠 <a href="https://fudan-generative-vision.github.io/Bard-VL">Project Page</a>
+  &nbsp;&nbsp;|&nbsp;&nbsp;
+  📑 <a href="https://huggingface.co/papers/2604.16514">Paper</a>
+  &nbsp;&nbsp;|&nbsp;&nbsp;
+  ✨ <a href="https://github.com/fudan-generative-vision/Bard-VL">Code</a>
+</p>
+# Bard-VL-B16-Mask-4B-Distil-Instruct
+**Bard-VL-B16-Mask-4B-Distil-Instruct** is a 4B-class vision-language instruction model with **masked discrete-diffusion decoding**.
+It is part of the **Bard-VL** family and is designed to bridge autoregressive and diffusion-style vision-language models through **Progressive Block Merging (PBM)** and **Stage-Wise Distillation (SWD)**.
+Compared with a standard autoregressive VLM, Bard-VL emphasizes:
+- **parallel block-wise decoding instead of token-by-token generation**
+- **controllable response generation through blockwise denoising**
+---
+## ✨ Highlights
+- **Progressive Block Merging**: Bard-VL increases the decoding block size progressively instead of jumping directly from autoregressive decoding to large-block diffusion.
+- **Stage-Wise dVLM Distillation**: Bard-VL distills from a small-block diffusion anchor in the same denoising regime, reducing autoregressive-to-diffusion transfer mismatch.
+- **Packed Multimodal Attention Mask**: the packed attention layout reuses shared multimodal context across clean and noisy branches to reduce redundant computation.
+- **Mixed-Noise Training**: Bard-VL combines masked-token and uniform-token corruption to support both token completion and visible-token revision.
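As a rough illustration of the mixed-noise idea in the last highlight, the sketch below corrupts a token sequence with two kinds of noise: some positions are replaced by a mask token, and others by a token drawn uniformly from the vocabulary. The function name, the `MASK_ID` placeholder, and the noise rates are assumptions made for this example; they are not taken from the Bard-VL code, whose actual scheduler may differ.

```python
import random

MASK_ID = -1  # placeholder mask-token id (assumption, not the real tokenizer id)


def mixed_noise_corrupt(tokens, vocab_size, mask_rate, uniform_rate, rng):
    """Return (corrupted_tokens, noise_kinds) for one sequence.

    Each position is independently masked with probability `mask_rate`,
    replaced by a uniformly sampled token with probability `uniform_rate`,
    and left clean otherwise.
    """
    corrupted, kinds = [], []
    for tok in tokens:
        u = rng.random()
        if u < mask_rate:
            corrupted.append(MASK_ID)
            kinds.append("mask")
        elif u < mask_rate + uniform_rate:
            corrupted.append(rng.randrange(vocab_size))
            kinds.append("uniform")
        else:
            corrupted.append(tok)
            kinds.append("clean")
    return corrupted, kinds


noisy, kinds = mixed_noise_corrupt(
    [5, 9, 2, 7, 1, 3], vocab_size=10, mask_rate=0.4, uniform_rate=0.2, rng=random.Random(0)
)
```

Masked positions give the model a completion target, while uniformly replaced positions look like ordinary tokens, so recovering them requires revising tokens that are already visible.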
+---
+## 🧭 Method Structure
+<p align="center">
+  <img src="./model.PNG" alt="Bard-VL method overview" width="100%">
+</p>
+<p align="center">
+  <em>Pipeline, block-wise attention mask, and mixed-noise scheduler used by Bard-VL.</em>
+</p>
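For readers unfamiliar with block-wise attention, a common block-diffusion layout lets each position attend to every position in its own block and in all earlier blocks. The sketch below builds that generic mask. It is an assumption about the general pattern only and does not reproduce the packed multimodal mask shown in the figure; `block_attention_mask` is a hypothetical helper.

```python
def block_attention_mask(seq_len, block_size):
    """Return mask[i][j], True when query position i may attend to key position j.

    Attention is bidirectional inside a block and causal across blocks.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = block_attention_mask(seq_len=6, block_size=2)
```

With `block_size=1` the mask reduces to the usual causal (lower-triangular) mask of an autoregressive model, and with `block_size=seq_len` it becomes fully bidirectional.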
+---
+## 📊 Evaluation Results
+### AutoRegressive Vision-Language Models
+| Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|
+| Qwen3-VL | 4B | 47.9 | 35.0 | 2297 | 70.5 | 56.9 | 81.0 | 80.9 |
+| Qwen3-VL | 8B | 53.0 | 36.0 | 2379 | 69.5 | 59.9 | 83.5 | 84.0 |
+| InternVL3.5 | 4B | 57.4 | 38.2 | 2236 | 66.7 | 65.6 | 80.6 | 86.2 |
+| InternVL3.5 | 8B | 57.2 | 41.0 | 2359 | 63.1 | 66.3 | 82.1 | 87.0 |
+### Diffusion Vision-Language Models
+| Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|
+| LLaDA-V | 8B | 48.8 | 35.4 | 1998 | 63.4 | 60.4 | 77.8 | 78.2 |
+| Dream-VL | 7B | 51.6 | 25.0 | 2179 | 67.7 | 59.9 | 80.4 | 86.2 |
+| LaViDa | 8B | 44.2 | 28.6 | 1711 | 40.3 | 47.0 | 70.1 | 64.6 |
+| SDAR-VL | 8B | 44.0 | 28.2 | 2142 | 66.1 | 53.3 | 79.6 | 82.4 |
+| MMaDA | 8B | 30.2 | 21.5 | 1287 | 28.2 | 25.7 | 54.9 | 43.2 |
+| Dimple-VL | 7B | 46.4 | 24.1 | 1924 | 51.9 | 47.7 | 74.2 | 58.4 |
+### Bard-VL Converted from Qwen3-VL
+| Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|
+| Bard-VL (*B* = 32) | 2B | 42.0 | 27.9 | 2045 | 64.6 | 53.1 | 72.6 | 76.8 |
+| Bard-VL (*B* = 32) | 4B | 53.0 | 34.2 | 2305 | 71.9 | 63.6 | 82.8 | 80.2 |
+| Bard-VL (*B* = 32) | 8B | 54.6 | 37.6 | 2393 | 70.7 | 65.0 | 83.2 | 84.6 |
+---
+## 🛠️ Environment
+Make sure your environment matches the repository's `requirements.txt`:
+```text
+python>=3.10
+torch==2.8.0
+torchvision==0.23.0
+transformers==4.57.3
+diffusers==0.36.0
+accelerate==1.12.0
+deepspeed==0.17.0
+```
+Recommended runtime settings used in the repository:
+```text
+dtype = bfloat16
+attn_implementation = sdpa
+block_size = 16
+denoising_steps = 16
+```
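To get a feel for what these settings imply, the sketch below counts blocks and an upper bound on denoising passes for a given output length. It assumes that each denoising step costs one forward pass over the current block and that `denoising_steps` is a per-block maximum; both are assumptions for illustration, and `decoding_budget` is a hypothetical helper, not part of the repository.

```python
import math


def decoding_budget(max_new_tokens, block_size, denoising_steps):
    """Return (num_blocks, max_forward_passes) under the assumptions stated above."""
    num_blocks = math.ceil(max_new_tokens / block_size)
    max_passes = num_blocks * denoising_steps
    return num_blocks, max_passes


# 1024 new tokens with block_size = 16 and denoising_steps = 16
blocks, passes = decoding_budget(1024, 16, 16)  # (64, 1024)
```

Under these assumptions the worst case matches token-by-token decoding, so any speed-up would come from blocks that finish in fewer than `denoising_steps` steps.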
+---
+## 🚀 Inference Example
+The official inference flow is implemented in `inference.py` in the repository. A minimal image-understanding example aligned with that script is shown below.
+```python
+import torch
+from transformers import AutoProcessor
+from qwen_vl_utils import process_vision_info
+from nemo_automodel.components.models.bard_vl import BardVLForConditionalGeneration
+model_id = "fudan-generative-ai/Bard-VL-B16-Mask-4B-Distil-Instruct"
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# Load the model in bfloat16 with SDPA attention, move it to the device, and switch to eval mode.
+model = BardVLForConditionalGeneration.from_pretrained(
+    model_id,
+    dtype=torch.bfloat16,
+    _attn_implementation="sdpa",
+).to(device).eval()
+processor = AutoProcessor.from_pretrained(model_id)
+messages = [
+    {
+        "role": "system",
+        "content": "You are a helpful assistant.",
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "assets/puzzle.jpg", "min_pixels": 256 * 256, "max_pixels": 2048 * 2048},
+            {"type": "text", "text": "Please describe this image."},
+        ],
+    },
+]
+# Render the chat template to a prompt string; tokenization happens in the processor call below.
+text = processor.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+)
+# Extract the image and video inputs referenced in the messages.
+image_inputs, video_inputs, video_kwargs = process_vision_info(
+    messages,
+    return_video_kwargs=True,
+    return_video_metadata=False,
+    image_patch_size=processor.image_processor.patch_size,
+)
+# Tokenize the prompt and preprocess the visual inputs into one batch on the target device.
+batch = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=False,
+    return_tensors="pt",
+    **video_kwargs,
+).to(device)
+# Generate with the recommended block size and number of denoising steps.
+response_ids = model.generate(
+    batch,
+    max_new_tokens=1024,
+    block_size=16,
+    denoising_steps=16,
+    temperature=0.0,
+    top_k=0,
+    top_p=1.0,
+    remasking_strategy="low_confidence_dynamic",
+    confidence_threshold=0.5,
+    return_step_stats=False,
+)
+# Decode the generated token ids to text and print the first response.
+print(processor.tokenizer.batch_decode(response_ids, skip_special_tokens=True)[0].strip())
+```
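The `remasking_strategy="low_confidence_dynamic"` and `confidence_threshold=0.5` arguments suggest a decoding loop that, at each denoising step, commits the masked positions whose predicted confidence clears the threshold and leaves the rest masked for later steps. The sketch below illustrates that general idea for a single step. It is a guess based on the argument names, not code from `inference.py`, and `select_positions_to_commit` is a hypothetical helper.

```python
def select_positions_to_commit(confidences, is_masked, threshold):
    """Pick the masked positions to fill in during one denoising step.

    Commits every masked position whose confidence is at least `threshold`.
    If none qualifies, commits the single most confident masked position so
    that the step always makes progress.
    """
    masked = [i for i, m in enumerate(is_masked) if m]
    if not masked:
        return []
    chosen = [i for i in masked if confidences[i] >= threshold]
    if not chosen:
        chosen = [max(masked, key=lambda i: confidences[i])]
    return chosen


# Positions 0, 1, and 3 are still masked; position 2 was committed earlier.
step = select_positions_to_commit([0.9, 0.2, 0.6, 0.4], [True, True, False, True], 0.5)
```

In this reading, a higher threshold commits fewer tokens per step and therefore needs more steps per block, while a lower threshold trades some caution for speed.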
+For video understanding, replace the image message with the video example in `inference.py`.
+---
+## 📚 Citation
+```bibtex
+@article{chen2026bard,
+  title={BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation},
+  author={Baoyou Chen and Hanchen Xia and Peng Tu and Haojun Shi and Liwei Zhang and Weihao Yuan and Siyu Zhu},
+  journal={arXiv preprint arXiv:2604.16514},
+  year={2026}
+}
+```