---
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
language:
- en
- zh
tags:
- Bard-VL
- VLM
- vision-language
- multimodal
- discrete-diffusion
- masked-decoding
- custom_code
metrics:
- accuracy
---

<h1 align="center">BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation</h1>


<p align="center">
<a href="https://github.com/cbyzju">Baoyou Chen</a><sup>1,3</sup> ·
<a href="https://github.com/1ring2rta">Hanchen Xia</a><sup>1</sup> ·
<a href="https://github.com/yhpengtu-rgb">Peng Tu</a><sup>1</sup> ·
<a href="https://github.com/Theseus-427">Haojun Shi</a><sup>1</sup> ·
<a href="https://github.com/AricGamma">Liwei Zhang</a><sup>1</sup> ·
<a href="https://github.com/weihaosky">Weihao Yuan</a><sup>4</sup> ·
<a href="https://sites.google.com/site/zhusiyucs/home">Siyu Zhu</a><sup>1,2,3,†</sup>
</p>


<p align="center">
<sup>1</sup>Shanghai Academy of AI for Science
·
<sup>2</sup>Shanghai Innovation Institute
·
<sup>3</sup>Fudan University
·
<sup>4</sup>Nanjing University
</p>


<p align="center">
🤗 <a href="https://huggingface.co/fudan-generative-ai/Bard-VL-B4-Mask-4B-Instruct">Model</a>
|
🌐 <a href="https://fudan-generative-vision.github.io/Bard-VL">Project Page</a>
|
📄 <a href="https://huggingface.co/papers/2604.16514">Paper</a>
|
✨ <a href="https://github.com/fudan-generative-vision/Bard-VL">Code</a>
</p>


# Bard-VL-B4-Mask-4B-Instruct


**Bard-VL-B4-Mask-4B-Instruct** is a 4B-class vision-language instruction model with **masked discrete-diffusion decoding**.


It is part of the **Bard-VL** family and is designed to bridge autoregressive and diffusion-style vision-language models through **Progressive Block Merging (PBM)** and **Stage-Wise Distillation (SWD)**.
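
To make the merging idea concrete, here is a tiny, hypothetical sketch of what a progressive block-size schedule could look like (the doubling schedule and the `pbm_schedule` helper are illustrative assumptions, not the paper's exact recipe):

```python
def pbm_schedule(final_block_size: int = 4) -> list[int]:
    """Hypothetical PBM schedule: grow the decoding block size stage by
    stage (here by doubling) instead of jumping straight from block size 1
    (autoregressive-like) to the final diffusion block size."""
    sizes, b = [], 1
    while b <= final_block_size:
        sizes.append(b)  # one training/distillation stage per block size
        b *= 2
    return sizes

print(pbm_schedule())  # [1, 2, 4] for this B4 checkpoint
```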


Compared with a standard autoregressive VLM release, Bard-VL emphasizes:


- **parallel block-wise decoding instead of token-by-token generation**
- **controllable response generation through block-wise denoising** (a schematic decoding sketch follows this list)
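
The loop below is a minimal, self-contained sketch of that block-wise behavior, assuming a hypothetical `MASK_ID` and a `logits_fn` stand-in for the model forward pass; it is illustrative only, not the Bard-VL implementation:

```python
import torch

MASK_ID = 0  # hypothetical mask-token id

def decode_block(logits_fn, prefix: torch.Tensor, block_size: int = 4, steps: int = 4) -> torch.Tensor:
    """Denoise one block of `block_size` masked tokens in at most `steps`
    parallel passes, committing the most confident predictions each pass."""
    block = torch.full((block_size,), MASK_ID, dtype=torch.long)
    per_step = max(1, block_size // steps)  # tokens committed per denoising step
    pred = block.clone()
    for _ in range(steps):
        masked = block == MASK_ID
        if not masked.any():
            break
        # One forward pass predicts every position in the block at once.
        logits = logits_fn(torch.cat([prefix, block]))[-block_size:]
        conf, pred = logits.softmax(-1).max(-1)
        conf = conf.masked_fill(~masked, float("-inf"))  # only fill masked slots
        top = conf.topk(min(per_step, int(masked.sum()))).indices
        block[top] = pred[top]
    block[block == MASK_ID] = pred[block == MASK_ID]  # force-fill any leftovers
    return block

# Toy usage with random logits standing in for the model:
stub = lambda ids: torch.randn(ids.shape[0], 32)
print(decode_block(stub, prefix=torch.tensor([5, 7, 9])))
```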


---


## ✨ Highlights


- **Progressive Block Merging**: Bard-VL increases the decoding block size progressively instead of jumping directly from autoregressive decoding to large-block diffusion (see the schedule sketch above).
- **Stage-Wise dVLM Distillation**: Bard-VL distills from a small-block diffusion anchor in the same denoising regime, reducing autoregressive-to-diffusion transfer mismatch.
- **Packed Multimodal Attention Mask**: the packed attention layout reuses shared multimodal context across clean and noisy branches to reduce redundant computation.
- **Mixed-Noise Training**: Bard-VL combines masked-token and uniform-token corruption to support both token completion and visible-token revision (see the corruption sketch after this list).
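
As a rough illustration of that corruption mix, the sketch below combines the two noise types on a toy token sequence; `MASK_ID`, the noise rates, and the vocabulary size are assumptions for illustration, not the training configuration:

```python
import torch

MASK_ID = 0  # hypothetical mask-token id

def mixed_noise(tokens: torch.Tensor, vocab_size: int,
                mask_rate: float = 0.5, uniform_rate: float = 0.1) -> torch.Tensor:
    """Corrupt a clean response two ways: masking teaches token completion,
    uniform replacement teaches revision of visible tokens."""
    noisy = tokens.clone()
    u = torch.rand(tokens.shape)
    noisy[u < mask_rate] = MASK_ID                              # masked-token noise
    revise = (u >= mask_rate) & (u < mask_rate + uniform_rate)  # uniform-token noise
    noisy[revise] = torch.randint(vocab_size, (int(revise.sum()),))
    return noisy

print(mixed_noise(torch.tensor([4, 8, 15, 16, 23, 42]), vocab_size=1000))
```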


---


## 🧭 Method Structure


<p align="center">
<img src="./model.PNG" alt="Bard-VL method overview" width="100%">
</p>


<p align="center">
<em>Pipeline, block-wise attention mask, and mixed-noise scheduler used by Bard-VL.</em>
</p>


---


## 📊 Evaluation Results


### AutoRegressive Vision-Language Models


| Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Qwen3-VL | 4B | 47.9 | 35.0 | 2297 | 70.5 | 56.9 | 81.0 | 80.9 |
| Qwen3-VL | 8B | 53.0 | 36.0 | 2379 | 69.5 | 59.9 | 83.5 | 84.0 |
| InternVL3.5 | 4B | 57.4 | 38.2 | 2236 | 66.7 | 65.6 | 80.6 | 86.2 |
| InternVL3.5 | 8B | 57.2 | 41.0 | 2359 | 63.1 | 66.3 | 82.1 | 87.0 |


### Diffusion Vision-Language Models


| Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| LLaDA-V | 8B | 48.8 | 35.4 | 1998 | 63.4 | 60.4 | 77.8 | 78.2 |
| Dream-VL | 7B | 51.6 | 25.0 | 2179 | 67.7 | 59.9 | 80.4 | 86.2 |
| LaviDa | 8B | 44.2 | 28.6 | 1711 | 40.3 | 47.0 | 70.1 | 64.6 |
| SDAR-VL | 8B | 44.0 | 28.2 | 2142 | 66.1 | 53.3 | 79.6 | 82.4 |
| MMaDA | 8B | 30.2 | 21.5 | 1287 | 28.2 | 25.7 | 54.9 | 43.2 |
| Dimple-VL | 7B | 46.4 | 24.1 | 1924 | 51.9 | 47.7 | 74.2 | 58.4 |


### Bard-VL Converted from Qwen3-VL


| Model | Parameters | MMMU<sub>val</sub> | MMMU-Pro<sub>standard</sub> | MME<sub>sum</sub> | RealWorldQA | MMStar | AI2D | ChartQA |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Bard-VL (*B* = 32) | 2B | 42.0 | 27.9 | 2045 | 64.6 | 53.1 | 72.6 | 76.8 |
| Bard-VL (*B* = 32) | 4B | 53.0 | 34.2 | 2305 | 71.9 | 63.6 | 82.8 | 80.2 |
| Bard-VL (*B* = 32) | 8B | 54.6 | 37.6 | 2393 | 70.7 | 65.0 | 83.2 | 84.6 |


---


## 🛠️ Environment


Make sure your environment is aligned with the repository `requirements.txt`:


```text
python>=3.10
torch==2.8.0
torchvision==0.23.0
transformers==4.57.3
diffusers==0.36.0
accelerate==1.12.0
deepspeed==0.17.0
```


Recommended runtime settings in the local repository (these map to the `from_pretrained` and `generate` arguments in the inference example below):


```text
dtype = bfloat16
attn_implementation = sdpa
block_size = 4
denoising_steps = 4
```


---


## 🚀 Inference Example


The official repository inference flow is implemented in `inference.py`. A minimal image-understanding example aligned with that script is shown below.


```python
import torch
from transformers import AutoProcessor

from qwen_vl_utils import process_vision_info
from nemo_automodel.components.models.bard_vl import BardVLForConditionalGeneration

model_id = "fudan-generative-ai/Bard-VL-B4-Mask-4B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load in bfloat16 with SDPA attention, matching the recommended runtime settings.
model = BardVLForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    _attn_implementation="sdpa",
).to(device).eval()
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/puzzle.jpg", "min_pixels": 256 * 256, "max_pixels": 2048 * 2048},
            {"type": "text", "text": "Please describe this image."},
        ],
    },
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages,
    return_video_kwargs=True,
    return_video_metadata=False,
    image_patch_size=processor.image_processor.patch_size,
)

batch = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=False,
    return_tensors="pt",
    **video_kwargs,
).to(device)

# Block-wise diffusion decoding: each 4-token block is denoised in 4 steps;
# low-confidence predictions are re-masked until they clear the 0.5 threshold.
response_ids = model.generate(
    batch,
    max_new_tokens=1024,
    block_size=4,
    denoising_steps=4,
    temperature=0.0,
    top_k=0,
    top_p=1.0,
    remasking_strategy="low_confidence_dynamic",
    confidence_threshold=0.5,
    return_step_stats=False,
)

print(processor.tokenizer.batch_decode(response_ids, skip_special_tokens=True)[0].strip())
```


For video understanding, replace the image message with the video example in `inference.py`.
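
The `remasking_strategy="low_confidence_dynamic"` and `confidence_threshold` arguments control which predictions survive each denoising step. A minimal sketch of one plausible reading of that rule, assuming a hypothetical `MASK_ID` and simple tensors (this is not the repository implementation):

```python
import torch

MASK_ID = 0  # hypothetical mask-token id

def remask_low_confidence(pred: torch.Tensor, probs: torch.Tensor,
                          threshold: float = 0.5) -> torch.Tensor:
    """Keep predicted tokens whose max probability clears `threshold`;
    re-mask the rest so the next denoising step can revise them."""
    conf = probs.max(dim=-1).values
    out = torch.where(conf >= threshold, pred, torch.full_like(pred, MASK_ID))
    # Guarantee progress: if nothing clears the threshold, keep the single
    # most confident prediction so decoding cannot stall.
    if (out == MASK_ID).all():
        out[conf.argmax()] = pred[conf.argmax()]
    return out

# Toy usage: 4 block positions over a 10-token vocabulary.
probs = torch.softmax(torch.randn(4, 10) * 3, dim=-1)
print(remask_low_confidence(probs.argmax(-1), probs))
```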


---


## 📖 Citation


```bibtex
@article{chen2026bard,
  title={BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation},
  author={Baoyou Chen and Hanchen Xia and Peng Tu and Haojun Shi and Liwei Zhang and Weihao Yuan and Siyu Zhu},
  journal={arXiv preprint arXiv:2604.16514},
  year={2026}
}
```
