Fast-dVLM (3B): Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM
[Paper] [Project Page] [Code] [Fast-dLLM v2]
Introduction
Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one.
Fast-dVLM is a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. Built on Qwen2.5-VL-3B-Instruct, Fast-dVLM directly converts the pretrained AR VLM into a block-diffusion model in a single stage, leveraging the already multimodally aligned VLM.
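At a high level, block diffusion generates each block by filling it with mask tokens and denoising the whole block in parallel; once a block is fully denoised it is frozen as clean context whose KV entries can be cached. The toy loop below sketches that idea only: the `toy_denoiser`, the confidence-threshold commit rule, and all names are illustrative assumptions, not the model's actual decoding code.

```python
import random

MASK = None  # placeholder for the mask token in this toy sketch

def toy_denoiser(context, block):
    """Stand-in for the model: propose a token and a confidence for every
    masked position in the current block (a real model returns logits)."""
    return {i: (f"tok{len(context) + i}", random.random())
            for i, t in enumerate(block) if t is MASK}

def decode_block_diffusion(prompt, num_blocks=2, block_size=4, threshold=0.8):
    context = list(prompt)  # clean tokens: KV for these can be cached
    for _ in range(num_blocks):
        block = [MASK] * block_size
        while MASK in block:
            preds = toy_denoiser(context, block)
            # Commit every position above the confidence threshold in one
            # pass; always keep at least the most confident one so each
            # denoising step makes progress.
            keep = [i for i, (_, c) in preds.items() if c >= threshold]
            if not keep:
                keep = [max(preds, key=lambda i: preds[i][1])]
            for i in keep:
                block[i] = preds[i][0]
        context.extend(block)  # block is clean; append to cached context
    return context

out = decode_block_diffusion(["<prompt>"], num_blocks=2, block_size=4)
print(len(out))  # 1 prompt token + 2 blocks of 4 generated tokens -> 9
```

The contrast with AR decoding is that several positions can be committed per forward pass, which is what the Tokens/NFE numbers below measure.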
Key Highlights
- Lossless Quality: Matches the AR baseline (Qwen2.5-VL-3B) across 11 multimodal benchmarks (74.0 avg).
- Up to 6.18x Speedup: With SGLang integration and FP8 quantization.
- 2.63x Tokens/NFE: With self-speculative block decoding.
- Direct Conversion: Single-stage AR-to-diffusion conversion outperforms the two-stage approach (73.3 vs. 60.2 avg).
Key Techniques
- Block-Size Annealing: Curriculum that progressively increases the block size during training.
- Causal Context Attention: Noisy tokens attend bidirectionally within their own block (N2N) and to clean tokens from preceding blocks (N2C), while clean tokens use standard causal attention (C2C).
- Auto-Truncation Masking: Prevents cross-turn leakage in multi-turn dialogue.
- Vision-Efficient Concatenation: Vision embeddings included only in the clean stream, reducing peak memory by 15% and training time by 14.2%.
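The C2C / N2C / N2N rules above can be made concrete with a small mask builder. This is a sketch under an assumed training layout (all clean tokens first, then a noisy copy of every block), not the model's actual implementation:

```python
import numpy as np

def causal_context_mask(num_blocks, block_size):
    """Boolean attention mask (True = may attend) for one training step,
    following the C2C / N2C / N2N rules described above."""
    n = num_blocks * block_size
    mask = np.zeros((2 * n, 2 * n), dtype=bool)
    # C2C: clean tokens use ordinary causal attention among themselves.
    mask[:n, :n] = np.tril(np.ones((n, n), dtype=bool))
    for b in range(num_blocks):
        rows = slice(n + b * block_size, n + (b + 1) * block_size)
        # N2C: noisy tokens of block b see clean tokens of blocks < b only,
        # never the clean copy of their own block (that would leak targets).
        mask[rows, : b * block_size] = True
        # N2N: noisy tokens attend bidirectionally within their own block.
        mask[rows, rows] = True
    return mask

m = causal_context_mask(num_blocks=2, block_size=2)
# Rows 6-7 (noisy block 1) see clean tokens 0-1 and each other,
# but not clean tokens 2-3 of their own block.
```

Clean tokens never attend to noisy tokens, so the clean stream behaves exactly like an AR pass and its KV cache stays valid at inference time.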
Model Overview
| Property | Value |
|---|---|
| Type | Block Diffusion Vision-Language Model |
| Base Model | Qwen/Qwen2.5-VL-3B-Instruct |
| Architecture | Transformer w/ M-RoPE, SwiGLU, RMSNorm, GQA |
| Text Layers | 36 |
| Vision Depth | 32 |
| Text Hidden Size | 2048 |
| Attention Heads | 16 (Q), 2 (KV, GQA) |
| Block Diffusion Size | 32 |
Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "Efficient-Large-Model/Fast_dVLM_3B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name, use_fast=False)
processor.tokenizer = tokenizer

prompt = "Describe this image in detail."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": prompt},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Id of the diffusion mask token used to fill each block before denoising.
mask_id = tokenizer.encode("|<MASK>|")[0]
generated_ids = model.generate(
    input_ids=inputs.input_ids,
    tokenizer=tokenizer,
    pixel_values=inputs.pixel_values,
    image_grid_thw=inputs.image_grid_thw,
    mask_id=mask_id,
    max_tokens=512,
)
# Strip the prompt tokens from each output before decoding.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
Benchmark Results
Fast-dVLM matches the AR baseline on 11 multimodal benchmarks while achieving 2.63x Tokens/NFE with speculative decoding.
| Model | AI2D | ChartQA | DocVQA | GQA | MMBench | MMMU | POPE | RWQA | SEED2+ | TextVQA | Avg | MMMU-Pro-V | Tok/NFE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B | 80.8 | 84.0 | 93.1 | 59.0 | 76.9 | 47.3 | 86.2 | 65.1 | 68.6 | 79.1 | 74.0 | 26.3 | 1.00 |
| Fast-dVLM (MDM) | 79.7 | 82.8 | 92.1 | 63.0 | 74.2 | 44.6 | 88.6 | 65.1 | 67.2 | 76.1 | 73.3 | 21.4 | 1.95 |
| Fast-dVLM (spec.) | 79.7 | 83.1 | 92.9 | 63.3 | 74.3 | 46.6 | 88.6 | 65.1 | 67.2 | 79.3 | 74.0 | 24.6 | 2.63 |
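Tokens/NFE is the average number of tokens committed per model forward pass (network function evaluation); AR decoding is 1.00 by construction. A toy accounting, with made-up per-step acceptance counts chosen only to illustrate the metric:

```python
def tokens_per_nfe(accepted_per_step):
    """Total generated tokens divided by the number of forward passes
    (NFEs) spent producing them."""
    return sum(accepted_per_step) / len(accepted_per_step)

# AR decoding commits exactly one token per forward pass:
print(tokens_per_nfe([1, 1, 1, 1]))              # -> 1.0
# Parallel / speculative decoding commits several tokens on good steps
# (these counts are illustrative, not measured):
print(tokens_per_nfe([4, 1, 3, 2, 4, 2, 3, 2]))  # -> 2.625
```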
Inference Acceleration
| Setting | MMMU-Pro-V | TPS | SpeedUp |
|---|---|---|---|
| AR baseline | 26.3 | 56.7 | 1.00x |
| Fast-dVLM (MDM, t=0.9) | 21.4 | 82.2 | 1.45x |
| + Spec. decoding (linear) | 24.6 | 112.7 | 1.98x |
| + SGLang serving | 24.1 | 319.0 | 5.63x |
| + SmoothQuant-W8A8 (FP8) | 23.8 | 350.3 | 6.18x |
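The SpeedUp column is end-to-end throughput (TPS) relative to the AR baseline. A quick consistency check of the ratios; last-digit drift is expected because the published TPS values are themselves rounded.

```python
# Reported TPS and speedup pairs from the acceleration table above.
baseline_tps = 56.7
settings = {
    "MDM, t=0.9": (82.2, 1.45),
    "+ spec. decoding": (112.7, 1.98),
    "+ SGLang serving": (319.0, 5.63),
    "+ SmoothQuant-W8A8": (350.3, 6.18),
}
for name, (tps, reported) in settings.items():
    ratio = tps / baseline_tps
    # Each recomputed ratio should agree with the table to ~2 decimals.
    assert abs(ratio - reported) < 0.02, (name, ratio)
    print(f"{name}: {ratio:.2f}x (reported {reported}x)")
```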
Citation
If you use Fast-dVLM in your research, please cite:
```bibtex
@misc{wu2026fastdvlmefficientblockdiffusionvlm,
      title={Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM},
      author={Chengyue Wu and Shiyi Lan and Yonggan Fu and Sensen Gao and Jin Wang and Jincheng Yu and Jose M. Alvarez and Pavlo Molchanov and Ping Luo and Song Han and Ligeng Zhu and Enze Xie},
      year={2026},
      eprint={2604.06832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.06832},
}
```
License
Released under Apache 2.0, following the base Qwen2.5-VL license.