Fast-dVLM (3B): Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM

[Paper] [Project Page] [Code] [Fast-dLLM v2]

Introduction

Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one.

Fast-dVLM is a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. Built on Qwen2.5-VL-3B-Instruct, Fast-dVLM directly converts the pretrained AR VLM into a block-diffusion model in a single stage, leveraging the already multimodally aligned VLM.
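
To make the decoding scheme concrete, here is a minimal, self-contained sketch of confidence-thresholded parallel decoding of one block, with a random number generator standing in for real model confidences. The function name, threshold rule, and fallback are illustrative, not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def decode_block(block_size: int = 32, threshold: float = 0.9) -> int:
    """Toy parallel decoding of one all-mask block.

    Each "forward pass" yields one confidence per still-masked position;
    every position above `threshold` is unmasked in parallel, and the
    single most confident position is always taken as a fallback, so the
    loop terminates. Returns the number of forward passes (NFEs) used.
    """
    masked = np.ones(block_size, dtype=bool)
    nfe = 0
    while masked.any():
        nfe += 1
        conf = rng.random(block_size)   # stand-in for model confidences
        conf[~masked] = -1.0            # ignore already-decoded positions
        accept = masked & (conf > threshold)
        if not accept.any():            # fallback: unmask the argmax
            accept[np.argmax(conf)] = True
        masked &= ~accept
    return nfe
```

Unmasking several tokens per forward pass is what the Tokens/NFE metric in the benchmark tables measures.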

Key Highlights

  • Lossless Quality: matches the AR baseline (Qwen2.5-VL-3B) with a 74.0 average across 11 multimodal benchmarks.
  • Up to 6.18x Speedup: with SGLang serving integration and FP8 quantization.
  • 2.63x Tokens/NFE: with self-speculative block decoding.
  • Direct Conversion: single-stage AR-to-diffusion conversion outperforms a two-stage approach (73.3 vs. 60.2 avg).

Key Techniques

  • Block-Size Annealing: a curriculum that progressively increases the diffusion block size during training.
  • Causal Context Attention: noisy tokens attend bidirectionally within their block (N2N) and to clean tokens from preceding blocks (N2C), while clean tokens use standard causal attention (C2C).
  • Auto-Truncation Masking: prevents cross-turn information leakage in multi-turn dialogue training.
  • Vision-Efficient Concatenation: vision embeddings are included only in the clean stream, reducing peak memory by 15% and training time by 14.2%.
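
The attention rules above can be expressed as a single boolean mask over a concatenated [clean; noisy] sequence. The sketch below is illustrative (it assumes equal-sized blocks and this particular two-stream layout), not the model's actual implementation:

```python
import numpy as np

def block_diffusion_mask(num_blocks: int, block: int) -> np.ndarray:
    """Illustrative attention mask for block-diffusion training.

    Sequence layout: a clean stream of num_blocks*block tokens followed
    by a noisy stream of the same length. True = attention allowed.
    """
    L = num_blocks * block
    mask = np.zeros((2 * L, 2 * L), dtype=bool)
    blk = np.arange(L) // block               # block index per position

    # C2C: clean tokens use standard causal attention over clean tokens.
    mask[:L, :L] = np.tril(np.ones((L, L), dtype=bool))

    for i in range(L):
        q = L + i                             # noisy query position
        # N2N: bidirectional attention within the same noisy block.
        mask[q, L:][blk == blk[i]] = True
        # N2C: attend to clean tokens from strictly preceding blocks.
        mask[q, :L][blk < blk[i]] = True
    return mask
```

Spot checks: a noisy token sees noisy tokens of its own block and clean tokens of strictly earlier blocks, while clean tokens remain causal among themselves.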

Model Overview

| Property | Value |
|---|---|
| Type | Block-diffusion vision-language model |
| Base Model | Qwen/Qwen2.5-VL-3B-Instruct |
| Architecture | Transformer w/ M-RoPE, SwiGLU, RMSNorm, GQA |
| Text Layers | 36 |
| Vision Depth | 32 |
| Text Hidden Size | 2048 |
| Attention Heads | 16 (Q), 2 (KV, GQA) |
| Diffusion Block Size | 32 |

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "Efficient-Large-Model/Fast_dVLM_3B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name, use_fast=False)
processor.tokenizer = tokenizer

prompt = "Describe this image in detail."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": prompt},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Id of the diffusion mask token used to fill each block before denoising.
mask_id = tokenizer.encode("|<MASK>|")[0]

# Block-diffusion generation: masked positions are denoised block by block.
generated_ids = model.generate(
    input_ids=inputs.input_ids,
    tokenizer=tokenizer,
    pixel_values=inputs.pixel_values,
    image_grid_thw=inputs.image_grid_thw,
    mask_id=mask_id,
    max_tokens=512,
)

# Strip the prompt tokens, keeping only the newly generated ones.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Benchmark Results

Fast-dVLM matches the AR baseline on 11 multimodal benchmarks while achieving 2.63x Tokens/NFE with speculative decoding.

| Model | AI2D | ChartQA | DocVQA | GQA | MMBench | MMMU | POPE | RWQA | SEED2+ | TextVQA | Avg | MMMU-Pro-V | Tok/NFE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B | 80.8 | 84.0 | 93.1 | 59.0 | 76.9 | 47.3 | 86.2 | 65.1 | 68.6 | 79.1 | 74.0 | 26.3 | 1.00 |
| Fast-dVLM (MDM) | 79.7 | 82.8 | 92.1 | 63.0 | 74.2 | 44.6 | 88.6 | 65.1 | 67.2 | 76.1 | 73.3 | 21.4 | 1.95 |
| Fast-dVLM (spec.) | 79.7 | 83.1 | 92.9 | 63.3 | 74.3 | 46.6 | 88.6 | 65.1 | 67.2 | 79.3 | 74.0 | 24.6 | 2.63 |

Inference Acceleration

| Setting | MMMU-Pro-V | TPS (tokens/s) | SpeedUp |
|---|---|---|---|
| AR baseline | 26.3 | 56.7 | 1.00x |
| Fast-dVLM (MDM, t=0.9) | 21.4 | 82.2 | 1.45x |
| + Spec. decoding (linear) | 24.6 | 112.7 | 1.98x |
| + SGLang serving | 24.1 | 319.0 | 5.63x |
| + SmoothQuant-W8A8 (FP8) | 23.8 | 350.3 | 6.18x |
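
As a sanity check, the SpeedUp column follows directly from the TPS column: each entry is that setting's throughput divided by the AR baseline's 56.7 tokens/s, up to rounding in the reported figures:

```python
# Cross-check the SpeedUp column: reported TPS divided by the AR baseline.
baseline_tps = 56.7
tps = {
    "Fast-dVLM (MDM, t=0.9)": (82.2, 1.45),
    "+ Spec. decoding (linear)": (112.7, 1.98),
    "+ SGLang serving": (319.0, 5.63),
    "+ SmoothQuant-W8A8 (FP8)": (350.3, 6.18),
}
for name, (throughput, reported) in tps.items():
    speedup = throughput / baseline_tps
    assert abs(speedup - reported) < 0.01, (name, speedup)
```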

Citation

If you use Fast-dVLM in your research, please cite:

@misc{wu2026fastdvlmefficientblockdiffusionvlm,
      title={Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM},
      author={Chengyue Wu and Shiyi Lan and Yonggan Fu and Sensen Gao and Jin Wang and Jincheng Yu and Jose M. Alvarez and Pavlo Molchanov and Ping Luo and Song Han and Ligeng Zhu and Enze Xie},
      year={2026},
      eprint={2604.06832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.06832},
}

License

Released under Apache 2.0, following the base Qwen2.5-VL license.
