Instructions to use nvidia/Nemotron-Labs-Diffusion-VLM-8B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/Nemotron-Labs-Diffusion-VLM-8B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="nvidia/Nemotron-Labs-Diffusion-VLM-8B", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("nvidia/Nemotron-Labs-Diffusion-VLM-8B", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use nvidia/Nemotron-Labs-Diffusion-VLM-8B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/Nemotron-Labs-Diffusion-VLM-8B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Labs-Diffusion-VLM-8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
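
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `chat` are illustrative and not part of vLLM, and the network call is left commented out because it assumes a server is already listening on `localhost:8000`.

```python
import json
import urllib.request


def build_chat_payload(model, prompt, image_url):
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }


def chat(base_url, payload, timeout=120):
    """POST the payload to an OpenAI-compatible /v1/chat/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


payload = build_chat_payload(
    "nvidia/Nemotron-Labs-Diffusion-VLM-8B",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
# Requires the server started above to be running:
# reply = chat("http://localhost:8000", payload)
```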

Use Docker

docker model run hf.co/nvidia/Nemotron-Labs-Diffusion-VLM-8B

SGLang

How to use nvidia/Nemotron-Labs-Diffusion-VLM-8B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/Nemotron-Labs-Diffusion-VLM-8B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Labs-Diffusion-VLM-8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
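
The server answers with a JSON body in the OpenAI chat-completion shape. A small sketch of pulling the assistant text out of it is shown here; `extract_reply` is an illustrative helper, and the `sample` dict is hand-written to show the shape only, not real model output.

```python
def extract_reply(response):
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hypothetical response, written by hand for illustration only.
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "<model reply>"},
            "finish_reason": "stop",
        }
    ]
}
print(extract_reply(sample))
```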

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/Nemotron-Labs-Diffusion-VLM-8B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Labs-Diffusion-VLM-8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use nvidia/Nemotron-Labs-Diffusion-VLM-8B with Docker Model Runner:
```
docker model run hf.co/nvidia/Nemotron-Labs-Diffusion-VLM-8B
```

Nemotron-Labs-Diffusion-VLM-8B / README.md

YongganFu

Initial release of Nemotron-Labs-Diffusion-VLM-8B

c6706ba 3 days ago


	---
	library_name: transformers
	license: other
	license_name: nscl-v1
	pipeline_tag: image-text-to-text
	tags:
	- nvidia
	- pytorch
	- multimodal
	- vlm
	- diffusion-language-model
	---

	# Nemotron-Labs-Diffusion-VLM-8B


	<div align="center" style="line-height: 1;">
	<a href="https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_Diffusion_Tech_Report_v1.pdf?VersionId=db8_EMO8B.vmU26.jr7Le9pN3MqcUDNL" target="_blank" style="margin: 2px;">
	<img alt="Paper" src="https://img.shields.io/badge/📝Paper-Read Now!-536af5?color=76B900&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
	</a>
	<a href="https://huggingface.co/collections/nvidia/nemotron-labs-diffusion" target="_blank" style="margin: 2px;">
	<img alt="Nemotron-Labs-Diffusion Model Family" src="https://img.shields.io/badge/%F0%9F%A4%97-Nemotron--Labs--Diffusion_Model_Family-76B900" style="display: inline-block; vertical-align: middle;"/>
	</a>
	<a href="https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-source-code-license/" style="margin: 2px;">
	<img alt="License" src="https://img.shields.io/badge/License-NSCLv1-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
	</a>
	</div>


	[![Demo](./assets/demo.gif)](./assets/demo.mp4)


	## Model Overview

	Nemotron-Labs-Diffusion-VLM-8B is the vision-language extension of the Nemotron-Labs-Diffusion family. It pairs the same tri-mode language backbone (AR / diffusion / self-speculation, switchable by attention pattern) with a vision encoder, accepting interleaved image + text input and producing text output. The diffusion-based parallel decoding from the LM family carries over to the VLM: the language head can draft a block of tokens in parallel and verify it autoregressively against a shared KV cache, so the family's decoding efficiency extends to multimodal prompts.
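
The draft-then-verify idea can be illustrated with a toy acceptance rule. This is a generic greedy speculative-decoding sketch, not the model's actual implementation: `accept_draft` is a hypothetical helper, and it assumes `verified[i]` is the token the autoregressive pass predicts at position `i` given the prompt plus `draft[:i]`.

```python
def accept_draft(draft, verified):
    """Keep drafted tokens up to the first disagreement with the AR pass.

    At the first mismatch the AR token replaces the drafted one and the
    remaining drafted tokens are discarded.
    """
    accepted = []
    for drafted_token, ar_token in zip(draft, verified):
        if drafted_token == ar_token:
            accepted.append(drafted_token)
        else:
            accepted.append(ar_token)
            break
    return accepted


# Positions 0 and 1 agree; position 2 is corrected and the rest is dropped.
print(accept_draft([5, 7, 9, 11], [5, 7, 8, 11]))
```

Under this rule every verification pass commits at least one token, so the output matches plain greedy autoregressive decoding while needing fewer passes whenever the draft is accurate.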

	<div align="center">
	<img src="./assets/teaser.png" alt="An illustration of Tri-Mode LMs" width="500">
	</div>


	## Key Design

	- 8B vision-language model in the Nemotron-Labs-Diffusion family: the same tri-mode language backbone (AR, diffusion, self-speculation) plus a Pixtral-style vision encoder.
	- Vision encoder: 24-layer, 1024-hidden, 14×14 patch, supports up to 1540×1540 images with `spatial_merge_size=2`.
	- Language decoder weights match `nvidia/Nemotron-Labs-Diffusion-8B` (34 layers, 4096 hidden, 14336 intermediate); the model card structure and inference modes are inherited from the LM line.
	- Diffusion-based parallel decoding works for multimodal prompts: image tokens are placed in the bidirectional context window and text generation proceeds via the same block-wise unmasking + AR verification as the LM family.
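
To get a feel for how many image tokens the vision encoder settings above imply, here is a rough back-of-the-envelope sketch. It assumes floor division by the patch size and that `spatial_merge_size=2` folds each 2×2 group of patches into one token, as in Pixtral-style encoders; the real preprocessor may resize or pad images and may add separator tokens, so treat the result as an estimate. `vision_token_count` is an illustrative helper, not part of the repository.

```python
def vision_token_count(height, width, patch_size=14, spatial_merge_size=2):
    """Estimate image tokens after patching and spatial merging."""
    patches_h = height // patch_size
    patches_w = width // patch_size
    merged_h = patches_h // spatial_merge_size
    merged_w = patches_w // spatial_merge_size
    return merged_h * merged_w


# 1540 / 14 = 110 patches per side; merging 2x2 gives 55 x 55 tokens.
print(vision_token_count(1540, 1540))  # 3025
```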


	## License/Terms of Use

	Use of this model is governed by the NVIDIA Source Code License (NSCLv1).


	## Environment

	```bash
	pip install "transformers>=5.0.0" pillow requests opencv-python
	```


	## Chat with Our Model

	```python
	import sys
	import torch
	from huggingface_hub import snapshot_download
	from transformers import AutoModel, AutoTokenizer

	repo_name = "nvidia/Nemotron-Labs-Diffusion-VLM-8B"
	# Make the repo's `image_processing` module importable.
	sys.path.insert(0, snapshot_download(repo_name))
	from image_processing import process_messages

	tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
	model = AutoModel.from_pretrained(repo_name, trust_remote_code=True).cuda().to(torch.bfloat16)

	image_path = "path/to/your/image.jpg"  # local file or http(s):// URL
	messages = [{
	    "role": "user",
	    "content": [
	        {"type": "image_url", "image_url": {"url": image_path}},
	        {"type": "text", "text": "Describe this image."},
	    ],
	}]

	# Tokenize the chat and preprocess the image in one call.
	batch = process_messages(tokenizer, messages, add_generation_prompt=True)
	prompt_ids = batch["input_ids"].to("cuda")
	pixel_values = batch["pixel_values"].to("cuda", dtype=torch.bfloat16)

	out_ids, nfe = model.generate(
	    prompt_ids,
	    pixel_values=pixel_values,
	    image_sizes=batch["image_sizes"],
	    max_new_tokens=512, steps=512, block_length=32,
	    shift_logits=False, threshold=0.9,
	    eos_token_id=tokenizer.eos_token_id,
	)

	# Decode only the newly generated tokens (everything after the prompt).
	decoded_out = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)
	print(f"Model: {decoded_out[0]}")
	print(f"[Num Function Eval (NFE)={nfe}]")
	```
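
The card does not spell out what `threshold` and `block_length` do or how NFE is counted. One plausible reading, common in block-wise diffusion decoders, is that each forward pass scores every masked position in the current block and commits all positions whose confidence meets the threshold, always committing at least the most confident one. The toy simulation below illustrates that reading only; it is an assumption, not the repository's code, and `select_to_unmask`, `decode_block`, and `toy_scores` are hypothetical names.

```python
def select_to_unmask(conf, masked, threshold):
    """Pick the masked positions to commit in this step."""
    chosen = [i for i in masked if conf[i] >= threshold]
    if not chosen:
        # Commit the single most confident position so decoding progresses.
        chosen = [max(masked, key=lambda i: conf[i])]
    return chosen


def decode_block(score_fn, block_length, threshold):
    """Unmask one block and return the number of forward passes (NFE)."""
    masked = set(range(block_length))
    nfe = 0
    while masked:
        conf = score_fn(masked)  # stands in for one model forward pass
        nfe += 1
        for i in select_to_unmask(conf, sorted(masked), threshold):
            masked.discard(i)
    return nfe


def toy_scores(masked):
    """Fixed toy confidences: even positions are easy, odd ones are hard."""
    return [0.95, 0.5, 0.95, 0.5]


for threshold in (0.4, 0.9, 1.0):
    print(threshold, decode_block(toy_scores, block_length=4, threshold=threshold))
```

In this toy setting a lower threshold commits more tokens per pass and so lowers NFE, while a threshold of 1.0 degenerates to one token per pass.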


	## Ethical Considerations
	NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the [bias](./model_cards/bias.md), [explainability](./model_cards/explainability.md), [safety & security](./model_cards/safety.md), and [privacy](./model_cards/privacy.md) subcards.

	Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).


	## Citations

	```bibtex
	@techreport{fu2026nemotronlabsdiffusion,
	title = {Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding},
	author = {Yonggan Fu and Lexington Whalen and Abhinav Garg and Chengyue Wu and Maksim Khadkevich and Nicolai Oswald and Enze Xie and Daniel Egert and Sharath Turuvekere Sreenivas and Shizhe Diao and Chenhan Yu and Ye Yu and Weijia Chen and Sajad Norouzi and Shiyi Lan and Ligeng Zhu and Jin Wang and Jindong Jiang and Morteza Mardani and Mehran Maghoumi and Song Han and Ante Jukic and Nima Tajbakhsh and Jan Kautz and Pavlo Molchanov},
	institution = {NVIDIA},
	year = {2026},
	note = {Technical report}
	}
	```