| library_name: transformers | |
| license: other | |
| license_name: nvidia-nemotron-open-model-license | |
| license_link: >- | |
| https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/ | |
| pipeline_tag: text-generation | |
| tags: | |
| - nvidia | |
| - pytorch | |
| # Nemotron-Labs-Diffusion-8B-Base | |
| <div align="center" style="line-height: 1;"> | |
| <a href="https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_Diffusion_Tech_Report_v1.pdf?VersionId=db8_EMO8B.vmU26.jr7Le9pN3MqcUDNL" target="_blank" style="margin: 2px;"> | |
| <img alt="Chat" src="https://img.shields.io/badge/📝Paper-Read Now!-536af5?color=76B900&logoColor=white" style="display: inline-block; vertical-align: middle;"/> | |
| </a> | |
| <a href="https://huggingface.co/collections/nvidia/nemotron-labs-diffusion" target="_blank" style="margin: 2px;"> | |
| <img alt="Nemotron-Labs-Diffusion Model Family" src="https://img.shields.io/badge/%F0%9F%A4%97-Nemotron--Labs--Diffusion_Model_Family-76B900" style="display: inline-block; vertical-align: middle;"/> | |
| </a> | |
| <a href="https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/" style="margin: 2px;"> | |
| <img alt="License" src="https://img.shields.io/badge/License-NVIDIA Open Model License-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/> | |
| </a> | |
| </div> | |
| [Demo video](./assets/demo.mp4) | |
| ## Model Overview | |
| Nemotron-Labs-Diffusion is a tri-mode language model that supports both autoregressive (AR) decoding and diffusion-based parallel decoding; switching between them only requires changing the attention pattern of the same model at inference time. Combining the two modes enables a third, self-speculation, in which the same model drafts tokens in parallel with diffusion and verifies them with AR decoding over a shared KV cache, achieving high acceptance lengths and decoding efficiency. Because mode switching only changes the attention pattern, a single model can run efficiently across concurrency levels and deployment scenarios. | |
| <div align="center"> | |
| <img src="./assets/teaser.png" alt="An illustration of Tri-Mode LMs" width="500"> | |
| </div> | |
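As a rough illustration of what "switching the attention pattern" can mean, the sketch below builds two boolean masks over the same sequence: a standard causal mask for AR decoding, and a block-wise mask in which tokens attend bidirectionally within their own block and causally to earlier blocks. The block-wise layout is an assumption made for illustration (this card does not spell out the exact diffusion mask), and `causal_mask`, `block_mask`, and `show` are hypothetical helper names, not part of the model's API.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def block_mask(seq_len, block_length):
    """Assumed diffusion-style mask: full attention inside a block, causal across blocks."""
    return [
        [j // block_length <= i // block_length for j in range(seq_len)]
        for i in range(seq_len)
    ]


def show(mask):
    """Render a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


print("AR (causal):")
print(show(causal_mask(4)))
print("Block-wise (block_length=2):")
print(show(block_mask(4, 2)))
```

With `block_length=1` the block-wise mask reduces to the causal one, which is one way to see how a single set of weights could serve both modes under this assumption.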
| ## Highlights | |
| - State-of-the-art (SOTA) 3B, 8B, and 14B dense LM family (base, instruct, and vision-language variants) supporting AR, diffusion, and self-speculation decoding, with a focus on decoding efficiency. | |
| - Generation moves from a memory-bound regime toward a compute-bound regime: model weights are loaded once and reused to compute multiple tokens during generation. | |
| - Self-speculation uses diffusion for drafting and AR for verification, providing a stronger alternative to MTP approaches: | |
| * 3x higher acceptance length and 2.2x speed-up vs. Qwen3-8B-Eagle3 in SGLang. | |
| * 5.9x the tokens per forward pass of Qwen3-8B (no MTP) at the same accuracy. | |
| - Real-device speed-up across platforms: | |
| * DGX Spark (8B, concurrency 1): 2.7x faster with 112 tok/sec vs. 41.8 tok/sec AR using w4a16. | |
| * GB200 (8B, concurrency 1): 3.3x faster with 850 tok/sec vs. 253 tok/sec AR and 360 tok/sec Eagle3. Custom CUDA kernels boost to 1015 tok/sec (4x). | |
| - A diffusion speed-of-light analysis suggests that single-user throughput could be doubled again (vs. the current best) with better sampling, which is left to future research. | |
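The speed-up factors quoted in the list above follow from the reported throughputs. The short check below recomputes them; all numbers are taken from the bullets above, and `speedup` is just an illustrative helper.

```python
def speedup(new_tok_per_sec, baseline_tok_per_sec):
    """Ratio of two throughputs, i.e. how many times faster the new setup is."""
    return new_tok_per_sec / baseline_tok_per_sec


# Throughputs (tok/sec) as reported in the highlights, 8B model at concurrency 1.
reported = {
    "DGX Spark vs. AR": (112, 41.8),
    "GB200 vs. AR": (850, 253),
    "GB200 vs. Eagle3": (850, 360),
    "GB200 with custom CUDA kernels vs. AR": (1015, 253),
}

for name, (new, base) in reported.items():
    print(f"{name}: {speedup(new, base):.2f}x")
```

Recomputed from the rounded throughputs, the ratios land within about 0.1x of the quoted 2.7x, 3.3x, and 4x figures.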
| <div align="center"> | |
| <img src="./assets/result_acc.png" alt="Accuracy Results" width="800"> | |
| </div> | |
| <div align="center"> | |
| <img src="./assets/result_efficiency.png" alt="Efficiency Results" width="800"> | |
| </div> | |
| ## License/Terms of Use | |
| Use of this model is governed by the [NVIDIA Nemotron Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/). | |
| ## Environment | |
| ```bash | |
| pip install "transformers>=5.0.0" | |
| ``` | |
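To confirm at runtime that the installed version satisfies this requirement, a small check such as the following may help. It is a sketch that compares only the leading numeric release components and ignores pre-release suffixes such as `rc1`; `release_tuple` and `meets_minimum` are hypothetical helpers, and `packaging.version` is the more robust choice when that package is available.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string):
    """Extract the leading numeric release, e.g. '5.0.0rc1' -> (5, 0, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum="5.0.0"):
    """True when the installed release is at least the minimum release."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '5.0' and '5.0.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


try:
    installed = version("transformers")
    print(f"transformers {installed}: ok={meets_minimum(installed)}")
except PackageNotFoundError:
    print("transformers is not installed")
```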
| ## Chat with Our Model | |
| ```python | |
| from transformers import AutoModel, AutoTokenizer | |
| import torch | |
| repo_name = "nvidia/Nemotron-Labs-Diffusion-8B-Base" | |
| tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True) | |
| model = AutoModel.from_pretrained(repo_name, trust_remote_code=True) | |
| model = model.cuda().to(torch.bfloat16) | |
| history = [] | |
| user_input = input("User: ").strip() | |
| history.append({"role": "user", "content": user_input}) | |
| prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True) | |
| prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device='cuda') | |
| # The three calls below are alternatives; run the one for the mode you want. | |
| ## Chat in AR Mode | |
| out_ids, nfe = model.ar_generate(prompt_ids, max_new_tokens=512) | |
| ## Chat in dLM Mode | |
| out_ids, nfe = model.generate(prompt_ids, max_new_tokens=512, block_length=32, threshold=0.9, eos_token_id=tokenizer.eos_token_id) | |
| ## Chat in Linear Self-Speculation Mode | |
| out_ids, nfe = model.linear_spec_generate(prompt_ids, max_new_tokens=512, block_length=32, eos_token_id=tokenizer.eos_token_id) | |
| tokenized_out = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)[0] | |
| print(f"Model: {tokenized_out}") | |
| print(f"[Num Function Eval (NFE)={nfe}]") | |
| ``` | |
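The returned `nfe` can be used to estimate decoding efficiency as generated tokens per forward pass, the metric quoted in the highlights. The sketch below assumes that `nfe` counts model forward passes for the whole call and that the number of new tokens is `out_ids.shape[1] - prompt_ids.shape[1]`; `tokens_per_forward` is a hypothetical helper, and the numbers in the example are made up to show the arithmetic.

```python
def tokens_per_forward(total_len, prompt_len, nfe):
    """Average number of newly generated tokens per forward pass (NFE)."""
    if nfe <= 0:
        raise ValueError("nfe must be positive")
    return (total_len - prompt_len) / nfe


# Illustrative numbers only: 20 prompt tokens, 276 total tokens, 64 forward passes.
print(tokens_per_forward(total_len=276, prompt_len=20, nfe=64))  # 4.0
```

With the chat snippet above, the call would be `tokens_per_forward(out_ids.shape[1], prompt_ids.shape[1], nfe)`. Under the same assumption, pure AR decoding would give a value close to 1.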
| ## Inference with Linear Self-Speculation + LoRA-enhanced Drafter | |
| An optional LoRA adapter can be applied to the diffusion drafter in linear self-speculation mode to further increase the acceptance length: | |
| ```python | |
| import torch | |
| from transformers import AutoModel, AutoTokenizer | |
| from peft import PeftModel | |
| repo = "nvidia/Nemotron-Labs-Diffusion-8B-Base" | |
| tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True) | |
| model = AutoModel.from_pretrained(repo, trust_remote_code=True) | |
| model = model.cuda().to(torch.bfloat16) | |
| # Attach the linear_spec LoRA adapter. | |
| model = PeftModel.from_pretrained(model, repo, subfolder="linear_spec_lora").eval() | |
| # Unwrap so we can call linear_spec_generate directly (it toggles LoRA internally). | |
| base = model.model | |
| history = [{"role": "user", "content": "Solve: What is 15% of 240?"}] | |
| prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True) | |
| prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda() | |
| out_ids, nfe = base.linear_spec_generate( | |
| prompt_ids, max_new_tokens=512, block_length=32, | |
| eos_token_id=tokenizer.eos_token_id, | |
| ) | |
| print(tokenizer.decode(out_ids[0, prompt_ids.shape[1]:], skip_special_tokens=True)) | |
| print(f"[NFE={nfe}]") | |
| ``` | |
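Acceptance length, the quantity the LoRA-enhanced drafter aims to increase, can be illustrated with a generic greedy draft-and-verify rule: keep the longest prefix of drafted tokens that matches what the verifier would have produced, then take the verifier's token at the first mismatch. This is a textbook-style sketch, not the logic inside `linear_spec_generate`; `accept_draft` and the token IDs are hypothetical.

```python
def accept_draft(draft_tokens, verifier_tokens):
    """Greedy verification of one drafted block.

    draft_tokens:    tokens proposed in parallel by the drafter.
    verifier_tokens: the verifier's own greedy choice at each drafted position
                     (same length as the draft).
    Returns (accepted_tokens, acceptance_length), where acceptance_length counts
    only the drafted tokens that matched.
    """
    if len(draft_tokens) != len(verifier_tokens):
        raise ValueError("draft and verifier outputs must have equal length")

    accepted = []
    matched = 0
    for drafted, verified in zip(draft_tokens, verifier_tokens):
        if drafted != verified:
            # First mismatch: keep the verifier's correction and stop.
            accepted.append(verified)
            break
        accepted.append(drafted)
        matched += 1
    return accepted, matched


tokens, length = accept_draft([11, 42, 7, 99], [11, 42, 8, 5])
print(tokens, length)  # [11, 42, 8] 2
```

A better drafter agrees with the verifier for longer runs, so more tokens are committed per verification pass.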
| ## Ethical Considerations | |
| NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the [bias](./model_cards/bias.md), [explainability](./model_cards/explainability.md), [safety & security](./model_cards/safety.md), and [privacy](./model_cards/privacy.md) subcards. | |
| Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). | |
| ## Citations | |
| ```bibtex | |
| @techreport{fu2026nemotronlabsdiffusion, | |
| title = {Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding}, | |
| author = {Yonggan Fu and Lexington Whalen and Abhinav Garg and Chengyue Wu and Maksim Khadkevich and Nicolai Oswald and Enze Xie and Daniel Egert and Sharath Turuvekere Sreenivas and Shizhe Diao and Chenhan Yu and Ye Yu and Weijia Chen and Sajad Norouzi and Shiyi Lan and Ligeng Zhu and Jin Wang and Jindong Jiang and Morteza Mardani and Mehran Maghoumi and Song Han and Ante Jukic and Nima Tajbakhsh and Jan Kautz and Pavlo Molchanov}, | |
| institution = {NVIDIA}, | |
| year = {2026}, | |
| note = {Technical report} | |
| } | |
| ``` | |