Instructions to use nvidia/Nemotron-Labs-Diffusion-VLM-8B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/Nemotron-Labs-Diffusion-VLM-8B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="nvidia/Nemotron-Labs-Diffusion-VLM-8B", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("nvidia/Nemotron-Labs-Diffusion-VLM-8B", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use nvidia/Nemotron-Labs-Diffusion-VLM-8B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/Nemotron-Labs-Diffusion-VLM-8B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Labs-Diffusion-VLM-8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
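
The same request can be built from Python with only the standard library. This is a minimal sketch, assuming the server started above is listening on `localhost:8000`; the helper name `build_chat_request` is hypothetical and not part of vLLM. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, image_url):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",  # assumed host and port, matching the curl call above
    "nvidia/Nemotron-Labs-Diffusion-VLM-8B",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
print(request.full_url)
```

To send it, pass `request` to `urllib.request.urlopen` and decode the JSON body of the response.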

Use Docker

docker model run hf.co/nvidia/Nemotron-Labs-Diffusion-VLM-8B

SGLang

How to use nvidia/Nemotron-Labs-Diffusion-VLM-8B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/Nemotron-Labs-Diffusion-VLM-8B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Labs-Diffusion-VLM-8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/Nemotron-Labs-Diffusion-VLM-8B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Labs-Diffusion-VLM-8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use nvidia/Nemotron-Labs-Diffusion-VLM-8B with Docker Model Runner:
```
docker model run hf.co/nvidia/Nemotron-Labs-Diffusion-VLM-8B
```

YongganFu

pmolchanov committed 3 days ago

Commit c6706ba, 0 parents

Initial release of Nemotron-Labs-Diffusion-VLM-8B

Co-authored-by: pmolchanov <pmolchanov@users.noreply.huggingface.co>

Files changed (28)

.gitattributes +41 -0
README.md +124 -0
assets/demo.gif +3 -0
assets/demo.mp4 +3 -0
assets/result_acc.png +3 -0
assets/result_efficiency.png +3 -0
assets/teaser.png +3 -0
chat_template.jinja +227 -0
chat_utils.py +272 -0
config.json +106 -0
configuration_nemotron_labs_diffusion_vlm.py +259 -0
generation_config.json +11 -0
image_processing.py +296 -0
model-00001-of-00004.safetensors +3 -0
model-00002-of-00004.safetensors +3 -0
model-00003-of-00004.safetensors +3 -0
model-00004-of-00004.safetensors +3 -0
model.safetensors.index.json +539 -0
model_cards/bias.md +4 -0
model_cards/explainability.md +13 -0
model_cards/privacy.md +11 -0
model_cards/safety.md +6 -0
modeling_ministral.py +629 -0
modeling_nemotron_labs_diffusion_vlm.py +1378 -0
special_tokens_map.json +33 -0
tokenization_nemotron_labs_diffusion_vlm.py +46 -0
tokenizer.json +3 -0
tokenizer_config.json +0 -0

.gitattributes ADDED

	@@ -0,0 +1,41 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/demo.gif filter=lfs diff=lfs merge=lfs -text
+assets/demo.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/teaser.png filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/result_acc.png filter=lfs diff=lfs merge=lfs -text
+assets/result_efficiency.png filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,124 @@

+---
+library_name: transformers
+license: other
+license_name: nscl-v1
+pipeline_tag: image-text-to-text
+tags:
+- nvidia
+- pytorch
+- multimodal
+- vlm
+- diffusion-language-model
+---
+# Nemotron-Labs-Diffusion-VLM-8B
+<div align="center" style="line-height: 1;">
+<a href="https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_Diffusion_Tech_Report_v1.pdf?VersionId=db8_EMO8B.vmU26.jr7Le9pN3MqcUDNL" target="_blank" style="margin: 2px;">
+    <img alt="Chat" src="https://img.shields.io/badge/📝Paper-Read Now!-536af5?color=76B900&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
+</a>
+<a href="https://huggingface.co/collections/nvidia/nemotron-labs-diffusion" target="_blank" style="margin: 2px;">
+    <img alt="Nemotron-Labs-Diffusion Model Family" src="https://img.shields.io/badge/%F0%9F%A4%97-Nemotron--Labs--Diffusion_Model_Family-76B900" style="display: inline-block; vertical-align: middle;"/>
+</a>
+<a href="https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-source-code-license/" style="margin: 2px;">
+  <img alt="License" src="https://img.shields.io/badge/License-NSCLv1-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
+</a>
+</div>
+[![Demo](./assets/demo.gif)](./assets/demo.mp4)
+## Model Overview
+Nemotron-Labs-Diffusion-VLM-8B is the vision-language extension of the Nemotron-Labs-Diffusion family. It pairs the same tri-mode language backbone (AR / diffusion / self-speculation, switchable by attention pattern) with a vision encoder, accepting interleaved image and text input and producing text output. The family's diffusion-based parallel decoding carries over to the VLM: the language head can draft a block in parallel and verify it autoregressively against a shared KV cache, so the decoding efficiency of the LM family extends to multimodal prompts.
+<div align="center">
+<img src="./assets/teaser.png" alt="An illustration of Tri-Mode LMs" width="500">
+</div>
+## Key Design
+- 8B vision-language model in the Nemotron-Labs-Diffusion family — same tri-mode language backbone (AR, diffusion, self-speculation) plus a Pixtral-style vision encoder.
+- Vision encoder: 24-layer, 1024-hidden, 14×14 patch, supports up to 1540×1540 images with `spatial_merge_size=2`.
+- Language decoder weights match `nvidia/Nemotron-Labs-Diffusion-8B` (34 layers, 4096 hidden, 14336 intermediate); the model card structure and inference modes inherit from the LM line.
+- Diffusion-based parallel decoding works for multimodal prompts: image tokens are placed in the bidirectional context window and text generation proceeds via the same block-wise unmasking + AR verification as the LM family.
+## License/Terms of Use
+Use of this model is governed by the **NVIDIA Source Code License (NSCLv1)**.
+## Environment
+```text
+transformers>=5.0.0
+pillow
+requests
+opencv-python
+```
+## Chat with Our Model
+```python
+import sys
+import torch
+from huggingface_hub import snapshot_download
+from transformers import AutoModel, AutoTokenizer
+repo_name = "nvidia/Nemotron-Labs-Diffusion-VLM-8B"
+sys.path.insert(0, snapshot_download(repo_name))
+from image_processing import process_messages
+tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
+model = AutoModel.from_pretrained(repo_name, trust_remote_code=True).cuda().to(torch.bfloat16)
+image_path = "path/to/your/image.jpg"  # local file or http(s):// URL
+messages = [{
+    "role": "user",
+    "content": [
+        {"type": "image_url", "image_url": {"url": image_path}},
+        {"type": "text", "text": "Describe this image."},
+    ],
+}]
+batch = process_messages(tokenizer, messages, add_generation_prompt=True)
+prompt_ids = batch["input_ids"].to("cuda")
+pixel_values = batch["pixel_values"].to("cuda", dtype=torch.bfloat16)
+out_ids, nfe = model.generate(
+    prompt_ids,
+    pixel_values=pixel_values,
+    image_sizes=batch["image_sizes"],
+    max_new_tokens=512, steps=512, block_length=32,
+    shift_logits=False, threshold=0.9,
+    eos_token_id=tokenizer.eos_token_id,
+)
+tokenized_out = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)
+print(f"Model: {tokenized_out[0]}")
+print(f"[Num Function Eval (NFE)={nfe}]")
+```
+## Ethical Considerations
+NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the [bias](./model_cards/bias.md), [explainability](./model_cards/explainability.md), [safety & security](./model_cards/safety.md), and [privacy](./model_cards/privacy.md) subcards.
+Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
+## Citations
+```bibtex
+@techreport{fu2026nemotronlabsdiffusion,
+  title       = {Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding},
+  author      = {Yonggan Fu and Lexington Whalen and Abhinav Garg and Chengyue Wu and Maksim Khadkevich and Nicolai Oswald and Enze Xie and Daniel Egert and Sharath Turuvekere Sreenivas and Shizhe Diao and Chenhan Yu and Ye Yu and Weijia Chen and Sajad Norouzi and Shiyi Lan and Ligeng Zhu and Jin Wang and Jindong Jiang and Morteza Mardani and Mehran Maghoumi and Song Han and Ante Jukic and Nima Tajbakhsh and Jan Kautz and Pavlo Molchanov},
+  institution = {NVIDIA},
+  year        = {2026},
+  note        = {Technical report}
+}
+```
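
The vision-encoder figures in the README's Key Design list imply a rough upper bound on image tokens per image. The sketch below works through that arithmetic. It assumes one token per `spatial_merge_size × spatial_merge_size` group of 14×14 patches and ignores any resizing or extra row-break or end-of-image tokens the real processor in `image_processing.py` may add; `estimate_image_tokens` is a hypothetical helper, not part of the repository.

```python
import math


def estimate_image_tokens(height, width, patch_size=14, spatial_merge_size=2):
    """Rough count of merged vision tokens for one image (illustrative assumption)."""
    patches_h = math.ceil(height / patch_size)
    patches_w = math.ceil(width / patch_size)
    merged_h = math.ceil(patches_h / spatial_merge_size)
    merged_w = math.ceil(patches_w / spatial_merge_size)
    return merged_h * merged_w


# 1540 / 14 = 110 patches per side; merging 2x2 leaves 55 per side.
print(estimate_image_tokens(1540, 1540))  # 3025
```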

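The decoding parameters in the README example (`max_new_tokens=512, steps=512, block_length=32, threshold=0.9`) can be read against the helpers in `chat_utils.py` from this commit. The pure-Python sketch below mirrors that logic without PyTorch: the block and step split asserted in `generate_with_prefix_cache_block_diff`, the even split in `get_num_transfer_tokens`, and the threshold rule in `get_transfer_index`. Whether `model.generate` routes through exactly these helpers is an assumption, and the function names here are hypothetical.

```python
def plan_blocks(gen_length, block_length, steps):
    """Split generation into blocks and give each block an equal share of steps."""
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of the number of blocks")
    return num_blocks, steps // num_blocks


def split_evenly(num_masked, steps):
    """Tokens to unmask per step; earlier steps absorb the remainder."""
    base, remainder = divmod(num_masked, steps)
    return [base + 1 if i < remainder else base for i in range(steps)]


def select_confident(confidences, masked, threshold):
    """Unmask the most confident masked position, plus any others at or above threshold."""
    ranked = sorted(
        (i for i, is_masked in enumerate(masked) if is_masked),
        key=lambda i: confidences[i],
        reverse=True,
    )
    return sorted(
        i for rank, i in enumerate(ranked)
        if rank == 0 or confidences[i] >= threshold
    )


num_blocks, steps_per_block = plan_blocks(gen_length=512, block_length=32, steps=512)
print(num_blocks, steps_per_block)   # 16 32
print(split_evenly(10, 4))           # [3, 3, 2, 2]
print(select_confident([0.95, 0.4, 0.92, 0.99], [True, True, True, False], 0.9))  # [0, 2]
```

With a threshold set, the fixed per-step quota is replaced by the confidence rule, so a block can finish in fewer forward passes than `steps_per_block`; that is why the example reports the number of function evaluations (NFE) separately.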
assets/demo.gif ADDED

Git LFS Details

SHA256: 0d09264e272ac0f82dee36417f6a16511287ec1f8dee3b5dba3da222d791fd2c
Pointer size: 132 Bytes
Size of remote file: 8.25 MB

assets/demo.mp4 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:666d8785ac4af75931d9c677757c4ef9945bf114d07f1c4e2ebb7b893ac39006
+size 9454873
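
The three lines above are a Git LFS pointer: the repository stores this small text file, while the actual video lives in LFS storage. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper and handles only the three keys shown here.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:666d8785ac4af75931d9c677757c4ef9945bf114d07f1c4e2ebb7b893ac39006
size 9454873
"""

info = parse_lfs_pointer(pointer_text)
print(info["hash_algorithm"], f"{info['size_bytes'] / 1e6:.2f} MB")  # sha256 9.45 MB
```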

assets/result_acc.png ADDED

Git LFS Details

SHA256: 992aa22ca9eca3d0bddbcd9f49837e2a9f377bbc0f7545563b129a50b3811448
Pointer size: 131 Bytes
Size of remote file: 405 kB

assets/result_efficiency.png ADDED

Git LFS Details

SHA256: 4f6161912e2aa703e0ef1bdccbb85039529b97e759d6247c33afa2a209806ede
Pointer size: 131 Bytes
Size of remote file: 801 kB

assets/teaser.png ADDED

Git LFS Details

SHA256: 6c94aa7b0c6cf8fb739724d0c1ce45749c76443c592eeab94d7cbb9083c6c6b1
Pointer size: 131 Bytes
Size of remote file: 581 kB

chat_template.jinja ADDED

	@@ -0,0 +1,227 @@

+{% macro render_extra_keys(json_dict, handled_keys) %}
+    {%- if json_dict is mapping %}
+        {%- for json_key in json_dict if json_key not in handled_keys %}
+            {%- if json_dict[json_key] is mapping or (json_dict[json_key] is sequence and json_dict[json_key] is not string) %}
+                {{- '\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | tojson | safe) ~ '</' ~ json_key ~ '>' }}
+            {%- else %}
+                {{-'\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | string) ~ '</' ~ json_key ~ '>' }}
+            {%- endif %}
+        {%- endfor %}
+    {%- endif %}
+{% endmacro %}
+{%- set enable_thinking = enable_thinking if enable_thinking is defined else True %}
+{%- set truncate_history_thinking = truncate_history_thinking if truncate_history_thinking is defined else True %}
+{%- set ns = namespace(last_user_idx = -1) %}
+{%- set loop_messages = messages %}
+{%- for m in loop_messages %}
+  {%- if m["role"] == "user" %}
+    {%- set ns.last_user_idx = loop.index0 %}
+  {%- endif %}
+{%- endfor %}
+{%- if messages[0]["role"] == "system" %}
+    {%- set system_message = messages[0]["content"] %}
+    {%- set loop_messages = messages[1:] %}
+{%- else %}
+    {%- set system_message = "" %}
+    {%- set loop_messages = messages %}
+{%- endif %}
+{%- if not tools is defined %}
+    {%- set tools = [] %}
+{%- endif %}
+{# Recompute last_user_idx relative to loop_messages after handling system #}
+{%- set ns = namespace(last_user_idx = -1) %}
+{%- for m in loop_messages %}
+  {%- if m["role"] == "user" %}
+    {%- set ns.last_user_idx = loop.index0 %}
+  {%- endif %}
+{%- endfor %}
+{%- if system_message is defined %}
+    {{- "<|im_start|>system\n" + system_message }}
+{%- else %}
+    {%- if tools is iterable and tools | length > 0 %}
+        {{- "<|im_start|>system\n" }}
+    {%- endif %}
+{%- endif %}
+{%- if tools is iterable and tools | length > 0 %}
+    {%- if system_message is defined and system_message | length > 0 %}
+        {{- "\n\n" }}
+    {%- endif %}
+    {{- "# Tools\n\nYou have access to the following functions:\n\n" }}
+    {{- "<tools>" }}
+    {%- for tool in tools %}
+        {%- if tool.function is defined %}
+            {%- set tool = tool.function %}
+        {%- endif %}
+        {{- "\n<function>\n<name>" ~ tool.name ~ "</name>" }}
+        {%- if tool.description is defined %}
+            {{- '\n<description>' ~ (tool.description | trim) ~ '</description>' }}
+        {%- endif %}
+        {{- '\n<parameters>' }}
+        {%- if tool.parameters is defined and tool.parameters is mapping and tool.parameters.properties is defined and tool.parameters.properties is mapping %}
+            {%- for param_name, param_fields in tool.parameters.properties|items %}
+                {{- '\n<parameter>' }}
+                {{- '\n<name>' ~ param_name ~ '</name>' }}
+                {%- if param_fields.type is defined %}
+                    {{- '\n<type>' ~ (param_fields.type | string) ~ '</type>' }}
+                {%- endif %}
+                {%- if param_fields.description is defined %}
+                    {{- '\n<description>' ~ (param_fields.description | trim) ~ '</description>' }}
+                {%- endif %}
+                {%- if param_fields.enum is defined %}
+                    {{- '\n<enum>' ~ (param_fields.enum | tojson | safe) ~ '</enum>' }}
+                {%- endif %}
+                {%- set handled_keys = ['name', 'type', 'description', 'enum'] %}
+                {{- render_extra_keys(param_fields, handled_keys) }}
+                {{- '\n</parameter>' }}
+            {%- endfor %}
+        {%- endif %}
+        {% set handled_keys = ['type', 'properties', 'required'] %}
+        {{- render_extra_keys(tool.parameters, handled_keys) }}
+        {%- if tool.parameters is defined and tool.parameters.required is defined %}
+            {{- '\n<required>' ~ (tool.parameters.required | tojson | safe) ~ '</required>' }}
+        {%- endif %}
+        {{- '\n</parameters>' }}
+        {%- set handled_keys = ['type', 'name', 'description', 'parameters'] %}
+        {{- render_extra_keys(tool, handled_keys) }}
+        {{- '\n</function>' }}
+    {%- endfor %}
+    {{- "\n</tools>" }}
+    {{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
+{%- endif %}
+{%- if system_message is defined %}
+    {{- '<|im_end|>\n' }}
+{%- else %}
+    {%- if tools is iterable and tools | length > 0 %}
+        {{- '<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- for message in loop_messages %}
+    {%- if message.role == "assistant" %}
+        {# Merge reasoning content into the content field for unified processing below. #}
+        {%- if message.reasoning_content is defined and message.reasoning_content is string and message.reasoning_content | trim | length > 0 %}
+            {%- set msg_content = message.content | default('', true) %}
+            {%- if msg_content is not string -%}
+                {%- set ns_rc = namespace(text='') -%}
+                {%- for block in msg_content -%}
+                    {%- if block.type is defined and block.type == "text" -%}
+                        {%- set ns_rc.text = ns_rc.text + block.text -%}
+                    {%- endif -%}
+                {%- endfor -%}
+                {%- set msg_content = ns_rc.text -%}
+            {%- endif -%}
+            {%- set content = "<think>\n" ~ message.reasoning_content ~ "\n</think>\n" ~ msg_content %}
+        {%- else %}
+            {%- set content = message.content | default('', true) %}
+            {%- if content is not string -%}
+                {%- set ns_c = namespace(text='') -%}
+                {%- for block in content -%}
+                    {%- if block.type is defined and block.type == "text" -%}
+                        {%- set ns_c.text = ns_c.text + block.text -%}
+                    {%- endif -%}
+                {%- endfor -%}
+                {%- set content = ns_c.text -%}
+            {%- endif -%}
+            {%- if '<think>' not in content and '</think>' not in content -%}
+                {%- set content = "<think></think>" ~ content -%}
+            {%- endif -%}
+        {%- endif %}
+        {%- if message.tool_calls is defined and message.tool_calls is iterable and message.tool_calls | length > 0 %}
+            {# Assistant message has tool calls. #}
+            {{- '<|im_start|>assistant\n' }}
+                {%- set include_content = not (truncate_history_thinking and loop.index0 < ns.last_user_idx) %}
+                {%- if content is string and content | trim | length > 0 %}
+                    {%- if include_content %}
+                        {{- (content | trim) ~ '\n' -}}
+                    {%- else %}
+                        {%- set c = (content | string) %}
+                        {%- if '</think>' in c %}
+                            {# Keep only content after the last closing think. Also generation prompt causes this. #}
+                            {%- set c = c.split('</think>')[-1] %}
+                        {%- elif '<think>' in c %}
+                            {# If <think> was opened but never closed, drop the trailing think segment #}
+                            {%- set c = c.split('<think>')[0] %}
+                        {%- endif %}
+                        {%- set c = "<think></think>" ~ c | trim %}
+                        {%- if c | length > 0 %}
+                            {{- c ~ '\n' -}}
+                        {%- endif %}
+                    {%- endif %}
+                {%- else %}
+                    {{- "<think></think>" -}}
+                {%- endif %}
+                {%- for tool_call in message.tool_calls %}
+                    {%- if tool_call.function is defined %}
+                        {%- set tool_call = tool_call.function %}
+                    {%- endif %}
+                    {{- '<tool_call>\n<function=' ~ tool_call.name ~ '>\n' -}}
+                        {%- if tool_call.arguments is defined %}
+                            {%- for args_name, args_value in tool_call.arguments|items %}
+                                {{- '<parameter=' ~ args_name ~ '>\n' -}}
+                                    {%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
+                                {{- args_value ~ '\n</parameter>\n' -}}
+                            {%- endfor %}
+                        {%- endif %}
+                    {{- '</function>\n</tool_call>\n' -}}
+                {%- endfor %}
+                {{- '<|im_end|>\n' }}
+        {%- else %}
+            {# Assistant message doesn't have tool calls. #}
+            {%- if not (truncate_history_thinking and loop.index0 < ns.last_user_idx) %}
+                {{- '<|im_start|>assistant\n' ~ (content | default('', true) | string | trim) ~ '<|im_end|>\n' }}
+            {%- else %}
+                {%- set c = (content | default('', true) | string) %}
+                {%- if '<think>' in c and '</think>' in c %}
+                    {%- set c = "<think></think>" ~ c.split('</think>')[-1] %}
+                {%- endif %}
+                {%- set c = c | trim %}
+                {%- if c | length > 0 %}
+                    {{- '<|im_start|>assistant\n' ~ c ~ '<|im_end|>\n' }}
+                {%- else %}
+                    {{- '<|im_start|>assistant\n<|im_end|>\n' }}
+                {%- endif %}
+            {%- endif %}
+        {%- endif %}
+    {%- elif message.role == "user" or message.role == "system" %}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for block in message.content %}
+                {%- if block.type == "text" %}
+                    {{- block.text }}
+                {%- elif block.type in ["image", "image_url"] %}
+                    {{- '<|image_start|>' }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.previtem and loop.previtem.role != "tool" %}
+            {{- '<|im_start|>user\n' }}
+        {%- endif %}
+        {{- '<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>\n' }}
+        {%- if not loop.last and loop.nextitem.role != "tool" %}
+            {{- '<|im_end|>\n' }}
+        {%- elif loop.last %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- else %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>\n' }}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {%- if enable_thinking %}
+        {{- '<|im_start|>assistant\n<think>\n' }}
+    {%- else %}
+        {{- '<|im_start|>assistant\n<think></think>' }}
+    {%- endif %}
+{%- endif %}
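
For a plain conversation of user turns with no system message and no tools, the template above reduces to a few string concatenations. The following is a simplified pure-Python sketch of that path only, based on a reading of the template rather than on running it through Jinja; `render_user_turns` is a hypothetical name. It assumes the template emits an empty system header when no system message is given, since `system_message` is always set before the `is defined` check.

```python
def render_user_turns(messages, enable_thinking=True):
    """Approximate the chat template for user-only conversations without tools."""
    # Assumption from reading the template: an empty system turn is always emitted.
    parts = ["<|im_start|>system\n<|im_end|>\n"]
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n")
        content = message["content"]
        if isinstance(content, str):
            parts.append(content)
        else:
            for block in content:
                if block["type"] == "text":
                    parts.append(block["text"])
                elif block["type"] in ("image", "image_url"):
                    # Each image block becomes a single placeholder token.
                    parts.append("<|image_start|>")
        parts.append("<|im_end|>\n")
    if enable_thinking:
        parts.append("<|im_start|>assistant\n<think>\n")
    else:
        parts.append("<|im_start|>assistant\n<think></think>")
    return "".join(parts)


messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "path/to/your/image.jpg"}},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = render_user_turns(messages)
print(prompt)
```

Passing `enable_thinking=False` swaps the open `<think>` tag for an empty `<think></think>` pair, which matches the last branch of the template.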

chat_utils.py ADDED

	@@ -0,0 +1,272 @@

+import numpy as np
+import torch
+import torch.nn.functional as F
+def add_gumbel_noise(logits, temperature):
+    '''
+    The Gumbel max is a method for sampling categorical distributions.
+    According to arXiv:2409.02908, for MDM, low-precision Gumbel Max improves perplexity score but reduces generation quality.
+    Thus, we use float64.
+    '''
+    if temperature == 0:
+        return logits
+    logits = logits.to(torch.float64)
+    noise = torch.rand_like(logits, dtype=torch.float64)
+    gumbel_noise = (- torch.log(noise)) ** temperature
+    return logits.exp() / gumbel_noise
+def get_transfer_index(logits, temperature, remasking, mask_index, x, num_transfer_tokens, threshold=None, neg_entropy=False):
+    logits_with_noise = add_gumbel_noise(logits, temperature=temperature)
+    x0 = torch.argmax(logits_with_noise, dim=-1)
+    if remasking == 'low_confidence':
+        # p = F.softmax(logits.to(torch.float64), dim=-1)
+        p = F.softmax(logits, dim=-1)
+        x0_p = torch.squeeze(
+            torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1) # b, l
+    elif remasking == 'top_p_margin':
+        # Compute probabilities
+        p = F.softmax(logits, dim=-1)                       # (B, L, V)
+        # Top-2 per position
+        top2 = torch.topk(p, k=2, dim=-1).values            # (B, L, 2)
+        margin = top2[..., 0] - top2[..., 1]                # (B, L)
+        # Normalize margin to [0,1] over MASKED positions per row
+        plus_inf  = torch.full_like(margin, float('inf'))
+        minus_inf = torch.full_like(margin, float('-inf'))
+        masked_for_min = torch.where(mask_index, margin, plus_inf)
+        masked_for_max = torch.where(mask_index, margin, minus_inf)
+        row_min = masked_for_min.amin(dim=1, keepdim=True)  # (B, 1)
+        row_max = masked_for_max.amax(dim=1, keepdim=True)  # (B, 1)
+        denom = (row_max - row_min)
+        # If denom==0 (all equal), set normalized=1 on masked; 0 elsewhere by default
+        normalized = torch.zeros_like(margin)
+        nonzero = denom > 0
+        normalized = torch.where(
+            mask_index & nonzero,
+            (margin - row_min) / (denom + 1e-12),
+            normalized
+        )
+        normalized = torch.where(
+            mask_index & (~nonzero),
+            torch.ones_like(normalized),
+            normalized
+        )
+        x0_p = normalized  # ∈ [0,1] on masked positions
+    elif remasking == 'random':
+        x0_p = torch.rand((x0.shape[0], x0.shape[1]), device=x0.device)
+    else:
+        raise NotImplementedError(remasking)
+    # Calculate negative entropy if requested
+    if neg_entropy:
+        # p = F.softmax(logits.to(torch.float64), dim=-1)
+        p = F.softmax(logits, dim=-1)
+        epsilon = 1e-10
+        log_probs = torch.log(p + epsilon)
+        confidence_scores = torch.sum(p * log_probs, dim=-1)  # negative entropy per position
+    else:
+        confidence_scores = x0_p
+    x0 = torch.where(mask_index, x0, x)
+    confidence = torch.where(mask_index, confidence_scores, -np.inf)
+    transfer_index = torch.zeros_like(x0, dtype=torch.bool, device=x0.device)
+    if threshold is not None:
+        num_transfer_tokens = mask_index.sum(dim=1, keepdim=True)
+    # print(f'confidence: {confidence}')
+    for j in range(confidence.shape[0]):
+        _, select_index = torch.topk(confidence[j], k=num_transfer_tokens[j])
+        transfer_index[j, select_index] = True
+        if threshold is not None:
+            for k in range(1, num_transfer_tokens[j]):
+                if confidence[j, select_index[k]] < threshold:
+                    transfer_index[j, select_index[k]] = False
+    return x0, transfer_index
+def get_num_transfer_tokens(mask_index, steps: int):
+    mask_num = mask_index.sum(dim=1, keepdim=True)
+    base = mask_num // steps
+    remainder = mask_num % steps
+    num_transfer_tokens = torch.zeros(mask_num.size(0), steps, device=mask_index.device, dtype=torch.int64) + base
+    for i in range(mask_num.size(0)):
+        num_transfer_tokens[i, : int(remainder[i])] += 1
+    return num_transfer_tokens
+@torch.no_grad()
+def generate_with_prefix_cache_block_diff(
+    model,
+    prompt,
+    steps=128,
+    gen_length=128,
+    block_length=128,
+    temperature=0.,
+    remasking='low_confidence',
+    mask_id=126336,
+    threshold=None,
+    factor=None,
+    shift_logits=False,
+    neg_entropy=False,
+    causal_context=False,
+    pixel_values=None,
+    image_sizes=None,
+    eos_token_id=None,
+):
+    dream_style=shift_logits
+    x_accum = prompt.clone()
+    assert gen_length % block_length == 0
+    num_blocks = gen_length // block_length
+    assert steps % num_blocks == 0
+    steps_per_block = steps // num_blocks
+    nfe = 0
+    if causal_context:
+        model_module = model.module if hasattr(model, "module") else model
+        for layer in model_module.encoder.layers:
+            if hasattr(layer.self_attn, 'diffusion_lm'):
+                layer.self_attn.diffusion_lm=False
+    # Compute KV cache for the prompt initially
+    # Pass pixel_values/image_sizes only for this first call (prompt contains image tokens)
+    output = model(prompt, use_cache=True, use_causal_mask=causal_context,
+                   pixel_values=pixel_values, image_sizes=image_sizes)
+    past_key_values = output.past_key_values
+    if causal_context:
+        for layer in model_module.encoder.layers:
+            if hasattr(layer.self_attn, 'diffusion_lm'):
+                layer.self_attn.diffusion_lm=True
+    # For dream_style: store the "next token logit" of the context
+    next_logits_context = None
+    if dream_style:
+        next_logits_context = output.logits[:, -1:, :]  # (B, 1, V)
+    for num_block in range(num_blocks):
+        # Create a new block with mask tokens (no seeding)
+        mask_block = torch.ones(
+            (prompt.shape[0], block_length),
+            dtype=prompt.dtype,
+            device=prompt.device
+        ) * mask_id
+        # Append the block of masks
+        x_accum = torch.cat([x_accum, mask_block], dim=1)
+        current_block_start = prompt.size(1) + num_block * block_length
+        block_slice = slice(current_block_start, current_block_start + block_length)
+        # Build the initial mask for this block
+        mask_block_idx0 = (x_accum[:, block_slice] == mask_id)  # (B, Lb)
+        # Precompute the transfer schedule for this block.
+        # Both modes denoise *all* positions (0..Lb-1), since none are seeded.
+        schedule_mask = mask_block_idx0
+        num_transfer_tokens = get_num_transfer_tokens(schedule_mask, steps_per_block)  # (B, steps)
+        # Denoise the current block
+        for i in range(steps_per_block):
+            mask_block_idx = (x_accum[:, block_slice] == mask_id)  # (B, Lb)
+            if mask_block_idx.sum() == 0:
+                break
+            nfe += 1
+            # Forward only the current noisy block using cached context
+            logits_block = model(
+                x_accum[:, block_slice],
+                past_key_values=past_key_values,
+                use_cache=False
+            ).logits
+            if dream_style:
+                # Align logits so that each masked position has a predictor:
+                # prepend context-next logit, then use logits_block[:-1]
+                if block_length == 1:
+                    logits_use = next_logits_context              # (B, 1, V)
+                else:
+                    logits_use = torch.cat(
+                        [next_logits_context, logits_block[:, :-1, :]],
+                        dim=1
+                    )  # (B, Lb, V)
+                mask_use = mask_block_idx                        # (B, Lb)
+                x_use   = x_accum[:, block_slice]                # (B, Lb)
+                x0, transfer_idx = get_transfer_index(
+                    logits_use, temperature, remasking, mask_use, x_use,
+                    num_transfer_tokens=num_transfer_tokens[:, i],
+                    threshold=threshold, neg_entropy=neg_entropy
+                )
+                cur = x_accum[:, block_slice].clone()
+                cur[transfer_idx] = x0[transfer_idx]
+                x_accum[:, block_slice] = cur
+            else:
+                # non-AR (same-position) case
+                x0, transfer_idx = get_transfer_index(
+                    logits_block, temperature, remasking, mask_block_idx,
+                    x_accum[:, block_slice],
+                    num_transfer_tokens=num_transfer_tokens[:, i],
+                    threshold=threshold, neg_entropy=neg_entropy
+                )
+                cur = x_accum[:, block_slice].clone()
+                cur[transfer_idx] = x0[transfer_idx]
+                x_accum[:, block_slice] = cur
+            if eos_token_id is not None:
+                block_tokens = x_accum[:, block_slice]              # (B, Lb)
+                eos_mask = (block_tokens == eos_token_id)           # (B, Lb)
+                any_eos = eos_mask.any(dim=1)                       # (B,)
+                if any_eos.any():
+                    after_eos = eos_mask.cumsum(dim=1).bool()       # (B, Lb)
+                    mask_before = (block_tokens == mask_id) & ~after_eos
+                    if (any_eos & ~mask_before.any(dim=1)).any():
+                        break
+        if causal_context:
+            for layer in model_module.encoder.layers:
+                if hasattr(layer.self_attn, 'diffusion_lm'):
+                    layer.self_attn.diffusion_lm=False
+        # after block is fully denoised, update KV cache
+        output = model(
+            x_accum[:, block_slice],
+            past_key_values=past_key_values,
+            use_cache=True,
+            use_causal_mask=causal_context
+        )
+        past_key_values = output.past_key_values
+        if causal_context:
+            for layer in model_module.encoder.layers:
+                if hasattr(layer.self_attn, 'diffusion_lm'):
+                    layer.self_attn.diffusion_lm=True
+        if dream_style and num_block < num_blocks - 1:
+            # refresh context-next logit for the next block
+            next_logits_context = output.logits[:, -1:, :]  # (B, 1, V)
+        if eos_token_id is not None:
+            gen_so_far = x_accum[:, prompt.size(1):]                    # (B, gen_len_so_far)
+            is_eos = (gen_so_far == eos_token_id)                       # (B, gen_len_so_far)
+            has_eos = is_eos.any(dim=1)                                 # (B,)
+            if has_eos.all():
+                return x_accum, nfe
+                # first_eos_pos = is_eos.to(torch.int64).argmax(dim=1)    # (B,)
+                # max_eos = first_eos_pos.max().item()
+                # return x_accum[:, : prompt.size(1) + max_eos + 1], nfe
+    return x_accum, nfe
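The `dream_style` branch above pairs each block position with the logits produced one position earlier: position 0 is predicted by the last context position, and position j > 0 by block position j - 1. A minimal pure-Python sketch of that alignment follows, using plain lists in place of tensors. The helper name `align_shifted_logits` is hypothetical and does not appear in the repository.

```python
from typing import List, Sequence


def align_shifted_logits(
    context_next: Sequence[float],
    block_logits: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Return one predictor row per block position under shifted (next-token) prediction.

    context_next: logits emitted at the last context position (predicts block[0]).
    block_logits: logits emitted at each block position j (predicts block[j + 1]).
    """
    # The last row of block_logits predicts the token *after* the block, so it is dropped.
    return [list(context_next)] + [list(row) for row in block_logits[:-1]]


ctx = [0.1, 0.9]                               # stand-in for next_logits_context
blk = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]     # stand-in for logits_block, Lb = 3
aligned = align_shifted_logits(ctx, blk)
single = align_shifted_logits(ctx, [[1.0, 0.0]])  # block_length == 1 case
```

With a block of length 1 the slice is empty and only the context logit remains, which matches the special case in the loop.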

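The early-exit test inside the denoising loop stops refining a block once some row contains an EOS token with no mask token left before it; masks after the first EOS are ignored. The sketch below restates that condition on plain Python lists. The function names are hypothetical, and the token ids 100 and 11 are taken from `mask_token_id` and `eos_token_id` in config.json.

```python
from typing import List, Sequence

MASK_ID = 100  # mask_token_id in config.json
EOS_ID = 11    # eos_token_id in config.json


def row_can_stop(row: Sequence[int], mask_id: int = MASK_ID, eos_id: int = EOS_ID) -> bool:
    """True when the row has an EOS and no mask token before the first EOS."""
    row = list(row)
    if eos_id not in row:
        return False
    first_eos = row.index(eos_id)
    return mask_id not in row[:first_eos]


def batch_can_stop(batch: List[Sequence[int]]) -> bool:
    # Mirrors the `.any()` over the batch dimension: one finished row is enough.
    return any(row_can_stop(row) for row in batch)


finished = row_can_stop([5, 7, EOS_ID, MASK_ID])          # masks after EOS do not matter
unfinished = row_can_stop([5, MASK_ID, EOS_ID, MASK_ID])  # a mask precedes the EOS
no_eos = row_can_stop([5, 7, 8, 9])
mixed = batch_can_stop([[5, MASK_ID, EOS_ID, MASK_ID], [5, 7, EOS_ID, MASK_ID]])
```

Because the check uses "any row", a batch with more than one sequence appears to leave the inner loop as soon as a single row is finished, even if other rows still hold mask tokens in the current block.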
config.json ADDED

	@@ -0,0 +1,106 @@

+{
+  "ada_dlm_loss_ratio": null,
+  "ada_perm_ratio_global": null,
+  "ada_perm_ratio_per_block": null,
+  "adaptive_mask_rate": false,
+  "always_mask_im_end": false,
+  "ar_loss_weight": 1.0,
+  "architectures": [
+    "NemotronLabsDiffusionVLMModel"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "attn_implementation": null,
+  "auto_map": {
+    "AutoConfig": "configuration_nemotron_labs_diffusion_vlm.NemotronLabsDiffusionVLMConfig",
+    "AutoModel": "modeling_nemotron_labs_diffusion_vlm.NemotronLabsDiffusionVLMModel",
+    "AutoModelForCausalLM": "modeling_nemotron_labs_diffusion_vlm.NemotronLabsDiffusionVLMModel"
+  },
+  "block_size": 32,
+  "bos_token_id": 1,
+  "complementary_mask": true,
+  "diff_loss_weight": 0.5,
+  "dlm_arch": "encoder",
+  "dlm_loss_weight": 0.5,
+  "dlm_paradigm": "bidirectional",
+  "dlm_type": "llada",
+  "dp_varying_mask_ratio": false,
+  "dtype": "bfloat16",
+  "enforce_mask": false,
+  "eos_token_id": 11,
+  "global_loss_avg": true,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 4096,
+  "im_end_token_id": 11,
+  "initializer_range": 0.02,
+  "intermediate_size": 14336,
+  "mask_token_id": 100,
+  "max_position_embeddings": 262144,
+  "mlp_bias": false,
+  "model_type": "nemotron_labs_diffusion_vlm",
+  "multi_sampling": null,
+  "multimodal_projector_bias": false,
+  "num_ar_layers": 0,
+  "num_attention_heads": 32,
+  "num_diffusion_layers": 0,
+  "num_hidden_layers": 34,
+  "num_key_value_heads": 8,
+  "num_skip_loss_tokens": 0,
+  "pad_token_id": 11,
+  "prefix_ratio": 0.8,
+  "projector_hidden_act": "gelu",
+  "random_length_prob": 0,
+  "rms_norm_eps": 1e-05,
+  "rope_parameters": {
+    "beta_fast": 32.0,
+    "beta_slow": 1.0,
+    "factor": 16.0,
+    "llama_4_scaling_beta": 0.1,
+    "mscale": 1.0,
+    "mscale_all_dim": 1.0,
+    "original_max_position_embeddings": 16384,
+    "rope_theta": 1000000.0,
+    "rope_type": "yarn",
+    "type": "yarn"
+  },
+  "rope_scaling": {
+    "beta_fast": 32.0,
+    "beta_slow": 1.0,
+    "factor": 16.0,
+    "llama_4_scaling_beta": 0.1,
+    "mscale": 1.0,
+    "mscale_all_dim": 1.0,
+    "original_max_position_embeddings": 16384,
+    "rope_theta": 1000000.0,
+    "rope_type": "yarn",
+    "type": "yarn"
+  },
+  "rope_theta": 1000000.0,
+  "sliding_window": null,
+  "spatial_merge_size": 2,
+  "tie_word_embeddings": false,
+  "tok_mask_half_life_ratio": null,
+  "transformers_version": "4.57.1",
+  "use_cache": false,
+  "vision_config": {
+    "attention_dropout": 0.0,
+    "head_dim": 64,
+    "hidden_act": "silu",
+    "hidden_size": 1024,
+    "image_size": 1540,
+    "initializer_range": 0.02,
+    "intermediate_size": 4096,
+    "model_type": "pixtral",
+    "num_attention_heads": 16,
+    "num_channels": 3,
+    "num_hidden_layers": 24,
+    "patch_size": 14,
+    "rope_parameters": {
+      "rope_theta": 10000.0,
+      "rope_type": "default"
+    }
+  },
+  "vision_feature_layer": -1,
+  "vocab_size": 131073
+}
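Several values in this config are tied together by simple arithmetic. The sketch below recomputes them from the numbers shown above; the dictionary is a hand-copied subset of config.json, and its flattened key names (`rope_factor`, `vision_image_size`, `vision_patch_size`) are chosen for illustration only.

```python
# Hand-copied subset of config.json; flattened key names are illustrative.
cfg = {
    "hidden_size": 4096,
    "head_dim": 128,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "max_position_embeddings": 262144,
    "rope_factor": 16.0,
    "original_max_position_embeddings": 16384,
    "vision_image_size": 1540,
    "vision_patch_size": 14,
    "spatial_merge_size": 2,
}

# Grouped-query attention: query and key/value projection widths.
q_dim = cfg["num_attention_heads"] * cfg["head_dim"]
kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]
queries_per_kv_head = cfg["num_attention_heads"] // cfg["num_key_value_heads"]

# YaRN: extended context = scaling factor * original context length.
yarn_context = int(cfg["rope_factor"] * cfg["original_max_position_embeddings"])

# Vision tower: patches per side, then tokens per side after spatial merging.
patches_per_side = cfg["vision_image_size"] // cfg["vision_patch_size"]
merged_per_side = patches_per_side // cfg["spatial_merge_size"]
pixels_per_token_side = cfg["vision_patch_size"] * cfg["spatial_merge_size"]
```

Here `q_dim` equals `hidden_size`, each key/value head serves four query heads, and the YaRN product reproduces `max_position_embeddings`.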

configuration_nemotron_labs_diffusion_vlm.py ADDED

	@@ -0,0 +1,259 @@

+# coding=utf-8
+# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Nemotron-Labs Diffusion VLM model configuration"""
+from transformers.configuration_utils import PretrainedConfig
+from transformers.modeling_rope_utils import rope_config_validation
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+class NemotronLabsDiffusionVLMConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`NemotronLabsDiffusionVLMModel`], a
+    Ministral-based diffusion language model. It is used to instantiate the model according to the specified
+    arguments, defining the model architecture.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 131072):
+            Vocabulary size of the Ministral model.
+        hidden_size (`int`, *optional*, defaults to 4096):
+            Dimension of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 14336):
+            Dimension of the MLP representations.
+        num_hidden_layers (`int`, *optional*, defaults to 34):
+            Number of hidden layers in the Transformer decoder.
+        num_attention_heads (`int`, *optional*, defaults to 32):
+            Number of attention heads for each attention layer.
+        num_key_value_heads (`int`, *optional*, defaults to 8):
+            Number of key_value heads for Grouped Query Attention.
+        head_dim (`int`, *optional*, defaults to 128):
+            The attention head dimension.
+        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+            The non-linear activation function.
+        max_position_embeddings (`int`, *optional*, defaults to 262144):
+            The maximum sequence length.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer.
+        rms_norm_eps (`float`, *optional*, defaults to 1e-05):
+            The epsilon used by the rms normalization layers.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions.
+        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+            Whether the model's input and output word embeddings should be tied.
+        rope_theta (`float`, *optional*, defaults to 1000000.0):
+            The base period of the RoPE embeddings.
+        rope_parameters (`Dict`, *optional*):
+            Dictionary containing the scaling configuration for the RoPE embeddings.
+            Default uses YaRN scaling with factor=16, original_max_position_embeddings=16384.
+        attention_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use a bias in the query, key, value and output projection layers.
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+        mlp_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use a bias in up_proj, down_proj and gate_proj layers.
+        sliding_window (`int`, *optional*, defaults to None):
+            Sliding window attention size.
+        mask_token_id (`int`, *optional*, defaults to -1):
+            Token ID for masking in diffusion.
+        dlm_type (`str`, *optional*, defaults to 'llada'):
+            Type of diffusion language model ('llada', 'dream').
+        random_length_prob (`float`, *optional*):
+            Probability of using random lengths during training.
+        num_ar_layers (`int`, *optional*, defaults to 0):
+            Number of autoregressive layers.
+        num_diffusion_layers (`int`, *optional*, defaults to 0):
+            Number of diffusion layers.
+        diff_loss_weight (`float`, *optional*, defaults to 1):
+            Weight for diffusion loss.
+        enforce_mask (`bool`, *optional*, defaults to False):
+            Whether to enforce masking.
+        prefix_ratio (`float`, *optional*, defaults to 0.8):
+            Ratio for prefix in prefix_bidirectional mode.
+        dlm_paradigm (`str`, *optional*, defaults to 'bidirectional'):
+            Paradigm for diffusion ('bidirectional', 'autoregressive', 'prefix_bidirectional', 'efficient_block_diff', 'block_diff', 'sbd_block_diff').
+        dlm_arch (`str`, *optional*, defaults to 'encoder'):
+            Architecture type ('encoder', 'encoder_decoder').
+        block_size (`int`, *optional*, defaults to 32):
+            Block size for block diffusion paradigms.
+        tok_mask_half_life_ratio (`float`, *optional*):
+            Half-life ratio for token masking.
+        adaptive_mask_rate (`bool`, *optional*, defaults to False):
+            Whether to use adaptive mask rate.
+        multi_sampling (`int`, *optional*):
+            Number of samples for multi-sampling.
+        num_skip_loss_tokens (`int`, *optional*, defaults to 0):
+            Number of tokens to skip in loss calculation.
+        dlm_loss_weight (`float`, *optional*):
+            Weight for diffusion LM loss.
+        ar_loss_weight (`float`, *optional*, defaults to 1.0):
+            Weight for autoregressive loss in sbd_block_diff paradigm. Use 10000 to only use AR loss.
+        global_loss_avg (`bool`, *optional*, defaults to False):
+            Whether to use global loss average.
+        dp_varying_mask_ratio (`bool`, *optional*, defaults to False):
+            Whether to use varying mask ratio for each DP rank during sampling.
+        ada_perm_ratio_per_block (`float`, *optional*):
+            Adaptive permutation ratio for each block.
+        ada_perm_ratio_global (`float`, *optional*):
+            Global adaptive permutation ratio.
+        complementary_mask (`bool`, *optional*, defaults to False):
+            Whether to use complementary masking (mask + inverse mask).
+        always_mask_im_end (`bool`, *optional*, defaults to False):
+            Whether to always mask im_end tokens.
+        im_end_token_id (`int`, *optional*, defaults to 11):
+            Token ID for im_end in always_mask_im_end.
+    """
+    model_type = "nemotron_labs_diffusion_vlm"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    # Default tensor parallel plan for base model `Ministral`
+    base_model_tp_plan = {
+        "layers.*.self_attn.q_proj": "colwise",
+        "layers.*.self_attn.k_proj": "colwise",
+        "layers.*.self_attn.v_proj": "colwise",
+        "layers.*.self_attn.o_proj": "rowwise",
+        "layers.*.mlp.gate_proj": "colwise",
+        "layers.*.mlp.up_proj": "colwise",
+        "layers.*.mlp.down_proj": "rowwise",
+    }
+    base_model_pp_plan = {
+        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
+        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
+        "norm": (["hidden_states"], ["hidden_states"]),
+    }
+    def __init__(
+        self,
+        vocab_size=131072,
+        hidden_size=4096,
+        intermediate_size=14336,
+        num_hidden_layers=34,
+        num_attention_heads=32,
+        num_key_value_heads=8,
+        head_dim=128,
+        hidden_act="silu",
+        max_position_embeddings=262144,
+        initializer_range=0.02,
+        rms_norm_eps=1e-05,
+        use_cache=True,
+        pad_token_id=None,
+        bos_token_id=1,
+        eos_token_id=2,
+        tie_word_embeddings=False,
+        rope_theta=1000000.0,
+        rope_parameters=None,
+        rope_scaling=None,
+        attention_bias=False,
+        attention_dropout=0.0,
+        mlp_bias=False,
+        sliding_window=None,
+        attn_implementation="sdpa",
+        mask_token_id=-1,
+        dlm_type='llada',
+        random_length_prob=None,
+        num_ar_layers=0,
+        num_diffusion_layers=0,
+        diff_loss_weight=1,
+        enforce_mask=False,
+        prefix_ratio=0.8,
+        dlm_paradigm='bidirectional',
+        dlm_arch='encoder',
+        block_size=32,
+        tok_mask_half_life_ratio=None,
+        adaptive_mask_rate=False,
+        multi_sampling=None,
+        num_skip_loss_tokens=0,
+        dlm_loss_weight=None,
+        ar_loss_weight=1.0,
+        global_loss_avg=False,
+        dp_varying_mask_ratio=False,
+        ada_perm_ratio_per_block=None,
+        ada_perm_ratio_global=None,
+        ada_dlm_loss_ratio=None,
+        complementary_mask=False,
+        always_mask_im_end=False,
+        im_end_token_id=11,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        # for backward compatibility
+        if num_key_value_heads is None:
+            num_key_value_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.head_dim = head_dim
+        self.hidden_act = hidden_act
+        self.initializer_range = initializer_range
+        self.rms_norm_eps = rms_norm_eps
+        self.use_cache = use_cache
+        self.rope_theta = rope_theta
+        self.rope_parameters = rope_parameters
+        self.rope_scaling = rope_scaling
+        self.attention_bias = attention_bias
+        self.attention_dropout = attention_dropout
+        self.mlp_bias = mlp_bias
+        self.sliding_window = sliding_window
+        rope_config_validation(self)
+        self.attn_implementation = attn_implementation
+        self.mask_token_id = mask_token_id
+        self.dlm_type = dlm_type
+        self.random_length_prob = random_length_prob
+        self.num_ar_layers = num_ar_layers
+        self.num_diffusion_layers = num_diffusion_layers
+        self.diff_loss_weight = diff_loss_weight
+        self.enforce_mask = enforce_mask
+        self.prefix_ratio = prefix_ratio
+        self.dlm_paradigm = dlm_paradigm
+        self.dlm_arch = dlm_arch
+        self.block_size = block_size
+        self.tok_mask_half_life_ratio = tok_mask_half_life_ratio
+        self.adaptive_mask_rate = adaptive_mask_rate
+        self.multi_sampling = multi_sampling
+        self.num_skip_loss_tokens = num_skip_loss_tokens
+        self.dlm_loss_weight = dlm_loss_weight
+        self.ar_loss_weight = ar_loss_weight
+        self.global_loss_avg = global_loss_avg
+        self.dp_varying_mask_ratio = dp_varying_mask_ratio
+        self.ada_perm_ratio_per_block = ada_perm_ratio_per_block
+        self.ada_perm_ratio_global = ada_perm_ratio_global
+        self.ada_dlm_loss_ratio = ada_dlm_loss_ratio
+        self.complementary_mask = complementary_mask
+        self.always_mask_im_end = always_mask_im_end
+        self.im_end_token_id = im_end_token_id
+        super().__init__(
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
+__all__ = ["NemotronLabsDiffusionVLMConfig"]
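Several `__init__` defaults in this class differ from the values shipped in config.json, so the loaded checkpoint does not behave like a bare `NemotronLabsDiffusionVLMConfig()`. The sketch below compares a hand-copied subset of both sources; the helper `overridden` is hypothetical and the dictionaries are not read from disk.

```python
from typing import Any, Dict, Tuple

# Defaults copied from the __init__ signature above (subset).
class_defaults: Dict[str, Any] = {
    "vocab_size": 131072,
    "eos_token_id": 2,
    "use_cache": True,
    "mask_token_id": -1,
    "diff_loss_weight": 1,
    "complementary_mask": False,
    "block_size": 32,
}

# Values copied from config.json (same subset).
shipped: Dict[str, Any] = {
    "vocab_size": 131073,
    "eos_token_id": 11,
    "use_cache": False,
    "mask_token_id": 100,
    "diff_loss_weight": 0.5,
    "complementary_mask": True,
    "block_size": 32,
}


def overridden(defaults: Dict[str, Any], loaded: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each key whose loaded value differs from the default to (default, loaded)."""
    return {
        key: (defaults[key], loaded[key])
        for key in defaults
        if key in loaded and defaults[key] != loaded[key]
    }


diff = overridden(class_defaults, shipped)
```

The default `mask_token_id=-1` is not a usable token id, so the value 100 from config.json is the one that matters for sampling.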

generation_config.json ADDED

	@@ -0,0 +1,11 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": [
+    11,
+    2
+  ],
+  "pad_token_id": 11,
+  "transformers_version": "4.57.1",
+  "use_cache": false
+}
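This file lists two end-of-sequence ids, 11 and 2, and reuses 11 as the pad id. A caller that post-processes generated ids may therefore want to cut at whichever stop id comes first. The sketch below is one way to do that; `trim_at_stop` is a hypothetical helper, not part of the repository.

```python
from typing import List, Sequence

STOP_IDS = (11, 2)  # eos_token_id list from generation_config.json


def trim_at_stop(tokens: Sequence[int], stop_ids: Sequence[int] = STOP_IDS) -> List[int]:
    """Return the tokens before the first stop id, or all tokens if none occurs."""
    stops = set(stop_ids)
    for i, tok in enumerate(tokens):
        if tok in stops:
            return list(tokens[:i])
    return list(tokens)


cut_early = trim_at_stop([5, 6, 2, 7, 11])
untouched = trim_at_stop([5, 6])
empty = trim_at_stop([11])
```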

image_processing.py ADDED

	@@ -0,0 +1,296 @@

+"""
+Image processing utilities for Nemotron-Diffusion-Exp-Ministral-8B-Instruct (final-template).
+Implements image token expansion and pixel value preprocessing,
+faithfully ported from mistral_common.tokens.tokenizers.image.ImageEncoder
+to ensure identical image sizing and token counts.
+Special token mapping (final-template version):
+    <|image_start|> (id=18) = [IMG_START]   image start marker
+    <|image_pad|>   (id=19) = [IMG]         image pad token (one per merged patch)
+    <|image_break|> (id=20) = [IMG_BREAK]   image row break
+    <|image_end|>   (id=21) = [IMG_END]     image end marker
+After expansion, each image placeholder becomes:
+    [IMG_START] ([IMG]*W [IMG_BREAK]) * (H-1)  [IMG]*W [IMG_END]
+where W = width_tokens, H = height_tokens (computed via ceiling division
+on the original image dims, matching mistral_common exactly).
+"""
+import os
+from io import BytesIO
+from typing import Any, Dict, List, Tuple, Union
+import cv2
+import numpy as np
+import requests
+import torch
+from PIL import Image
+# ── Token strings (must match tokenizer_config.json) ──────────────────────────
+IMG_START_TOKEN = "<|image_start|>"   # id = 18
+IMG_PAD_TOKEN   = "<|image_pad|>"     # id = 19
+IMG_BREAK_TOKEN = "<|image_break|>"   # id = 20
+IMG_END_TOKEN   = "<|image_end|>"     # id = 21
+# ── Token IDs ─────────────────────────────────────────────────────────────────
+IMG_START_ID = 18
+IMG_PAD_ID   = 19
+IMG_BREAK_ID = 20
+IMG_END_ID   = 21
+# ── Default config (from config.json / processor_config.json) ─────────────────
+DEFAULT_PATCH_SIZE         = 14
+DEFAULT_SPATIAL_MERGE_SIZE = 2
+DEFAULT_MAX_IMAGE_SIZE     = 1400   # longest edge
+# Allow override via environment variable (e.g. from run_all_benchmarks.sh)
+_env_max = os.environ.get("DEFAULT_MAX_IMAGE_SIZE")
+if _env_max is not None and str(_env_max).strip():
+    try:
+        DEFAULT_MAX_IMAGE_SIZE = int(_env_max)
+    except ValueError:
+        pass
+DATASET_MEAN = (0.48145466, 0.4578275, 0.40821073)   # RGB
+DATASET_STD  = (0.26862954, 0.26130258, 0.27577711)   # RGB
+# ══════════════════════════════════════════════════════════════════════════════
+# Image loading  (mirrors mistral_common.tokens.tokenizers.image)
+# ══════════════════════════════════════════════════════════════════════════════
+def _convert_to_rgb(image: Image.Image) -> Image.Image:
+    """Convert PIL image to RGB; transparent backgrounds become white."""
+    if image.mode == "RGB":
+        return image
+    if image.mode != "RGBA":
+        image = image.convert("RGBA")
+    white_bg = Image.new("RGBA", image.size, "WHITE")
+    white_bg.paste(image, (0, 0), image)
+    return white_bg.convert("RGB")
+def load_image(source: Union[str, Image.Image]) -> Image.Image:
+    """Load an image from a URL, local file path, or PIL Image."""
+    if isinstance(source, Image.Image):
+        return source
+    if source.startswith(("http://", "https://")):
+        resp = requests.get(source, stream=True, timeout=30)
+        resp.raise_for_status()
+        return Image.open(BytesIO(resp.content))
+    return Image.open(source)
+# ══════════════════════════════════════════════════════════════════════════════
+# Core logic — ported from mistral_common ImageEncoder
+# ══════════════════════════════════════════════════════════════════════════════
+def _image_to_num_tokens(
+    img: Image.Image,
+    image_patch_size: int = DEFAULT_PATCH_SIZE,
+    max_image_size: int = DEFAULT_MAX_IMAGE_SIZE,
+    spatial_merge_size: int = DEFAULT_SPATIAL_MERGE_SIZE,
+) -> Tuple[int, int]:
+    """
+    Compute (width_tokens, height_tokens) for a given image — identical to
+    ``mistral_common.tokens.tokenizers.image.ImageEncoder._image_to_num_tokens``.
+    """
+    w, h = img.size                                     # PIL: (W, H)
+    ratio = max(h / max_image_size, w / max_image_size)
+    if ratio > 1:
+        w = round(w / ratio)
+        h = round(h / ratio)
+    width_tokens  = (w - 1) // (image_patch_size * spatial_merge_size) + 1
+    height_tokens = (h - 1) // (image_patch_size * spatial_merge_size) + 1
+    return width_tokens, height_tokens
+def transform_image(
+    image: Image.Image,
+    new_size: Tuple[int, int],
+    mean: Tuple[float, ...] = DATASET_MEAN,
+    std: Tuple[float, ...] = DATASET_STD,
+) -> np.ndarray:
+    """
+    Resize + normalise — identical to
+    ``mistral_common.tokens.tokenizers.image.transform_image``.
+    Args:
+        image:    PIL Image (any mode).
+        new_size: Target (W, H) — cv2 convention.
+    Returns:
+        np.ndarray of shape (C, H, W), float32, normalised.
+    """
+    np_image = cv2.resize(
+        np.array(_convert_to_rgb(image), dtype=np.float32),
+        new_size,
+        interpolation=cv2.INTER_CUBIC,
+    )
+    np_image = np_image / 255.0
+    np_image = (np_image - np.array(mean, dtype=np.float32)) / np.array(std, dtype=np.float32)
+    return np_image.transpose(2, 0, 1)
+def encode_image(
+    image: Image.Image,
+    image_patch_size: int = DEFAULT_PATCH_SIZE,
+    max_image_size: int = DEFAULT_MAX_IMAGE_SIZE,
+    spatial_merge_size: int = DEFAULT_SPATIAL_MERGE_SIZE,
+) -> Tuple[int, int, np.ndarray]:
+    """
+    Compute token dimensions **and** preprocessed pixel array for one image.
+    Returns:
+        (width_tokens, height_tokens, pixel_array)
+        where pixel_array has shape (C, H, W).
+    """
+    w_tok, h_tok = _image_to_num_tokens(
+        image, image_patch_size, max_image_size, spatial_merge_size,
+    )
+    assert w_tok > 0 and h_tok > 0
+    new_w = w_tok * image_patch_size * spatial_merge_size
+    new_h = h_tok * image_patch_size * spatial_merge_size
+    processed = transform_image(image, (new_w, new_h))   # cv2: (W, H)
+    return w_tok, h_tok, processed
+# ══════════════════════════════════════════════════════════════════════════════
+# Token string expansion
+# ══════════════════════════════════════════════════════════════════════════════
+def build_image_token_str(w_tokens: int, h_tokens: int) -> str:
+    """
+    Build the expanded image-token string for one image.
+    Pattern:
+        [IMG_START]
+          ([IMG]*W  [IMG_BREAK]) * (H-1)
+           [IMG]*W  [IMG_END]
+    """
+    row = IMG_PAD_TOKEN * w_tokens + IMG_BREAK_TOKEN
+    body = row * h_tokens
+    body = body[: -len(IMG_BREAK_TOKEN)] + IMG_END_TOKEN
+    return IMG_START_TOKEN + body
+# ══════════════════════════════════════════════════════════════════════════════
+# Extract image sources from OpenAI-style messages
+# ══════════════════════════════════════════════════════════════════════════════
+def _extract_image_sources(messages: List[Dict[str, Any]]) -> List[str]:
+    """Walk through OpenAI-style messages and collect image URLs / paths."""
+    sources: List[str] = []
+    for msg in messages:
+        content = msg.get("content", "")
+        if not isinstance(content, list):
+            continue
+        for block in content:
+            btype = block.get("type")
+            if btype == "image_url":
+                url_obj = block.get("image_url", {})
+                sources.append(url_obj.get("url", ""))
+            elif btype == "image":
+                for key in ("url", "path", "image"):
+                    if key in block:
+                        sources.append(block[key])
+                        break
+    return sources
+# ══════════════════════════════════════════════════════════════════════════════
+# Public API
+# ══════════════════════════════════════════════════════════════════════════════
+def process_messages(
+    tokenizer,
+    messages: List[Dict[str, Any]],
+    *,
+    patch_size: int         = DEFAULT_PATCH_SIZE,
+    spatial_merge_size: int = DEFAULT_SPATIAL_MERGE_SIZE,
+    max_image_size: int     = DEFAULT_MAX_IMAGE_SIZE,
+    return_tensors: str     = "pt",
+    add_generation_prompt: bool = False,
+    enable_thinking: bool   = True,
+) -> Dict[str, Any]:
+    """
+    Process chat messages with optional images — drop-in replacement for
+    ``MistralCommonBackend.apply_chat_template(return_dict=True)``.
+    Steps:
+        1. Render Jinja chat template  →  prompt with ``<|image_start|>`` placeholders.
+        2. For each image:
+           a. Load image.
+           b. Compute token dims via ceiling division (matching mistral_common).
+           c. Resize to token-aligned dimensions with cv2 INTER_CUBIC.
+           d. Normalise pixels.
+           e. Replace the next ``<|image_start|>`` placeholder with the expanded
+              token sequence.
+        3. Tokenize the expanded prompt.
+        4. Return dict with ``input_ids`` (and ``pixel_values`` / ``image_sizes``
+           if images are present).
+    Args:
+        enable_thinking: When True (default), the generation prompt opens a
+            ``<think>`` block for chain-of-thought reasoning.  When False,
+            an empty ``<think></think>`` is emitted so the model skips
+            the thinking phase.
+    Returns:
+        dict with keys:
+            input_ids    : LongTensor  (1, seq_len)
+            pixel_values : FloatTensor (N, 3, H, W)  – only when images present
+            image_sizes  : list of (H, W) tuples      – only when images present
+    """
+    # ── 1. Extract image sources ──────────────────────────────────────────
+    image_sources = _extract_image_sources(messages)
+    # ── 2. Render chat template (produces <|image_start|> placeholders) ──
+    prompt: str = tokenizer.apply_chat_template(
+        messages,
+        tokenize=False,
+        add_generation_prompt=add_generation_prompt,
+        enable_thinking=enable_thinking,
+    )
+    # ── 3. Expand each placeholder & preprocess images ────────────────────
+    pixel_list: List[np.ndarray] = []
+    image_sizes: List[Tuple[int, int]] = []
+    for src in image_sources:
+        pil_img = load_image(src)
+        w_tok, h_tok, pixels = encode_image(
+            pil_img, patch_size, max_image_size, spatial_merge_size,
+        )
+        expanded = build_image_token_str(w_tok, h_tok)
+        prompt = prompt.replace(IMG_START_TOKEN, expanded, 1)
+        pixel_list.append(pixels)
+        final_h = h_tok * patch_size * spatial_merge_size
+        final_w = w_tok * patch_size * spatial_merge_size
+        image_sizes.append((final_h, final_w))
+    # ── 4. Tokenize ──────────────────────────────────────────────────────
+    if return_tensors == "pt":
+        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+    else:
+        input_ids = tokenizer(prompt).input_ids
+    result: Dict[str, Any] = {"input_ids": input_ids}
+    if pixel_list:
+        if return_tensors == "pt":
+            result["pixel_values"] = torch.from_numpy(np.stack(pixel_list))
+        else:
+            result["pixel_values"] = np.stack(pixel_list)
+        result["image_sizes"] = image_sizes
+    return result
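The token count for an image follows directly from `_image_to_num_tokens` and `build_image_token_str`. The sketch below reimplements the same arithmetic on raw width and height so it runs without PIL or cv2; `num_tokens` and `expanded_length` are hypothetical names, and the constants are the module defaults shown above.

```python
from typing import Tuple

PATCH_SIZE = 14
SPATIAL_MERGE_SIZE = 2
MAX_IMAGE_SIZE = 1400  # longest edge


def num_tokens(w: int, h: int) -> Tuple[int, int]:
    """(width_tokens, height_tokens) for an image of w x h pixels."""
    ratio = max(h / MAX_IMAGE_SIZE, w / MAX_IMAGE_SIZE)
    if ratio > 1:  # downscale so the longest edge fits
        w = round(w / ratio)
        h = round(h / ratio)
    cell = PATCH_SIZE * SPATIAL_MERGE_SIZE  # 28 pixels per merged token side
    return (w - 1) // cell + 1, (h - 1) // cell + 1  # ceiling division


def expanded_length(w_tok: int, h_tok: int) -> int:
    """Tokens in the expanded placeholder: start + pads + row breaks + end."""
    return 1 + w_tok * h_tok + (h_tok - 1) + 1


small = num_tokens(1024, 768)    # no downscaling needed
large = num_tokens(2800, 1400)   # longest edge is scaled down to 1400
small_len = expanded_length(*small)
large_len = expanded_length(*large)
resized_small = (small[0] * 28, small[1] * 28)  # (W, H) passed to cv2.resize
```

A 1024 x 768 image maps to 37 x 28 tokens and is resized up to 1036 x 784 pixels, so the pixel array is always an exact multiple of the 28-pixel cell.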

model-00001-of-00004.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d6bdc8ac3f1baef94a6a6fe6c5290031495e45ab18a7884f210dd8c157b33582
+size 4984302088

model-00002-of-00004.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6e4c3f9cac913ee79685bb55f4eee4c7220ae03ad50bfe83131401a0df023706
+size 4999802904

model-00003-of-00004.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:562159c86be4c9cdde7250b5b78f4636746fd1e82c9e2e10310f74d5696c3503
+size 4915916376

model-00004-of-00004.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7323f1fdacbf54bee85a2b4b074f88767c8d51754eb12495ce48f3f152048276
+size 2936115968
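Each shard above is stored as a three-line Git LFS pointer (`version`, `oid`, `size`). The sketch below parses one pointer and sums the four shard sizes, then compares the sum with the `total_size` and `total_parameters` values reported in model.safetensors.index.json. `parse_lfs_pointer` is a hypothetical helper, and the interpretation of the small size gap as per-file header overhead is an assumption.

```python
from typing import Dict, Union


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Parse a Git LFS pointer file into version, hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:7323f1fdacbf54bee85a2b4b074f88767c8d51754eb12495ce48f3f152048276
size 2936115968
"""
info = parse_lfs_pointer(pointer)

# Sizes copied from the four pointers above.
shard_sizes = [4984302088, 4999802904, 4915916376, 2936115968]
on_disk_total = sum(shard_sizes)

# Values copied from the index metadata.
index_total_size = 17836068864
index_total_parameters = 8918034432

bytes_per_parameter = index_total_size // index_total_parameters
overhead = on_disk_total - index_total_size  # assumed to be safetensors headers
```

Two bytes per parameter is consistent with the `"dtype": "bfloat16"` entry in config.json.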

model.safetensors.index.json ADDED

	@@ -0,0 +1,539 @@

+{
+  "metadata": {
+    "total_parameters": 8918034432,
+    "total_size": 17836068864
+  },
+  "weight_map": {
+    "diffusion_head.weight": "model-00004-of-00004.safetensors",
+    "encoder.embed_tokens.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.1.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.1.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.1.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.10.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.10.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.10.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.10.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.10.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.10.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.10.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.10.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.11.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.11.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.11.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.11.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.11.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.11.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.12.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.12.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.13.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.13.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.13.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.13.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.13.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.13.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.14.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.14.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.15.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.15.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.16.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.16.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.16.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.16.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.16.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.16.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.16.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.16.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.16.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.17.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.17.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.17.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.17.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.17.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.17.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.17.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.17.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.17.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.18.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.18.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.18.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.18.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.18.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.18.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.18.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.18.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.18.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.19.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.19.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.19.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.19.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.19.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.19.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.19.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.19.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.19.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.2.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.2.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.20.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.20.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.20.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.20.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.20.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.20.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.20.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.20.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.20.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.21.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.21.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.21.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.21.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.21.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.21.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.21.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.21.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.22.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.22.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.22.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.22.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.23.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.23.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.23.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.23.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.23.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.23.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.23.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.24.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.24.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.24.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.24.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.24.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.24.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.24.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.24.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.25.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.25.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.25.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.25.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.25.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.25.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.25.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.25.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.25.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.26.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.26.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.26.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.26.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.26.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.26.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.26.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.26.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.26.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.27.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.27.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.27.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.27.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.27.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.27.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.27.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.27.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.27.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.28.input_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.28.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.28.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.28.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.28.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.28.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.28.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.28.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.28.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.29.input_layernorm.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.29.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.29.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.29.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.29.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.29.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.29.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.29.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.29.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+    "encoder.layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.3.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.3.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.3.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.3.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.30.input_layernorm.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.30.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.30.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.30.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.30.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.30.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.30.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.30.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.30.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.31.input_layernorm.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.31.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.31.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.31.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.31.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.31.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.31.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.31.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.31.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.32.input_layernorm.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.32.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.32.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.32.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.32.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.32.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.32.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.32.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.32.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.33.input_layernorm.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.33.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.33.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.33.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.33.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.33.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.33.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.33.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.33.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
+    "encoder.layers.4.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.4.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.4.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.4.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.4.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.4.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.4.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.4.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.5.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.5.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.5.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.5.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.5.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.5.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.5.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.5.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.5.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.6.input_layernorm.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.6.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.6.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.6.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.6.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.6.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.6.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.6.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.6.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.layers.7.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.7.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.7.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.7.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.7.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.7.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.7.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.7.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.7.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.8.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.8.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.8.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.8.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.8.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.8.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.8.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.8.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.8.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.9.input_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.9.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.9.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.9.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.9.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.9.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.9.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.layers.9.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+    "encoder.multi_modal_projector.linear_1.weight": "model-00001-of-00004.safetensors",
+    "encoder.multi_modal_projector.linear_2.weight": "model-00001-of-00004.safetensors",
+    "encoder.multi_modal_projector.norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.multi_modal_projector.patch_merger.merging_layer.weight": "model-00001-of-00004.safetensors",
+    "encoder.norm.weight": "model-00004-of-00004.safetensors",
+    "encoder.vision_tower.ln_pre.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.patch_conv.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.0.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.0.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.0.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.0.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.0.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.0.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.0.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.0.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.0.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.1.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.1.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.1.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.1.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.1.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.1.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.1.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.1.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.1.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.10.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.10.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.10.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.10.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.10.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.10.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.10.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.10.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.10.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.11.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.11.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.11.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.11.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.11.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.11.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.11.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.11.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.11.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.12.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.12.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.12.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.12.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.12.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.12.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.12.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.12.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.12.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.13.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.13.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.13.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.13.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.13.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.13.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.13.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.13.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.13.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.14.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.14.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.14.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.14.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.14.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.14.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.14.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.14.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.14.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.15.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.15.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.15.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.15.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.15.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.15.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.15.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.15.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.15.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.16.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.16.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.16.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.16.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.16.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.16.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.16.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.16.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.16.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.17.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.17.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.17.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.17.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.17.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.17.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.17.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.17.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.17.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.18.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.18.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.18.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.18.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.18.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.18.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.18.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.18.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.18.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.19.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.19.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.19.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.19.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.19.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.19.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.19.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.19.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.19.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.2.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.2.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.2.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.2.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.2.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.2.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.2.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.2.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.2.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.20.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.20.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.20.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.20.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.20.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.20.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.20.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.20.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.20.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.21.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.21.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.21.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.21.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.21.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.21.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.21.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.21.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.21.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.22.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.22.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.22.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.22.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.22.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.22.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.22.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.22.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.22.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.23.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.23.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.23.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.23.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.23.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.23.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.23.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.23.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.23.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.3.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.3.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.3.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.3.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.3.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.3.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.3.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.3.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.3.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.4.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.4.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.4.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.4.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.4.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.4.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.4.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.4.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.4.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.5.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.5.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.5.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.5.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.5.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.5.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.5.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.5.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.5.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.6.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.6.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.6.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.6.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.6.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.6.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.6.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.6.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.6.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.7.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.7.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.7.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.7.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.7.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.7.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.7.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.7.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.7.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.8.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.8.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.8.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.8.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.8.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.8.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.8.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.8.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.8.ffn_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.9.attention.k_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.9.attention.o_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.9.attention.q_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.9.attention.v_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.9.attention_norm.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.9.feed_forward.down_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.9.feed_forward.gate_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.9.feed_forward.up_proj.weight": "model-00001-of-00004.safetensors",
+    "encoder.vision_tower.transformer.layers.9.ffn_norm.weight": "model-00001-of-00004.safetensors"
+  }
+}
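
The listing above maps each parameter name to the shard file that stores it, and its keys are ordered as strings, which is why `layers.19` precedes `layers.2`. The following is a minimal sketch of grouping such a map by shard and ordering names by numeric layer index. It assumes the usual sharded-checkpoint index layout with a top-level `weight_map` key (that key is not visible in this excerpt), and it uses a hypothetical three-entry excerpt in place of the real file; `group_by_shard` and `layer_index` are illustrative helper names, not part of the repository.

```python
import json
import re
from collections import defaultdict

# Hypothetical three-entry excerpt standing in for the full index file.
# The top-level "weight_map" key is an assumption based on the common
# sharded-checkpoint index layout.
INDEX_TEXT = """
{
  "weight_map": {
    "encoder.vision_tower.transformer.layers.19.ffn_norm.weight": "model-00001-of-00004.safetensors",
    "encoder.vision_tower.transformer.layers.2.attention.k_proj.weight": "model-00001-of-00004.safetensors",
    "encoder.vision_tower.transformer.layers.11.ffn_norm.weight": "model-00001-of-00004.safetensors"
  }
}
"""


def layer_index(name: str) -> int:
    """Return the numeric layer index in a parameter name, or -1 if absent."""
    match = re.search(r"\.layers\.(\d+)\.", name)
    return int(match.group(1)) if match else -1


def group_by_shard(index: dict) -> dict:
    """Group parameter names by shard file, ordered by numeric layer index."""
    shards = defaultdict(list)
    for name, shard in index["weight_map"].items():
        shards[shard].append(name)
    # Sort by (layer number, name) so that layer 2 comes before layer 11.
    return {
        shard: sorted(names, key=lambda n: (layer_index(n), n))
        for shard, names in shards.items()
    }


index = json.loads(INDEX_TEXT)
shards = group_by_shard(index)
for shard, names in shards.items():
    print(shard, len(names))
```

Sorting on the parsed integer rather than the raw string keeps the layers in architectural order, which can make it easier to spot a missing or misplaced tensor when inspecting a shard.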

model_cards/bias.md ADDED

	@@ -0,0 +1,4 @@

+Field                                                                                               |  Response
+:---------------------------------------------------------------------------------------------------|:---------------
+Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing:  |  [None]
+Measures taken to mitigate against unwanted bias:                                                   |  [None]

model_cards/explainability.md ADDED

	@@ -0,0 +1,13 @@

+Field                                                                                                  |  Response
+:------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------
+Intended Task/Domain:                                                                   |  Text generation
+Model Type:                                                                                            |  Transformer
+Intended Users:                                                                                        |  Generative AI creators working with conversational AI models.
+Output:                                                                                                |  Text (Responds to posed question, Stateful - remembers previous answers)
+Describe how the model works:                                                                          |  Text input is encoded into tokens and passed into a transformer-based language model, which returns a text response.
+Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of:  |  Not Applicable
+Technical Limitations & Mitigation:                                                                    |  The model cannot perform long-horizon reasoning or tool calling.
+Verified to have met prescribed NVIDIA quality standards:  |  Yes
+Performance Metrics:                                                                                   |  Accuracy, Latency, Throughput
+Potential Known Risks:                                                                                 |  In some instances, the model may think too long and struggle to derive final answers. The model's output can generate all forms of text, including what may be considered toxic, offensive, or indecent.
+Licensing:                                                                                             |  nvidia-open-model-license.

model_cards/privacy.md ADDED

	@@ -0,0 +1,11 @@

+Field                                                                                                                              |  Response
+:----------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------
+Generatable or reverse engineerable personal data?                                                     |  [No]
+Personal data used to create this model?                                                                                       |  [No]
+Was consent obtained for any personal data used?                                                                                             |  [Not Applicable]
+How often is dataset reviewed?                                                                                                     |  [During dataset creation, model training, evaluation, and the prerelease phase.]
+Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? |  [Yes]
+Is there provenance for all datasets used in training?                                                                                |  Yes
+Does data labeling (annotation, metadata) comply with privacy laws?                                                                |  Yes
+Is data compliant with data subject requests for data correction or removal, if such a request was made?                           | Not Applicable.
+Applicable Privacy Policy        | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/

model_cards/safety.md ADDED

	@@ -0,0 +1,6 @@

+Field                                               |  Response
+:---------------------------------------------------|:----------------------------------
+Model Application Field(s):                               |  [Media & Entertainment].
+Describe the life critical impact (if present).   |  Not Applicable
+Model and dataset restrictions:            |  The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.
+Use Case Restrictions: | Abide by nvidia-open-model-license.

modeling_ministral.py ADDED

	@@ -0,0 +1,629 @@

+from collections.abc import Callable
+from typing import Optional, Union
+import torch
+from torch import nn
+from transformers.utils.generic import check_model_inputs
+from transformers.activations import ACT2FN
+from transformers.cache_utils import Cache, DynamicCache
+from transformers.generation import GenerationMixin
+# from transformers.integrations import use_kernel_forward_from_hub, use_kernel_func_from_hub, use_kernelized_func
+from transformers.integrations import use_kernel_forward_from_hub
+from transformers.masking_utils import create_causal_mask, create_sliding_window_causal_mask, ALL_MASK_ATTENTION_FUNCTIONS, sdpa_mask_older_torch
+from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
+from transformers.modeling_layers import (
+    GenericForQuestionAnswering,
+    GenericForSequenceClassification,
+    GenericForTokenClassification,
+    GradientCheckpointingLayer,
+)
+from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
+from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
+from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
+from transformers.processing_utils import Unpack
+from transformers.utils import TransformersKwargs, auto_docstring, can_return_tuple
+from transformers.models.pixtral.modeling_pixtral import PixtralVisionModel
+from transformers.models.pixtral.configuration_pixtral import PixtralVisionConfig
+# from transformers.utils.generic import maybe_autocast
+from .configuration_nemotron_labs_diffusion_vlm import NemotronLabsDiffusionVLMConfig
+#ALL_MASK_ATTENTION_FUNCTIONS._global_mapping['sdpa'] = sdpa_mask_older_torch
+class Ministral3PatchMerger(nn.Module):
+    """
+    Learned merging of spatial_merge_size ** 2 patches
+    """
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        hidden_size = config.vision_config.hidden_size
+        self.spatial_merge_size = config.spatial_merge_size
+        self.patch_size = self.config.vision_config.patch_size
+        self.merging_layer = nn.Linear(hidden_size * self.spatial_merge_size**2, hidden_size, bias=False)
+    def forward(self, image_features: torch.Tensor, image_sizes: torch.Tensor) -> torch.Tensor:
+        image_sizes = [
+            (image_size[0] // self.patch_size, image_size[1] // self.patch_size) for image_size in image_sizes
+        ]
+        tokens_per_image = [h * w for h, w in image_sizes]
+        d = image_features.shape[-1]
+        permuted_tensor = []
+        for image_index, image_tokens in enumerate(image_features.split(tokens_per_image)):
+            # Reshape image_tokens into a 2D grid
+            h, w = image_sizes[image_index]
+            image_grid = image_tokens.view(h, w, d).permute(2, 0, 1).unsqueeze(0)
+            grid = torch.nn.functional.unfold(
+                image_grid, kernel_size=self.spatial_merge_size, stride=self.spatial_merge_size
+            )
+            grid = grid.view(d * self.spatial_merge_size**2, -1).t()
+            permuted_tensor.append(grid)
+        image_features = torch.cat(permuted_tensor, dim=0)
+        image_features = self.merging_layer(image_features)
+        return image_features
+class Ministral3MultiModalProjector(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.norm = Ministral3RMSNorm(config.vision_config.hidden_size, eps=config.rms_norm_eps)
+        self.patch_merger = Ministral3PatchMerger(config)
+        # We have hidden_size * the number of vision feature layers
+        self.num_feature_layers = (
+            1 if isinstance(config.vision_feature_layer, int) else len(config.vision_feature_layer)
+        )
+        self.linear_1 = nn.Linear(
+            config.vision_config.hidden_size * self.num_feature_layers,
+            config.hidden_size,
+            bias=config.multimodal_projector_bias,
+        )
+        self.act = ACT2FN[config.projector_hidden_act]
+        self.linear_2 = nn.Linear(
+            config.hidden_size, config.hidden_size, bias=config.multimodal_projector_bias
+        )
+    def forward(self, image_features: torch.Tensor, image_sizes: torch.Tensor):
+        image_features = self.norm(image_features)
+        image_features = self.patch_merger(image_features, image_sizes)
+        hidden_states = self.linear_1(image_features)
+        hidden_states = self.act(hidden_states)
+        hidden_states = self.linear_2(hidden_states)
+        return hidden_states
+def rotate_half(x):
+    """Rotates half the hidden dims of the input."""
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+# @use_kernel_func_from_hub("rotary_pos_emb")
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+    """Applies Rotary Position Embedding to the query and key tensors.
+    Args:
+        q (`torch.Tensor`): The query tensor.
+        k (`torch.Tensor`): The key tensor.
+        cos (`torch.Tensor`): The cosine part of the rotary embedding.
+        sin (`torch.Tensor`): The sine part of the rotary embedding.
+        position_ids (`torch.Tensor`, *optional*):
+            Deprecated and unused.
+        unsqueeze_dim (`int`, *optional*, defaults to 1):
+            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+    Returns:
+        `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
+    """
+    cos = cos.unsqueeze(unsqueeze_dim)
+    sin = sin.unsqueeze(unsqueeze_dim)
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
+def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+    """
+    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+    """
+    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+    if n_rep == 1:
+        return hidden_states
+    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+def eager_attention_forward(
+    module: nn.Module,
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    attention_mask: Optional[torch.Tensor],
+    scaling: float,
+    dropout: float = 0.0,
+    **kwargs: Unpack[TransformersKwargs],
+):
+    key_states = repeat_kv(key, module.num_key_value_groups)
+    value_states = repeat_kv(value, module.num_key_value_groups)
+    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
+    if attention_mask is not None:
+        causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+        attn_weights = attn_weights + causal_mask
+    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
+    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
+    attn_output = torch.matmul(attn_weights, value_states)
+    attn_output = attn_output.transpose(1, 2).contiguous()
+    return attn_output, attn_weights
+def _get_llama_4_attn_scale(positions_ids: torch.Tensor, beta: float, max_position_embeddings: int) -> torch.Tensor:
+    scaling = 1 + beta * torch.log(1 + torch.floor(positions_ids / max_position_embeddings))
+    return scaling.unsqueeze(-1)
+# @use_kernelized_func(apply_rotary_pos_emb)
+class Ministral3Attention(nn.Module):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+    def __init__(self, config: NemotronLabsDiffusionVLMConfig, layer_idx: int):
+        super().__init__()
+        self.config = config
+        self.layer_idx = layer_idx
+        self.head_dim = getattr(config, "head_dim", None) or config.hidden_size // config.num_attention_heads
+        self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
+        self.scaling = self.head_dim**-0.5
+        self.attention_dropout = config.attention_dropout
+        self.is_causal = True
+        self.q_proj = nn.Linear(config.hidden_size, config.num_attention_heads * self.head_dim, bias=False)
+        self.k_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=False)
+        self.v_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=False)
+        self.o_proj = nn.Linear(config.num_attention_heads * self.head_dim, config.hidden_size, bias=False)
+        self.diffusion_lm = config.diffusion_lm
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_embeddings: tuple[torch.Tensor, torch.Tensor],
+        attention_mask: Optional[torch.Tensor],
+        past_key_values: Optional[Cache] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = False,
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, self.head_dim)
+        query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        cos, sin = position_embeddings
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        query_states = query_states * _get_llama_4_attn_scale(
+            cache_position,
+            self.config.rope_parameters.get("llama_4_scaling_beta"),
+            self.config.rope_parameters.get("original_max_position_embeddings"),
+        ).to(query_states.dtype)
+        if past_key_values is not None:
+            if use_cache:
+                # sin and cos are specific to RoPE models; cache_position needed for the static cache
+                cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+                key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
+            else:
+                # use_cache is False: attend over the cached keys/values without updating the cache
+                layer_cache = past_key_values.layers[self.layer_idx]
+                key_states = torch.cat([layer_cache.keys, key_states], dim=-2)
+                value_states = torch.cat([layer_cache.values, value_states], dim=-2)
+        attention_interface: Callable = eager_attention_forward
+        if self.config._attn_implementation != "eager":
+            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+        if self.diffusion_lm:
+            attn_output, attn_weights = attention_interface(
+                self,
+                query_states,
+                key_states,
+                value_states,
+                None,
+                dropout=0.0 if not self.training else self.attention_dropout,
+                scaling=self.scaling,
+                is_causal=False,
+                **kwargs,
+            )
+        else:
+            attn_output, attn_weights = attention_interface(
+                self,
+                query_states,
+                key_states,
+                value_states,
+                attention_mask,
+                dropout=0.0 if not self.training else self.attention_dropout,
+                scaling=self.scaling,
+                sliding_window=getattr(self.config, "sliding_window", None),  # main diff with Llama
+                **kwargs,
+            )
+        attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+        attn_output = self.o_proj(attn_output)
+        return attn_output, attn_weights
+class Ministral3MLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.intermediate_size = config.intermediate_size
+        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+        self.act_fn = ACT2FN[config.hidden_act]
+    def forward(self, x):
+        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+        return down_proj
+@use_kernel_forward_from_hub("RMSNorm")
+class Ministral3RMSNorm(nn.Module):
+    def __init__(self, hidden_size, eps=1e-6):
+        """
+        Ministral3RMSNorm is equivalent to T5LayerNorm
+        """
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+    def forward(self, hidden_states):
+        input_dtype = hidden_states.dtype
+        hidden_states = hidden_states.to(torch.float32)
+        variance = hidden_states.pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+        return self.weight * hidden_states.to(input_dtype)
+    def extra_repr(self):
+        return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
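The normalization in `Ministral3RMSNorm.forward` above divides each vector by its root mean square and then applies a learned per-channel weight. A minimal pure-Python sketch of the same arithmetic follows; `rms_norm` is a hypothetical helper operating on a single vector, and it omits the float32 upcast that the module performs.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by 1 / sqrt(mean(x^2) + eps), then multiply by weight elementwise."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]


out = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(out)
```

With unit weights the output has a mean square very close to 1, which is the invariant the layer maintains before the learned weight rescales each channel. No mean is subtracted, unlike standard layer normalization.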
+class Ministral3DecoderLayer(GradientCheckpointingLayer):
+    def __init__(self, config: NemotronLabsDiffusionVLMConfig, layer_idx: int):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        attn_class = getattr(config, "attn_class", Ministral3Attention)
+        self.self_attn = attn_class(config=config, layer_idx=layer_idx)
+        self.mlp = Ministral3MLP(config)
+        self.input_layernorm = Ministral3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_attention_layernorm = Ministral3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        use_cache: Optional[bool] = False,
+        cache_position: Optional[torch.LongTensor] = None,
+        position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> torch.Tensor:
+        residual = hidden_states
+        hidden_states = self.input_layernorm(hidden_states)
+        # Self Attention
+        hidden_states, _ = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            position_embeddings=position_embeddings,
+            **kwargs,
+        )
+        hidden_states = residual + hidden_states
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+        return hidden_states
+@auto_docstring
+class Ministral3PreTrainedModel(PreTrainedModel):
+    config: NemotronLabsDiffusionVLMConfig
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    # Ministral3RMSNorm must be a separate FSDP unit to avoid weight sharded to size 0 on some ranks
+    _no_split_modules = ["Ministral3DecoderLayer", "Ministral3RMSNorm"]
+    _skip_keys_device_placement = ["past_key_values"]
+    _supports_flash_attn = True
+    _supports_sdpa = True
+    _supports_flex_attn = True
+    _can_compile_fullgraph = True
+    _supports_attention_backend = True
+    _can_record_outputs = {
+        "hidden_states": Ministral3DecoderLayer,
+        "attentions": Ministral3Attention,
+    }
+class Ministral3RotaryEmbedding(nn.Module):
+    inv_freq: torch.Tensor  # fix linting for `register_buffer`
+    def __init__(self, config: NemotronLabsDiffusionVLMConfig, device=None):
+        super().__init__()
+        self.max_seq_len_cached = config.max_position_embeddings
+        self.original_max_seq_len = config.max_position_embeddings
+        self.config = config
+        self.rope_type = self.config.rope_parameters["rope_type"]
+        rope_init_fn: Callable = self.compute_default_rope_parameters
+        if self.rope_type != "default":
+            rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+        inv_freq, self.attention_scaling = rope_init_fn(self.config, device)
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        self.original_inv_freq = inv_freq
+    @staticmethod
+    def compute_default_rope_parameters(
+        config: Optional[NemotronLabsDiffusionVLMConfig] = None,
+        device: Optional["torch.device"] = None,
+        seq_len: Optional[int] = None,
+    ) -> tuple["torch.Tensor", float]:
+        """
+        Computes the inverse frequencies according to the original RoPE implementation
+        Args:
+            config ([`~transformers.PreTrainedConfig`]):
+                The model configuration.
+            device (`torch.device`):
+                The device to use for initialization of the inverse frequencies.
+            seq_len (`int`, *optional*):
+                The current sequence length. Unused for this type of RoPE.
+        Returns:
+            Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
+            post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
+        """
+        base = config.rope_parameters["rope_theta"]
+        dim = getattr(config, "head_dim", None) or config.hidden_size // config.num_attention_heads
+        attention_factor = 1.0  # Unused in this type of RoPE
+        # Compute the inverse frequencies
+        inv_freq = 1.0 / (
+            base ** (torch.arange(0, dim, 2, dtype=torch.int64).to(device=device, dtype=torch.float) / dim)
+        )
+        return inv_freq, attention_factor
+    @torch.no_grad()
+    @dynamic_rope_update  # power user: used with advanced RoPE types (e.g. dynamic rope)
+    def forward(self, x, position_ids):
+        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
+        position_ids_expanded = position_ids[:, None, :].float()
+        # device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
+        # with maybe_autocast(device_type=device_type, enabled=False):  # Force float32
+        freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+        emb = torch.cat((freqs, freqs), dim=-1)
+        cos = emb.cos() * self.attention_scaling
+        sin = emb.sin() * self.attention_scaling
+        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
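The default rotary path above builds inverse frequencies `1 / base^(2i / dim)`, multiplies them by the position, duplicates the result, and takes cos and sin. The sketch below reproduces that for a single position in pure Python; `rope_cos_sin` is a hypothetical helper, `base=10000.0` is an assumed stand-in for `rope_theta`, and the attention scaling factor is taken as 1.0 as in the default RoPE type.

```python
import math


def rope_cos_sin(position, dim, base=10000.0):
    """Return (cos, sin) lists of length dim for one position."""
    # One inverse frequency per pair of channels: i = 0, 2, 4, ...
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    freqs = [position * f for f in inv_freq]
    # Duplicate so both halves of the head dimension share the same angles.
    emb = freqs + freqs
    return [math.cos(a) for a in emb], [math.sin(a) for a in emb]


cos, sin = rope_cos_sin(position=1, dim=4)
print(cos, sin)
```

Position 0 yields `cos = 1` and `sin = 0` in every channel, so the rotation is the identity there; higher channel pairs rotate more slowly because their inverse frequencies are smaller.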
+@auto_docstring
+class Ministral3Model(Ministral3PreTrainedModel):
+    def __init__(self, config: NemotronLabsDiffusionVLMConfig):
+        super().__init__(config)
+        vision_config = config.vision_config
+        if not isinstance(vision_config, PixtralVisionConfig):
+            vision_config = PixtralVisionConfig(**vision_config) if isinstance(vision_config, dict) else PixtralVisionConfig(**vars(vision_config))
+            config.vision_config = vision_config
+        self.vision_tower = PixtralVisionModel(vision_config)
+        self.multi_modal_projector = Ministral3MultiModalProjector(config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+        self.layers = nn.ModuleList(
+            [Ministral3DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+        )
+        self.norm = Ministral3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.rotary_emb = Ministral3RotaryEmbedding(config=config)
+        self.gradient_checkpointing = False
+        # Initialize weights and apply final processing
+        self.post_init()
+    @check_model_inputs
+    @auto_docstring
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        use_cache: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> BaseModelOutputWithPast:
+        if (input_ids is None) ^ (inputs_embeds is not None):
+            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids)
+        if use_cache and past_key_values is None:
+            # past_key_values = DynamicCache(config=self.config)
+            past_key_values = DynamicCache()
+        if cache_position is None:
+            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+            cache_position = torch.arange(
+                past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+            )
+        if position_ids is None:
+            position_ids = cache_position.unsqueeze(0)
+        if kwargs.get("use_causal_mask", False):
+            mask_function = create_causal_mask if self.config.sliding_window is None else create_sliding_window_causal_mask
+            causal_mask = mask_function(
+                config=self.config,
+                input_embeds=inputs_embeds,
+                attention_mask=attention_mask,
+                cache_position=cache_position,
+                past_key_values=past_key_values,
+                position_ids=position_ids,
+            )
+        else:
+            causal_mask = None
+        hidden_states = inputs_embeds
+        position_embeddings = self.rotary_emb(hidden_states, position_ids=position_ids)
+        for decoder_layer in self.layers[: self.config.num_hidden_layers]:
+            hidden_states = decoder_layer(
+                hidden_states,
+                attention_mask=causal_mask,
+                position_ids=position_ids,
+                past_key_values=past_key_values,
+                use_cache=use_cache,
+                cache_position=cache_position,
+                position_embeddings=position_embeddings,
+                **kwargs,
+            )
+        hidden_states = self.norm(hidden_states)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=past_key_values if use_cache else None,
+        )
+@auto_docstring
+class Ministral3ForCausalLM(Ministral3PreTrainedModel, GenerationMixin):
+    _tied_weights_keys = {"lm_head.weight": "model.embed_tokens.weight"}
+    _tp_plan = {"lm_head": "colwise_rep"}
+    _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = Ministral3Model(config)
+        self.vocab_size = config.vocab_size
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        # Initialize weights and apply final processing
+        self.post_init()
+    @can_return_tuple
+    @auto_docstring
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        logits_to_keep: Union[int, torch.Tensor] = 0,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> CausalLMOutputWithPast:
+        r"""
+        Example:
+        ```python
+        >>> from transformers import AutoTokenizer, Ministral3ForCausalLM
+        >>> model = Ministral3ForCausalLM.from_pretrained("meta-ministral3/Ministral3-2-7b-hf")
+        >>> tokenizer = AutoTokenizer.from_pretrained("meta-ministral3/Ministral3-2-7b-hf")
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+        ```"""
+        outputs: BaseModelOutputWithPast = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            **kwargs,
+        )
+        hidden_states = outputs.last_hidden_state
+        # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
+        logits = self.lm_head(hidden_states[:, slice_indices, :])
+        loss = None
+        if labels is not None:
+            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+class Ministral3ForTokenClassification(GenericForTokenClassification, Ministral3PreTrainedModel):
+    pass
+class Ministral3ForSequenceClassification(GenericForSequenceClassification, Ministral3PreTrainedModel):
+    pass
+class Ministral3ForQuestionAnswering(GenericForQuestionAnswering, Ministral3PreTrainedModel):
+    pass
+__all__ = [
+    "Ministral3ForCausalLM",
+    "Ministral3ForQuestionAnswering",
+    "Ministral3Model",
+    "Ministral3PreTrainedModel",
+    "Ministral3ForSequenceClassification",
+    "Ministral3ForTokenClassification",
+]

modeling_nemotron_labs_diffusion_vlm.py ADDED

	@@ -0,0 +1,1378 @@

+import copy
+from dataclasses import dataclass
+from typing import Callable, Optional, Tuple, Union
+import random
+import os
+import sys
+import json
+import numpy as np
+import torch
+import torch.nn.functional as F
+from torch import nn
+from transformers.modeling_outputs import CausalLMOutputWithPast, BaseModelOutput
+from transformers.utils import ModelOutput
+from torch.nn.attention.flex_attention import BlockMask, flex_attention, create_block_mask, or_masks
+from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
+from transformers.processing_utils import Unpack
+from transformers.cache_utils import Cache, DynamicCache
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+from transformers.generation import GenerationMixin
+from transformers.loss.loss_utils import LOSS_MAPPING
+import math
+from .chat_utils import generate_with_prefix_cache_block_diff
+from .modeling_ministral import Ministral3Model, Ministral3PreTrainedModel, Ministral3Attention, apply_rotary_pos_emb, repeat_kv, _get_llama_4_attn_scale
+from .configuration_nemotron_labs_diffusion_vlm import NemotronLabsDiffusionVLMConfig
+@dataclass
+class NemotronLabsDiffusionVLMOutputWithPast(ModelOutput):
+    loss: torch.FloatTensor | None = None
+    logits: torch.FloatTensor | None = None
+    causal_logits: torch.FloatTensor | None = None
+    past_key_values: Cache | None = None
+    hidden_states: tuple[torch.FloatTensor, ...] | None = None
+    attentions: tuple[torch.FloatTensor, ...] | None = None
+# @torch.compile(dynamic=True, mode="reduce-overhead")
+# @torch.compile(mode="default")
+# @torch.compile(fullgraph=True, mode="reduce-overhead", dynamic=False)
+@torch.compile(fullgraph=True, mode="max-autotune-no-cudagraphs", dynamic=False)
+def fused_flex_attention(q, k, v, block_mask=None):
+    return flex_attention(q, k, v, block_mask=block_mask)
+def _crop_dynamic_cache(past_key_values: DynamicCache, max_length: int):
+    """Crop a DynamicCache to max_length, compatible with both old and new transformers."""
+    if hasattr(past_key_values, 'crop'):
+        past_key_values.crop(max_length)
+    else:
+        for layer_idx in range(len(past_key_values)):
+            past_key_values.key_cache[layer_idx] = past_key_values.key_cache[layer_idx][:, :, :max_length]
+            past_key_values.value_cache[layer_idx] = past_key_values.value_cache[layer_idx][:, :, :max_length]
+        past_key_values._seen_tokens = max_length
+def _extract_draft_kv_cache(past_key_values: DynamicCache, clean_len: int, block_length: int):
+    """After quadratic decoding, extract only draft tokens (first of each block) from cache."""
+    for layer_idx in range(len(past_key_values)):
+        if hasattr(past_key_values, 'layers'):
+            layer_cache = past_key_values.layers[layer_idx]
+            k, v = layer_cache.keys, layer_cache.values
+        else:
+            k = past_key_values.key_cache[layer_idx]
+            v = past_key_values.value_cache[layer_idx]
+        clean_k, draft_k = k[:, :, :clean_len], k[:, :, clean_len::block_length + 1]
+        clean_v, draft_v = v[:, :, :clean_len], v[:, :, clean_len::block_length + 1]
+        new_k = torch.cat([clean_k, draft_k], dim=2)
+        new_v = torch.cat([clean_v, draft_v], dim=2)
+        if hasattr(past_key_values, 'layers'):
+            layer_cache.keys = new_k
+            layer_cache.values = new_v
+        else:
+            past_key_values.key_cache[layer_idx] = new_k
+            past_key_values.value_cache[layer_idx] = new_v
+    past_key_values._seen_tokens = clean_len + block_length
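The strided slice in `_extract_draft_kv_cache` above keeps every clean entry plus the first entry of each draft block. The sketch below shows which sequence positions survive, using integer positions in place of key/value tensors. It assumes the cache holds `clean_len` clean entries followed by `block_length * (block_length + 1)` draft entries, which matches the `draft_len` used elsewhere in this file; `extract_draft_positions` is a hypothetical helper.

```python
def extract_draft_positions(cache_positions, clean_len, block_length):
    """Keep the clean prefix and the first entry of each (block_length + 1)-sized draft block."""
    clean = cache_positions[:clean_len]
    draft = cache_positions[clean_len::block_length + 1]
    return clean + draft


clean_len, block_length = 3, 2
total = clean_len + block_length * (block_length + 1)  # 3 clean + 6 draft entries
kept = extract_draft_positions(list(range(total)), clean_len, block_length)
print(kept)  # [0, 1, 2, 3, 6]
```

Under that assumption the result has `clean_len + block_length` entries, which is the value the function writes to `_seen_tokens`.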
+# with reference to https://github.com/pytorch-labs/attention-gym/blob/main/examples/flex_attn.ipynb
+class NemotronLabsDiffusionVLMFlexAttention(Ministral3Attention):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.max_seq_length = getattr(self.config, 'max_seq_length', 4096)
+        self.block_size_orig = self.config.block_size
+        if self.config.dlm_paradigm == 'bidirectional':
+            self.bidirectional_mask = self.compute_block_mask(mode='bidirectional')
+        elif self.config.dlm_paradigm == 'autoregressive':
+            self.autoregressive_mask = self.compute_block_mask(mode='autoregressive')
+        elif self.config.dlm_paradigm == 'block_diff':
+            self.block_diff_mask = None
+        elif self.config.dlm_paradigm == 'sbd_block_diff':
+            self.sbd_block_diff_mask = None
+        else:
+            raise ValueError(f"Unknown attention mode: {self.config.dlm_paradigm}")
+        self.block_size = self.block_size_orig
+        self.mode = self.config.dlm_paradigm
+        self._quadratic_block_mask = {}
+        import torch._dynamo.config as dcfg
+        dcfg.cache_size_limit = 512
+    def _get_sbd_inference_quadratic_decoding_block_mask(self, block_length: int):
+        if block_length not in self._quadratic_block_mask:
+            draft_len = block_length * (block_length + 1)
+            def quadratic(b, h, q_idx, kv_idx):
+                first_clean = torch.logical_and(
+                    kv_idx % (block_length + 1) == 0,
+                    kv_idx < draft_len,
+                )
+                first_clean = torch.logical_and(first_clean, q_idx >= kv_idx)
+                block_q = q_idx // (block_length + 1)
+                block_kv = kv_idx // (block_length + 1)
+                same_block = torch.logical_and(block_q == block_kv, q_idx < draft_len)
+                same_block_except_first = torch.logical_and(
+                    same_block,
+                    q_idx % (block_length + 1) != 0,
+                )
+                draft_part = torch.logical_or(first_clean, same_block_except_first)
+                clean_part = kv_idx >= draft_len
+                return torch.logical_or(draft_part, clean_part)
+            block_mask = create_block_mask(
+                quadratic,
+                B=None,
+                H=None,
+                Q_LEN=draft_len,
+                KV_LEN=draft_len + self.config.max_position_embeddings,
+                device="cuda",
+            )
+            self._quadratic_block_mask[block_length] = block_mask
+        return self._quadratic_block_mask[block_length]
+    def set_attention_mode(self, mode, block_size=None):
+        self.mode = mode
+        self.block_size = block_size
+    def compute_block_mask(self, mode, q_len=None, block_size=None):
+        def bidirectional_mask(b, h, q, kv):
+            # Always True: every query position may attend to every key position
+            return (q >= kv) | (q < kv)
+        def autoregressive_mask(b, h, q, kv):
+            return (q >= kv)
+        def block_diff_mask(block_size, b, h, q_idx, kv_idx, n):
+            """
+            Constructs the specialized block diffusion attention mask for training
+            composed of three masks:
+            - **Block Diagonal Mask (M_BD)**: Self-attention within noised blocks
+            - **Offset Block Causal Mask (M_OBC)**: Cross-attention for conditional context
+            - **Block Causal Mask (M_BC)**: Attention to update x0
+            Args:
+                b, h: Batch and head indices (ignored for mask logic).
+                q_idx, kv_idx: Query and Key indices.
+                n: Length of the noised (xt) half; indices >= n belong to the clean (x0) half.
+                block_size: Defines the block structure.
+            Returns:
+                A boolean attention mask.
+            """
+            # Indicate whether token belongs to xt or x0
+            x0_flag_q = (q_idx >= n)
+            x0_flag_kv = (kv_idx >= n)
+            # Compute block indices
+            block_q = torch.where(x0_flag_q == 1,
+                                    (q_idx - n) // block_size,
+                                    q_idx // block_size)
+            block_kv = torch.where(x0_flag_kv == 1,
+                                    (kv_idx - n) // block_size,
+                                    kv_idx // block_size)
+            # **1. Block Diagonal Mask (M_BD) **
+            block_diagonal = (block_q == block_kv) & (x0_flag_q == x0_flag_kv)
+            # **2. Offset Block-Causal Mask (M_OBC) **
+            offset_block_causal = (
+                (block_q > block_kv)
+                & (x0_flag_kv == 1)
+                & (x0_flag_q == 0)
+            )
+            # **3. Block-Causal Mask (M_BC) **
+            block_causal = (block_q >= block_kv) & (x0_flag_kv == 1) & (x0_flag_q == 1)
+            # **4. Combine Masks **
+            return block_diagonal | offset_block_causal | block_causal
+        def sbd_block_diff_mask(block_size, b, h, q_idx, kv_idx, n):
+            x0_flag_q = (q_idx >= n)
+            x0_flag_kv = (kv_idx >= n)
+            # Compute block indices
+            block_q = torch.where(x0_flag_q == 1,
+                                    (q_idx - n) // block_size,
+                                    q_idx // block_size)
+            block_kv = torch.where(x0_flag_kv == 1,
+                                    (kv_idx - n) // block_size,
+                                    kv_idx // block_size)
+            # **1. Block Diagonal Mask (M_BD) **
+            block_diagonal = (block_q == block_kv) & (x0_flag_kv == 0) & (x0_flag_q == 0)
+            # **2. Offset Block-Causal Mask (M_OBC) **
+            offset_block_causal = (
+                (block_q > block_kv)
+                & (x0_flag_kv == 1)
+                & (x0_flag_q == 0)
+            )
+            # **3. Fully Causal Mask (M_BC) **
+            fully_causal = (q_idx >= kv_idx) & (x0_flag_kv == 1) & (x0_flag_q == 1)
+            # **4. Combine Masks **
+            return block_diagonal | offset_block_causal | fully_causal
+        if mode == 'bidirectional':
+            attn_mask = bidirectional_mask
+        elif mode == 'autoregressive':
+            attn_mask = autoregressive_mask
+        elif mode == 'block_diff':
+            assert block_size is not None
+            n = (q_len // 2) if q_len is not None else self.max_seq_length
+            attn_mask = lambda b, h, q, kv: block_diff_mask(block_size, b, h, q, kv, n)
+        elif mode == 'sbd_block_diff':
+            assert block_size is not None
+            n = (q_len // 2) if q_len is not None else self.max_seq_length
+            attn_mask = lambda b, h, q, kv: sbd_block_diff_mask(block_size, b, h, q, kv, n)
+        else:
+            raise ValueError(f"Unknown attention mode: {mode}")
+        if q_len is not None:
+            Q_LEN = q_len
+        else:
+            if mode in ['block_diff', 'sbd_block_diff']:
+                Q_LEN = self.max_seq_length * 2
+            else:
+                Q_LEN = self.max_seq_length
+        block_mask = create_block_mask(
+            attn_mask, B=None, H=None, Q_LEN=Q_LEN, KV_LEN=Q_LEN
+        )
+        return block_mask
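The `block_diff` training mask built in `compute_block_mask` above can be checked on a tiny example. The sketch below restates the `block_diff_mask` predicate for plain integer indices, with no `torch` or `create_block_mask`, and prints the dense mask for `n = 4` and `block_size = 2`. The name `block_diff_allowed` and the sizes are illustrative assumptions.

```python
def block_diff_allowed(q_idx, kv_idx, n, block_size):
    """Scalar restatement of block_diff_mask: may query q_idx attend to key kv_idx?"""
    x0_q = q_idx >= n    # True if the query is in the clean (x0) half
    x0_kv = kv_idx >= n  # True if the key is in the clean (x0) half
    block_q = (q_idx - n) // block_size if x0_q else q_idx // block_size
    block_kv = (kv_idx - n) // block_size if x0_kv else kv_idx // block_size
    block_diagonal = block_q == block_kv and x0_q == x0_kv
    offset_block_causal = block_q > block_kv and x0_kv and not x0_q
    block_causal = block_q >= block_kv and x0_kv and x0_q
    return block_diagonal or offset_block_causal or block_causal


n, block_size = 4, 2  # sequence layout: [xt_0..xt_3 | x0_0..x0_3]
mask = [
    [int(block_diff_allowed(q, kv, n, block_size)) for kv in range(2 * n)]
    for q in range(2 * n)
]
for row in mask:
    print(row)
```

In this layout a noised token in block 1 (row 2) sees its own noised block and the clean block 0, but not the clean copy of its own block. A clean token in block 1 (row 6) sees clean blocks 0 and 1 and no noised tokens.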
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_embeddings: Tuple[torch.Tensor, torch.Tensor],
+        attention_mask: Optional[torch.Tensor],
+        past_key_values: Optional[Cache] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        is_training: bool = True,
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+        bsz, q_len, _ = hidden_states.size()
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, self.head_dim)
+        query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        cos, sin = position_embeddings
+        if self.mode in ['block_diff', 'sbd_block_diff'] and is_training:
+            # Split query and key states in half along sequence length dimension
+            q1, q2 = query_states.chunk(2, dim=2)
+            k1, k2 = key_states.chunk(2, dim=2)
+            # Apply RoPE independently to each half
+            q1, k1 = apply_rotary_pos_emb(q1, k1, cos, sin)
+            q2, k2 = apply_rotary_pos_emb(q2, k2, cos, sin)
+            # Recombine the halves
+            query_states = torch.cat([q1, q2], dim=2)
+            key_states = torch.cat([k1, k2], dim=2)
+        else:
+            query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        query_states = query_states * _get_llama_4_attn_scale(
+            cache_position,
+            self.config.rope_parameters.get("llama_4_scaling_beta"),
+            self.config.rope_parameters.get("original_max_position_embeddings"),
+        ).to(query_states.dtype)
+        if past_key_values is not None:
+            # sin and cos are specific to RoPE models; cache_position needed for the static cache
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+            key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
+        self_spec_inference_mode = getattr(self.config, "self_spec_inference_mode", None)
+        if self_spec_inference_mode is not None:
+            if self_spec_inference_mode == "quadratic":
+                block_length = getattr(self.config, "block_length", None) or getattr(self.config, "block_size", None)
+                if block_length is None:
+                    raise ValueError("SBD quadratic decoding requires block_length in config.")
+                if past_key_values is not None:
+                    seq_len = key_states.shape[2]
+                    draft_len = block_length * (block_length + 1)
+                    clean_keys = key_states[:, :, :-draft_len]
+                    draft_keys = key_states[:, :, -draft_len:]
+                    clean_values = value_states[:, :, :-draft_len]
+                    draft_values = value_states[:, :, -draft_len:]
+                    key_states = torch.cat([draft_keys, clean_keys], dim=2)
+                    value_states = torch.cat([draft_values, clean_values], dim=2)
+                    block_mask = self._get_sbd_inference_quadratic_decoding_block_mask(
+                        block_length=block_length
+                    )
+                    block_mask.seq_lengths = (draft_len, seq_len)
+                else:
+                    seq_len = query_states.shape[2]
+                    draft_len = block_length * (block_length + 1)
+                    clean_len = seq_len - draft_len
+                    def _causal_mask(b, h, q_idx, kv_idx):
+                        return torch.logical_and(q_idx >= kv_idx, q_idx < clean_len)
+                    def _draft2clean_mask(b, h, q_idx, kv_idx):
+                        full_clean = torch.logical_and(q_idx >= clean_len, kv_idx <= clean_len)
+                        first_clean = torch.logical_and(
+                            q_idx >= clean_len, (kv_idx - clean_len) % (block_length + 1) == 0
+                        )
+                        first_clean = torch.logical_and(first_clean, q_idx >= kv_idx)
+                        return torch.logical_or(full_clean, first_clean)
+                    def _draft_mask(b, h, q_idx, kv_idx):
+                        block_q = (q_idx - clean_len) // (block_length + 1)
+                        block_kv = (kv_idx - clean_len) // (block_length + 1)
+                        quadrant = torch.logical_and(q_idx >= clean_len, kv_idx >= clean_len)
+                        same_block = torch.logical_and(block_q == block_kv, quadrant)
+                        same_block_except_first = torch.logical_and(
+                            same_block,
+                            (q_idx - clean_len) % (block_length + 1) != 0,
+                        )
+                        # same_block_except_first already implies block_q == block_kv
+                        return same_block_except_first
+                    mask = or_masks(_causal_mask, _draft2clean_mask)
+                    mask = or_masks(mask, _draft_mask)
+                    block_mask = create_block_mask(
+                        mask, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len,
+                    )
+                key_states = repeat_kv(key_states, self.num_key_value_groups)
+                value_states = repeat_kv(value_states, self.num_key_value_groups)
+                attn_output = flex_attention(query_states, key_states, value_states, block_mask=block_mask)
+                attn_output = attn_output.transpose(1, 2).reshape(*input_shape, -1).contiguous()
+                attn_output = self.o_proj(attn_output)
+                return attn_output, None
+            elif self_spec_inference_mode == "default":
+                block_length = getattr(self.config, "block_length", None) or getattr(self.config, "block_size", None)
+                if block_length is None:
+                    raise ValueError("SBD default decoding requires block_length in config.")
+                seq_len = query_states.shape[2]
+                prefix_len = seq_len - block_length
+                def _clean_q_mask(b, h, q_idx, kv_idx):
+                    return torch.logical_and(q_idx >= kv_idx, q_idx < prefix_len)
+                def _noisy_q_mask(b, h, q_idx, kv_idx):
+                    return q_idx >= prefix_len
+                block_mask = create_block_mask(
+                    or_masks(_clean_q_mask, _noisy_q_mask),
+                    B=None,
+                    H=None,
+                    Q_LEN=seq_len,
+                    KV_LEN=seq_len,
+                )
+                key_states = repeat_kv(key_states, self.num_key_value_groups)
+                value_states = repeat_kv(value_states, self.num_key_value_groups)
+                attn_output = flex_attention(query_states, key_states, value_states, block_mask=block_mask)
+                attn_output = attn_output.transpose(1, 2).reshape(*input_shape, -1).contiguous()
+                attn_output = self.o_proj(attn_output)
+                return attn_output, None
+        key_states = repeat_kv(key_states, self.num_key_value_groups)
+        value_states = repeat_kv(value_states, self.num_key_value_groups)
+        if self.mode == 'bidirectional':
+            if self.bidirectional_mask is None or q_len != self.bidirectional_mask.shape[-2]:
+                block_mask = self.compute_block_mask(mode='bidirectional', q_len=q_len)
+            else:
+                block_mask = self.bidirectional_mask
+        elif self.mode == 'autoregressive':
+            if self.autoregressive_mask is None or q_len != self.autoregressive_mask.shape[-2]:
+                block_mask = self.compute_block_mask(mode='autoregressive', q_len=q_len)
+            else:
+                block_mask = self.autoregressive_mask
+        elif self.mode == 'block_diff':
+            if self.block_diff_mask is None or self.block_size != self.block_size_orig or q_len != self.block_diff_mask.shape[-2]:
+                block_mask = self.compute_block_mask(mode='block_diff', block_size=self.block_size, q_len=q_len)
+            else:
+                block_mask = self.block_diff_mask
+        elif self.mode == 'sbd_block_diff':
+            if self.sbd_block_diff_mask is None or self.block_size != self.block_size_orig or q_len != self.sbd_block_diff_mask.shape[-2]:
+                block_mask = self.compute_block_mask(mode='sbd_block_diff', block_size=self.block_size, q_len=q_len)
+            else:
+                block_mask = self.sbd_block_diff_mask
+        else:
+            raise ValueError(f"Unknown attention mode: {self.mode}")
+        attn_output = fused_flex_attention(query_states, key_states, value_states, block_mask=block_mask)
+        attn_output = attn_output.transpose(1, 2).reshape(*input_shape, -1).contiguous()
+        attn_output = self.o_proj(attn_output)
+        return attn_output, None
+def gumbel_topk(log_w: torch.Tensor, k: int) -> torch.Tensor:
+    """Sample k indices without replacement (Gumbel top-k on log-weights log_w); return a bool mask shaped like log_w with exactly k True."""
+    g = -torch.log(-torch.log(torch.rand_like(log_w) + 1e-9) + 1e-9)
+    topk = torch.topk(log_w + g, k).indices
+    mask = torch.zeros_like(log_w, dtype=torch.bool)
+    mask[topk] = True
+    return mask
+class NemotronLabsDiffusionVLMModel(Ministral3PreTrainedModel, GenerationMixin):
+    """
+    A single model with:
+      - a bidirectional encoder + diffusion‐LM head over A
+      - a causal decoder + LM head over B, conditioned on F_A
+    """
+    def __init__(self, config: NemotronLabsDiffusionVLMConfig):
+        super().__init__(config)
+        self.mask_token_id = config.mask_token_id
+        diffusion_config = copy.deepcopy(config)
+        diffusion_config.diffusion_lm = True
+        use_flex = getattr(config, 'enable_self_spec', False)
+        if config.dlm_paradigm in ['block_diff', 'sbd_block_diff']:
+            diffusion_config.attn_class = NemotronLabsDiffusionVLMFlexAttention
+        elif config.dlm_paradigm in ['bidirectional', 'autoregressive']:
+            diffusion_config.attn_class = NemotronLabsDiffusionVLMFlexAttention if use_flex else Ministral3Attention
+            if config.dlm_paradigm == 'autoregressive':
+                diffusion_config.diffusion_lm = False
+        else:
+            raise ValueError(f"Unsupported DLM paradigm: {config.dlm_paradigm}")
+        self.encoder = Ministral3Model(diffusion_config)
+        self.diffusion_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.vocab_size = config.vocab_size
+        self.current_iter_ratio = None
+        self.mdm_loss_function = LOSS_MAPPING['ForMaskedLM']
+        self.causal_loss_function = LOSS_MAPPING['ForCausalLM']
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.encoder.embed_tokens
+    def set_input_embeddings(self, value):
+        self.encoder.embed_tokens = value
+    def get_output_embeddings(self):
+        return self.diffusion_head
+    def set_output_embeddings(self, new_embeddings):
+        self.diffusion_head = new_embeddings
+    def forward_process_complementary(self, input_ids, eps=1e-3, block_size=None, loss_mask=None):
+        device = input_ids.device
+        if self.config.dp_varying_mask_ratio:
+            import torch.distributed as dist
+            dp_rank = 0
+            if dist.is_initialized():
+                try:
+                    dp_rank = dist.get_rank()
+                except Exception:
+                    dp_rank = 0
+            generator = torch.Generator(device=device)
+            generator.manual_seed(torch.seed() + dp_rank)
+        else:
+            generator = None
+        noisy_input_ids = input_ids.clone()
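+        if block_size is None:
+            # No block size given: treat each sequence as a single block
+            block_size = input_ids.shape[1]
+        input_ids_flat = input_ids.reshape(input_ids.shape[0] * input_ids.shape[1] // block_size, block_size)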
+        b, l = input_ids_flat.shape
+        t = torch.rand((b,), device=input_ids.device, generator=generator)
+        p_mask = (1 - eps) * t + eps
+        p_mask = p_mask[:, None].repeat(1, l)
+        masked_indices = (torch.rand((b, l), device=input_ids.device, generator=generator) < p_mask).reshape(noisy_input_ids.shape)
+        input_ids_flat = input_ids_flat.reshape(noisy_input_ids.shape)
+        complementary_noisy_input_ids = input_ids.clone()
+        complementary_masked_indices = ~masked_indices
+        if getattr(self.config, 'always_mask_im_end', False):
+            im_end_mask = (input_ids == self.config.im_end_token_id)
+            masked_indices = masked_indices | im_end_mask
+            complementary_masked_indices = complementary_masked_indices | im_end_mask
+        if loss_mask is not None:
+            masked_indices[loss_mask == 0] = 0
+            complementary_masked_indices[loss_mask == 0] = 0
+        noisy_input_ids[masked_indices] = self.mask_token_id
+        complementary_noisy_input_ids[complementary_masked_indices] = self.mask_token_id
+        noisy_input_ids = torch.cat([noisy_input_ids, complementary_noisy_input_ids], dim=0)
+        masked_indices = torch.cat([masked_indices, complementary_masked_indices], dim=0)
+        return noisy_input_ids, masked_indices, None
+    # ── Vision / multimodal helpers (ported from Mistral3Model) ──────────
+    IMAGE_TOKEN_ID = 19
+    def get_image_features(
+        self,
+        pixel_values: torch.FloatTensor,
+        image_sizes: torch.Tensor,
+    ) -> torch.FloatTensor:
+        """
+        Run the vision tower + multimodal projector and return a flat tensor
+        of image features ready to be scattered into the text embeddings.
+        Mirrors ``Mistral3Model.get_image_features`` from
+        transformers/models/mistral3/modeling_mistral3.py.
+        Returns:
+            Flat (total_image_tokens, hidden_size) tensor.
+        """
+        vision_feature_layer = getattr(self.config, "vision_feature_layer", -1)
+        image_outputs = self.encoder.vision_tower(
+            pixel_values,
+            image_sizes=image_sizes,
+            output_hidden_states=True,
+            return_dict=True,
+        )
+        if isinstance(vision_feature_layer, int):
+            selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
+        else:
+            hs_pool = [image_outputs.hidden_states[idx] for idx in vision_feature_layer]
+            selected_image_feature = torch.cat(hs_pool, dim=-1)
+        image_features = self.encoder.multi_modal_projector(
+            selected_image_feature.squeeze(0), image_sizes,
+        )
+        # Split per image, then re-cat into one flat tensor
+        downsample_ratio = (
+            self.encoder.vision_tower.patch_size
+            * getattr(self.config, "spatial_merge_size", 2)
+        )
+        split_sizes = (
+            (torch.as_tensor(image_sizes, device=image_features.device) // downsample_ratio)
+            .prod(dim=-1)
+            .tolist()
+        )
+        # per_image = torch.split(image_features.squeeze(0), split_sizes)
+        per_image = torch.split(image_features, split_sizes)
+        return torch.cat(per_image, dim=0)          # (total_tokens, hidden)
+    def _is_vision_frozen(self) -> bool:
+        """True if vision_tower and multi_modal_projector have no parameters requiring grad (e.g. --freeze_vision_encoder)."""
+        vt = self.encoder.vision_tower
+        proj = self.encoder.multi_modal_projector
+        vt_has_grad = any(p.requires_grad for p in vt.parameters())
+        proj_has_grad = any(p.requires_grad for p in proj.parameters())
+        return not vt_has_grad and not proj_has_grad
+    def _embed_with_vision(
+        self,
+        input_ids: torch.LongTensor,
+        pixel_values: torch.FloatTensor,
+        image_sizes: torch.Tensor,
+    ) -> torch.FloatTensor:
+        """
+        Embed *input_ids* and scatter vision features into [IMG] pad positions.
+        Returns:
+            inputs_embeds  (batch, seq_len, hidden_size)
+        """
+        inputs_embeds = self.encoder.embed_tokens(input_ids)
+        image_features = self.get_image_features(pixel_values, image_sizes)
+        image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
+        # Boolean mask: positions that are IMG pad tokens
+        special_image_mask = (input_ids == self.IMAGE_TOKEN_ID)
+        if self.training:
+            if self.config.complementary_mask:
+                image_features = image_features.repeat(2, 1)
+            if self.config.dlm_paradigm in ['block_diff', 'sbd_block_diff']:
+                image_features = image_features.repeat(2, 1)
+        assert special_image_mask.sum() == image_features.shape[0], f"special_image_mask.sum() = {special_image_mask.sum()}, image_features.shape[0] = {image_features.shape[0]}"
+        # Expand to hidden dim for masked_scatter
+        special_image_mask = special_image_mask.unsqueeze(-1).expand_as(inputs_embeds)
+        inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
+        return inputs_embeds
+    def forward_process(self, input_ids, eps=1e-3, block_size=None, loss_mask=None):
+        b, l = input_ids.shape
+        device = input_ids.device
+        if self.config.dp_varying_mask_ratio:
+            # Enable different random seeds for each DP rank during sampling
+            import torch.distributed as dist
+            dp_rank = 0
+            if dist.is_initialized():
+                try:
+                    dp_rank = dist.get_rank()
+                except Exception:
+                    dp_rank = 0
+            # Draw mask randomness from a local generator (note: torch.seed() below also reseeds the global RNG)
+            generator = torch.Generator(device=device)
+            generator.manual_seed(torch.seed() + dp_rank)
+        else:
+            generator = None
+        if self.config.adaptive_mask_rate:
+            assert block_size is not None
+            # --- simple linear window mapping ---
+            bs_min = getattr(self.config, "t_bs_min", 16)
+            bs_max = getattr(self.config, "t_bs_max", 128)
+            w = getattr(self.config, "t_window_width", 0.6)  # fixed width
+            # fraction in [0,1] (unclamped first)
+            frac = (float(block_size) - float(bs_min)) / max(1.0, float(bs_max - bs_min))
+            # upper bound decreases linearly from 1.0 to 1.0 - w as block_size goes from bs_min to bs_max
+            u_max = 1.0 - w * frac
+            # clamp to [0.6, 1.0]; this also covers block sizes outside [bs_min, bs_max]
+            u_max = max(0.6, min(1.0, u_max))
+            u_min = u_max - w  # ensures width = w
+            # sample t ~ Uniform(u_min, u_max)
+            t = u_min + (u_max - u_min) * torch.rand(b, device=device, generator=generator)
+        else:
+            t = torch.rand(b, device=device, generator=generator)
+        p_mask = (1 - eps) * t + eps  # shape: (b,)
+        p_mask = p_mask[:, None].expand(-1, l)  # shape: (b, l)
+        masked_indices = torch.rand((b, l), device=device, generator=generator) < p_mask
+        if loss_mask is not None:
+            masked_indices[loss_mask == 0] = 0
+        noisy_batch = torch.where(masked_indices, self.mask_token_id, input_ids)
+        return noisy_batch, masked_indices, p_mask
+    def forward_process_exp(
+        self,
+        input_ids: torch.Tensor,
+        eps: float = 1e-3,
+        block_size: int | None = None,
+        half_life_ratio: float = 0.25, # λ = ln 2 / (half_life_ratio·L)
+        loss_mask: Optional[torch.Tensor] = None,
+    ):
+        """
+        Two-stage corruption with optional per-block sampling.
+        • Stage 1:  m ~ U(eps, 1)   →   k = round(m · len)  (exact budget).
+        • Stage 2:  sample exactly k positions with weights
+                    w_i(m) = exp[ λ · (1−m) · i ]   (late-heavy when m→0,
+                                                     uniform when m→1).
+          If `block_size` is given, the procedure is run *independently*
+          inside each contiguous block of that length (last block may be shorter).
+          When block_size is provided, m is sampled per-block and p_mask is per-block.
+        Args
+        ----
+        input_ids : (B, L)  LongTensor
+        eps       : minimum corruption ratio
+        block_size: if not None, operate block-wise with per-block m sampling
+        half_life_ratio : controls steepness when m→0
+        loss_mask : optional (B, L) mask; positions where it is 0 are never masked
+        """
+        B, L = input_ids.shape
+        device = input_ids.device
+        dtype  = torch.float32
+        masked_indices = torch.zeros((B, L), dtype=torch.bool, device=device)
+        p_mask = torch.zeros((B, L), dtype=dtype, device=device)
+        # ---------- Stage 1 & 2: whole-sentence or block-wise -------------------
+        for b in range(B):
+            if block_size is None:
+                # ---------- Per-sequence sampling (original behavior) ----------
+                m = eps + (1.0 - eps) * torch.rand(1, device=device).item()   # scalar
+                k_tot = int(round(m * L))
+                k_tot = max(1, min(k_tot, L))  # clamp to [1, L]
+                # Fill p_mask for this sequence
+                p_mask[b, :] = m
+                slope = 1.0 - m          # ∈ [0,1]; 0 ⇒ uniform, 1 ⇒ late-heavy
+                # ------- single pool over the whole sentence -------------
+                lam_base = math.log(2.0) / (half_life_ratio * L) # base decay rate (λ when slope=1)
+                pos   = torch.arange(L, device=device, dtype=dtype)
+                log_w = (lam_base * slope * pos).clone()
+                masked_indices[b] = gumbel_topk(log_w, k_tot)
+            else:
+                # ---------- Per-block sampling ----------
+                num_blocks = math.ceil(L / block_size)
+                lam_base = math.log(2.0) / (half_life_ratio * block_size) # base decay rate (λ when slope=1)
+                for blk in range(num_blocks):
+                    start = blk * block_size
+                    end   = min((blk + 1) * block_size, L)
+                    blk_len = end - start
+                    # Sample m per block
+                    m_blk = eps + (1.0 - eps) * torch.rand(1, device=device).item()
+                    # Fill p_mask for this block
+                    p_mask[b, start:end] = m_blk
+                    # per-block budget
+                    k_blk = int(round(m_blk * blk_len))
+                    k_blk = max(0, min(k_blk, blk_len))
+                    if k_blk == 0:
+                        continue
+                    slope = 1.0 - m_blk          # ∈ [0,1]; 0 ⇒ uniform, 1 ⇒ late-heavy
+                    pos   = torch.arange(blk_len, device=device, dtype=dtype)
+                    log_w = lam_base * slope * pos
+                    blk_mask = gumbel_topk(log_w, k_blk)
+                    masked_indices[b, start:end] = blk_mask
+        if loss_mask is not None:
+            masked_indices[loss_mask == 0] = 0
+        noisy_batch = torch.where(masked_indices, self.mask_token_id, input_ids)
+        return noisy_batch, masked_indices, p_mask
+    def forward(
+        self,
+        input_ids: torch.LongTensor,
+        attention_mask: Optional[torch.Tensor]   = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        labels: Optional[torch.LongTensor]       = None,
+        split_len: Optional[int]                 = None,
+        past_key_values: Optional[Cache]         = None,
+        block_size: Optional[int]                = None,
+        block_diff_ppl: bool                     = False,
+        eps: float                               = 1e-3,
+        is_teacher: bool                        = False,
+        masked_indices: Optional[torch.Tensor]   = None,
+        p_mask: Optional[torch.Tensor]           = None,
+        teacher_logits: Optional[torch.Tensor]   = None,
+        masked_indices_teacher: Optional[torch.Tensor] = None,
+        loss_mask: Optional[torch.Tensor] = None,
+        ce_loss_weight: float = 1.0,
+        output_last_hidden_states_only: bool = False,
+        skip_loss: bool = False,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        image_sizes: Optional[torch.Tensor]      = None,
+        **kwargs,
+    ) -> CausalLMOutputWithPast:
+        batch_size, seq_len = input_ids.shape
+        if self.config.dlm_paradigm in ['bidirectional', 'autoregressive']:
+            if labels is not None and torch.rand(1) < self.config.random_length_prob:
+                random_length = torch.randint(2, input_ids.shape[1] + 1, (1,))
+                input_ids = input_ids[:, :random_length]
+                labels = labels[:, :random_length]
+                if attention_mask is not None:
+                    attention_mask = attention_mask[:, :random_length]
+                if position_ids is not None:
+                    position_ids = position_ids[:, :random_length]
+                if loss_mask is not None:
+                    loss_mask = loss_mask[:, :random_length]
+        elif self.config.dlm_paradigm in ['block_diff', 'sbd_block_diff']:
+            if labels is not None and block_size is None:
+                if torch.rand(1) < self.config.random_length_prob:
+                    block_size = torch.randint(1, 8, (1,)).item() * 4  ## multiple of 4 in [4, 28] (randint's upper bound is exclusive)
+                else:
+                    block_size = self.config.block_size
+        else:
+            raise ValueError(f"Unknown dLM paradigm: {self.config.dlm_paradigm}")
+        if labels is not None and self.config.dlm_paradigm != 'autoregressive':
+            if masked_indices is not None:
+                # assert p_mask is not None
+                if loss_mask is not None:
+                    masked_indices[loss_mask == 0] = 0
+                noisy_inputs = torch.where(masked_indices, self.mask_token_id, input_ids)
+            else:
+                if self.config.complementary_mask:
+                    loss_mask = (labels != -100)
+                    noisy_inputs, masked_indices, p_mask = self.forward_process_complementary(input_ids, eps=eps, block_size=block_size, loss_mask=loss_mask)
+                else:
+                    if self.config.tok_mask_half_life_ratio is not None:
+                        noisy_inputs, masked_indices, p_mask = self.forward_process_exp(input_ids, eps=eps, block_size=block_size, half_life_ratio=self.config.tok_mask_half_life_ratio, loss_mask=loss_mask)
+                    else:
+                        noisy_inputs, masked_indices, p_mask = self.forward_process(input_ids, eps=eps, block_size=block_size, loss_mask=loss_mask)
+        else:
+            noisy_inputs = input_ids
+            masked_indices = None
+            p_mask = None
+        if self.config.dlm_paradigm in ['block_diff', 'sbd_block_diff']:
+            for layer in self.encoder.layers:
+                if hasattr(layer.self_attn, 'set_attention_mode'):
+                    layer.self_attn.set_attention_mode(self.config.dlm_paradigm, block_size=block_size)
+        input_ids_len = noisy_inputs.shape[1]
+        if labels is not None and self.config.dlm_paradigm in ['block_diff', 'sbd_block_diff']:
+            if position_ids is None:
+                position_ids = torch.arange(input_ids_len, device=noisy_inputs.device).unsqueeze(0)
+            if self.config.complementary_mask:
+                noisy_inputs = torch.cat([noisy_inputs, torch.cat([input_ids, input_ids], dim=0)], dim=1)
+            else:
+                noisy_inputs = torch.cat([noisy_inputs, input_ids], dim=1)
+        if block_diff_ppl:
+            if position_ids is None:
+                position_ids = torch.arange(input_ids_len // 2, device=noisy_inputs.device).unsqueeze(0)
+        # ── Vision: replace IMG pad embeddings with image features ────────
+        if pixel_values is not None and image_sizes is not None:
+            inputs_embeds = self._embed_with_vision(noisy_inputs, pixel_values, image_sizes)
+            enc_out = self.encoder(
+                past_key_values=past_key_values,
+                inputs_embeds=inputs_embeds,
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                is_training=(labels is not None) or (block_diff_ppl),
+                **kwargs,
+            )
+        elif self.training and pixel_values is None and not self._is_vision_frozen():
+            vt = self.encoder.vision_tower
+            _p = vt.patch_size
+            _merge = getattr(self.config, "spatial_merge_size", 2)
+            _side = _p * _merge
+            _c = getattr(vt.config, "num_channels", 3)
+            _dtype = next(vt.parameters()).dtype
+            dummy_pixel = torch.zeros(
+                1, _c, _side, _side,
+                dtype=_dtype, device=noisy_inputs.device,
+            )
+            dummy_image_sizes = torch.tensor(
+                [(int(_side), int(_side))],
+                dtype=torch.long, device=noisy_inputs.device,
+            )
+            dummy_features = self.get_image_features(dummy_pixel, dummy_image_sizes)
+            inputs_embeds = self.encoder.embed_tokens(noisy_inputs)
+            inputs_embeds = inputs_embeds + dummy_features.sum() * 0
+            enc_out = self.encoder(
+                past_key_values=past_key_values,
+                inputs_embeds=inputs_embeds,
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                is_training=(labels is not None) or (block_diff_ppl),
+                **kwargs,
+            )
+        else:
+            enc_out = self.encoder(
+                past_key_values=past_key_values,
+                input_ids=noisy_inputs,
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                is_training=(labels is not None) or (block_diff_ppl),
+                **kwargs,
+            )
+        if output_last_hidden_states_only:
+            return BaseModelOutput(last_hidden_state=enc_out.last_hidden_state)
+        logits = self.diffusion_head(enc_out.last_hidden_state)  # (batch, seq_len, vocab)
+        causal_logits = None
+        if labels is not None and self.config.dlm_paradigm in ['block_diff', 'sbd_block_diff']:
+            if self.config.dlm_paradigm == 'sbd_block_diff':
+                causal_logits = logits[:, input_ids_len:]
+            else:
+                causal_logits = None
+            logits = logits[:, :input_ids_len]
+        loss = None
+        if getattr(self.config, 'complementary_mask', False) and self.config.dlm_paradigm == 'sbd_block_diff':
+            _raw_nib = kwargs.get('num_items_in_batch', None)
+            kwargs = {**kwargs, 'num_items_in_batch': 2 * kwargs.get('num_items_in_batch', 1)}
+            if self.training and (not hasattr(self, '_nib_logged') or not self._nib_logged):
+                import torch.distributed as dist
+                _rank = dist.get_rank() if dist.is_initialized() else 0
+                if _rank == 0:
+                    print(f"[DEBUG-NIB] raw num_items_in_batch from Trainer: {_raw_nib}, "
+                          f"after 2x: {kwargs['num_items_in_batch']}, "
+                          f"labels non-(-100): {(labels != -100).sum().item() if labels is not None else 'N/A'}, "
+                          f"batch_size={input_ids.shape[0]}, seq_len={input_ids.shape[1]}", flush=True)
+                    self._nib_logged = True
+        if labels is not None and not skip_loss:
+            if self.config.dlm_paradigm == 'autoregressive':
+                shift_logits = logits[..., :-1, :].contiguous()
+                shift_labels = labels[..., 1:].contiguous()
+                if loss_mask is None:
+                    loss_fct = CrossEntropyLoss()
+                    shift_logits = shift_logits.view(-1, shift_logits.size(-1))
+                    shift_labels = shift_labels.view(-1)
+                    loss = loss_fct(shift_logits, shift_labels)
+                else:
+                    loss_mask = loss_mask[..., 1:].contiguous()
+                    loss_fct = CrossEntropyLoss(reduction='none')
+                    shift_logits = shift_logits.view(-1, shift_logits.size(-1))
+                    shift_labels = shift_labels.view(-1)
+                    shift_labels = shift_labels.to(shift_logits.device)
+                    token_losses = loss_fct(shift_logits, shift_labels)
+                    flat_loss_mask = loss_mask.reshape(-1)
+                    loss = token_losses[flat_loss_mask == 1].sum() / flat_loss_mask.sum()
+            else:
+                # Handle DREAM vs LLADA style losses
+                if hasattr(self.config, 'dlm_type') and self.config.dlm_type == 'dream':
+                    logits = logits[..., :-1, :].contiguous()
+                    labels = labels[..., 1:].contiguous()
+                    masked_indices = masked_indices[:, 1:]
+                    if p_mask is not None:
+                        p_mask = p_mask[:, 1:]
+                if self.config.ada_perm_ratio_per_block is not None:
+                    # Only compute loss for the top ada_perm_ratio_per_block tokens by confidence within each block
+                    block_size = self.config.block_size
+                    batch_size, seq_len = masked_indices.shape
+                    num_blocks = seq_len // block_size
+                    # Get the max logit (confidence) for each position
+                    confidence = logits.max(dim=-1).values.detach()  # (batch_size, seq_len)
+                    # Create a mask for tokens to include in loss
+                    selected_mask = torch.zeros_like(masked_indices, dtype=torch.bool)
+                    for blk in range(num_blocks):
+                        start = blk * block_size
+                        end = min((blk + 1) * block_size, seq_len)
+                        # Get masked indices within this block
+                        block_masked = masked_indices[:, start:end]  # (batch_size, block_len)
+                        block_confidence = confidence[:, start:end]  # (batch_size, block_len)
+                        for b in range(batch_size):
+                            # Get positions that are masked in this block for this batch
+                            masked_positions = torch.where(block_masked[b])[0]
+                            num_masked = len(masked_positions)
+                            if num_masked > 0:
+                                # Number of tokens to keep (top by confidence)
+                                k = min(max(1, int(block_size * self.config.ada_perm_ratio_per_block)), num_masked)
+                                # Get confidence values for masked positions
+                                masked_confidence = block_confidence[b, masked_positions]
+                                # Get indices of top-k confident tokens
+                                _, topk_indices = torch.topk(masked_confidence, k)
+                                selected_positions = masked_positions[topk_indices]
+                                # Mark these positions in the selected mask
+                                selected_mask[b, start + selected_positions] = True
+                    # Calculate loss only for selected positions
+                    token_loss = torch.nn.functional.cross_entropy(
+                        logits[selected_mask],
+                        labels[selected_mask],
+                        reduction='none'
+                    ) / p_mask[selected_mask]
+                    num_mask_tokens = selected_mask.sum()
+                elif getattr(self.config, 'complementary_mask', False):
+                    token_loss = self.mdm_loss_function(
+                        logits=logits[masked_indices],
+                        labels=torch.cat([labels, labels], dim=0)[masked_indices],
+                        vocab_size=self.config.vocab_size,
+                        **kwargs
+                    )
+                    num_mask_tokens = masked_indices.sum()
+                else:
+                    # Calculate token-wise cross entropy loss for masked positions in B
+                    token_loss = torch.nn.functional.cross_entropy(
+                        logits[masked_indices],
+                        labels[masked_indices],
+                        reduction='none'
+                    ) / p_mask[masked_indices]
+                    num_mask_tokens = masked_indices.sum()
+                if self.config.global_loss_avg:
+                    loss = token_loss.sum()
+                else:
+                    loss = token_loss.sum() / num_mask_tokens
+                if self.config.ada_dlm_loss_ratio is not None:
+                    assert self.current_iter_ratio is not None
+                    assert self.config.dlm_loss_weight is not None
+                    dlm_loss_weight = min(self.config.dlm_loss_weight, self.current_iter_ratio / self.config.ada_dlm_loss_ratio * self.config.dlm_loss_weight)
+                    loss = dlm_loss_weight * loss
+                elif self.config.dlm_loss_weight is not None:
+                    loss = self.config.dlm_loss_weight * loss
+                if self.config.dlm_paradigm == 'sbd_block_diff':
+                    if getattr(self.config, 'complementary_mask', False):
+                        ar_loss = self.causal_loss_function(
+                            logits=causal_logits[:logits.shape[0] // 2, :],
+                            labels=labels,
+                            vocab_size=self.config.vocab_size,
+                            **kwargs
+                        )
+                        _diff_val = loss.detach().item()
+                        _ar_val = ar_loss.detach().item()
+                        if not hasattr(self, '_loss_accum_count'):
+                            self._loss_diff_accum = 0.0
+                            self._loss_ar_accum = 0.0
+                            self._loss_accum_count = 0
+                        self._loss_diff_accum += _diff_val
+                        self._loss_ar_accum += _ar_val
+                        self._loss_accum_count += 1
+                        self.loss_diffusion = self._loss_diff_accum
+                        self.loss_ar = self._loss_ar_accum
+                        loss = loss + ar_loss
+                    else:
+                        causal_logits = causal_logits[..., :-1, :].contiguous()
+                        causal_logits = causal_logits.view(-1, causal_logits.size(-1))
+                        if hasattr(self.config, 'dlm_type') and self.config.dlm_type == 'dream':
+                            causal_labels = labels.view(-1)
+                        else:
+                            causal_labels = labels[..., 1:].contiguous().view(-1)
+                        if self.config.global_loss_avg:
+                            loss_fct = CrossEntropyLoss(reduction='sum')
+                            ar_loss = loss_fct(causal_logits, causal_labels)
+                            self.loss_diffusion = loss.detach().item() / num_mask_tokens
+                            self.loss_ar = ar_loss.detach().item() / seq_len
+                            loss = loss + self.config.ar_loss_weight * ar_loss
+                        else:
+                            loss_fct = CrossEntropyLoss()
+                            ar_loss = loss_fct(causal_logits, causal_labels)
+                            self.loss_diffusion = loss.detach().item()
+                            self.loss_ar = ar_loss.detach().item()
+                            loss = loss + self.config.ar_loss_weight * ar_loss
+                # if self.config.global_loss_avg:
+                #     if self.config.dlm_paradigm == 'sbd_block_diff':
+                #         loss = (loss, num_mask_tokens + int(self.config.ar_loss_weight * seq_len))
+                #     else:
+                #         loss = (loss, num_mask_tokens)
+        return NemotronLabsDiffusionVLMOutputWithPast(
+            loss=loss if not is_teacher else logits,
+            logits=logits,
+            causal_logits=causal_logits,
+            past_key_values=enc_out.past_key_values,
+            hidden_states=None,
+            attentions=None,
+        )
+    def generate(self, prompt_ids, max_new_tokens, steps, block_length, shift_logits, threshold,
+                 causal_context=True, temperature=0, pixel_values=None, image_sizes=None, eos_token_id=None):
+        out_ids, nfe = generate_with_prefix_cache_block_diff(
+                        model=self,
+                        prompt=prompt_ids,
+                        gen_length=max_new_tokens,
+                        steps=steps,
+                        block_length=block_length,
+                        remasking="low_confidence",
+                        temperature=temperature,
+                        mask_id=self.mask_token_id,
+                        threshold=threshold,
+                        shift_logits=shift_logits,
+                        neg_entropy=False,
+                        causal_context=causal_context,
+                        pixel_values=pixel_values,
+                        image_sizes=image_sizes,
+                        eos_token_id=eos_token_id,
+                    )
+        return out_ids, nfe
+    @torch.no_grad()
+    def sbd_inference_diffusion_quadratic(
+        self,
+        clean_input_ids: Optional[torch.Tensor],
+        draft_input_ids: torch.Tensor,
+        block_length: int,
+        draft_only: bool = False,
+        past_key_values: Optional[Cache] = None,
+        use_cache: bool = False,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        image_sizes: Optional[torch.Tensor] = None,
+    ):
+        enc_config = self.encoder.config
+        enc_config.use_sbd_objective = True
+        enc_config.block_length = block_length
+        if draft_only:
+            assert clean_input_ids is not None
+            if use_cache and past_key_values is None:
+                past_key_values = DynamicCache()
+            enc_config.self_spec_inference_mode = "default"
+            input_ids = torch.cat([clean_input_ids, draft_input_ids], dim=-1)
+            if pixel_values is not None and image_sizes is not None:
+                inputs_embeds = self._embed_with_vision(input_ids, pixel_values, image_sizes)
+                outputs = self.encoder(
+                    inputs_embeds=inputs_embeds,
+                    position_ids=None,
+                    past_key_values=past_key_values,
+                    use_cache=use_cache,
+                    is_training=False,
+                )
+            else:
+                outputs = self.encoder(
+                    input_ids=input_ids,
+                    position_ids=None,
+                    past_key_values=past_key_values,
+                    use_cache=use_cache,
+                    is_training=False,
+                )
+            hidden_states = outputs.last_hidden_state
+            logits = self.diffusion_head(hidden_states)
+            past_key_values = getattr(outputs, "past_key_values", None)
+            if use_cache and past_key_values is not None:
+                _crop_dynamic_cache(past_key_values, clean_input_ids.shape[1])
+            return logits, past_key_values
+        else:
+            enc_config.self_spec_inference_mode = "quadratic"
+            draft_len = block_length * (block_length + 1)
+            draft_input_ids = torch.cat(
+                [
+                    draft_input_ids.view(-1, block_length, 1),
+                    torch.full(
+                        (draft_input_ids.shape[0], block_length, block_length),
+                        fill_value=self.config.mask_token_id,
+                        device=draft_input_ids.device,
+                    ),
+                ],
+                dim=-1,
+            ).view(-1, draft_len)
+            if use_cache:
+                assert past_key_values is not None, (
+                    "Past key values should be provided when using cache, e.g. run draft_only=True first."
+                )
+                assert clean_input_ids is None, (
+                    "Clean input ids should already be in cache, thus none should be provided."
+                )
+                clean_len = past_key_values.get_seq_length()
+                input_ids = draft_input_ids
+            else:
+                clean_len = clean_input_ids.shape[1]
+                input_ids = torch.cat([clean_input_ids, draft_input_ids], dim=-1)
+            per_block_position_ids = torch.arange(
+                clean_len, clean_len + block_length + 1, device=draft_input_ids.device
+            )[None,].repeat(block_length, 1)
+            per_block_position_ids += torch.arange(block_length, device=draft_input_ids.device).view(-1, 1)
+            if use_cache:
+                position_ids = per_block_position_ids.view(-1)[None,]
+            else:
+                clean_position_ids = torch.arange(clean_len, device=draft_input_ids.device)
+                position_ids = torch.cat([clean_position_ids, per_block_position_ids.view(-1)], dim=-1)[None,]
+            if pixel_values is not None and image_sizes is not None and not use_cache:
+                inputs_embeds = self._embed_with_vision(input_ids, pixel_values, image_sizes)
+                outputs = self.encoder(
+                    inputs_embeds=inputs_embeds,
+                    position_ids=position_ids,
+                    past_key_values=past_key_values,
+                    use_cache=use_cache,
+                    is_training=False,
+                )
+            else:
+                outputs = self.encoder(
+                    input_ids=input_ids,
+                    position_ids=position_ids,
+                    past_key_values=past_key_values,
+                    use_cache=use_cache,
+                    is_training=False,
+                )
+            hidden_states = outputs.last_hidden_state
+            logits = self.diffusion_head(hidden_states)
+            past_key_values = getattr(outputs, "past_key_values", None)
+            if use_cache and past_key_values is not None:
+                _extract_draft_kv_cache(past_key_values, clean_len, block_length)
+            return logits, past_key_values
+    @torch.no_grad()
+    def self_spec_generate(
+        self,
+        prompt_ids: torch.Tensor,
+        max_new_tokens: int = 128,
+        steps: int = 128,
+        block_length: int = 16,
+        ar_mix_weight: Optional[float] = None,
+        temperature: float = 0.0,
+        mask_token_id: Optional[int] = None,
+        eos_token_id: Optional[int] = None,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        image_sizes: Optional[torch.Tensor] = None,
+    ):
+        self.config.use_sbd_objective = True
+        self.config.dlm_paradigm = "sbd"
+        if prompt_ids.shape[0] != 1:
+            raise ValueError("Self speculation quadratic decoding currently requires batch_size == 1")
+        token_mask_id = mask_token_id if mask_token_id is not None else self.config.mask_token_id
+        if eos_token_id is None:
+            eos_token_id = getattr(self.config, "eos_token_id", None)
+        x = torch.full(
+            (1, prompt_ids.shape[1] + max_new_tokens + block_length * 2),
+            token_mask_id,
+            dtype=torch.long,
+            device=prompt_ids.device,
+        )
+        x[:, : prompt_ids.shape[1]] = prompt_ids.clone()
+        if max_new_tokens % block_length != 0:
+            raise ValueError("max_new_tokens must be divisible by block_length")
+        num_blocks = max_new_tokens // block_length
+        if steps % num_blocks != 0:
+            raise ValueError("steps must be divisible by (max_new_tokens // block_length)")
+        prompt_len = prompt_ids.shape[1]
+        nfe = 1  # the prefill/draft pass below counts as one forward evaluation
+        logits, past_key_values = self.sbd_inference_diffusion_quadratic(
+            clean_input_ids=x[:, :prompt_len],
+            draft_input_ids=x[:, prompt_len : prompt_len + block_length],
+            block_length=block_length,
+            draft_only=True,
+            use_cache=True,
+            pixel_values=pixel_values,
+            image_sizes=image_sizes,
+        )
+        logits_proposal = logits[:, prompt_len - 1 : prompt_len + block_length]
+        logits_proposal[:, 1] = logits_proposal[:, 0]
+        logits_proposal = logits_proposal[:, 1:]
+        x0_proposal = torch.argmax(logits_proposal, dim=-1)
+        x[:, prompt_len : prompt_len + block_length] = x0_proposal
+        total_accept_token = 0
+        while True:
+            nfe += 1
+            block_start = prompt_len + total_accept_token
+            block_end = block_start + block_length
+            draft_input_ids = x[:, block_start:block_end]
+            logits, past_key_values = self.sbd_inference_diffusion_quadratic(
+                clean_input_ids=None,
+                draft_input_ids=draft_input_ids,
+                block_length=block_length,
+                draft_only=False,
+                past_key_values=past_key_values,
+                use_cache=True,
+                pixel_values=pixel_values,
+                image_sizes=image_sizes,
+            )
+            useful_token_logits = logits.view(1, block_length, block_length + 1, -1)
+            if ar_mix_weight is None:
+                useful_token_logits[:, :, 1] = useful_token_logits[:, :, 0]
+            else:
+                if not (0.0 <= ar_mix_weight <= 1.0):
+                    raise ValueError("ar_mix_weight must be between 0 and 1")
+                mix_logits = useful_token_logits[:, :, 0] * ar_mix_weight + useful_token_logits[:, :, 1] * (1 - ar_mix_weight)
+                useful_token_logits[:, :, 0] = mix_logits
+                useful_token_logits[:, :, 1] = mix_logits
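+            if temperature > 0:
+                # NOTE: scaling by a positive temperature does not change the argmax below,
+                # so decoding stays greedy regardless of `temperature`.
+                useful_token_logits = useful_token_logits / temperature
+            useful_token_pred = torch.argmax(useful_token_logits, dim=-1)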
+            new_draft_input_ids = useful_token_pred[:, 0, 1:]
+            accept_cnt = 1
+            while accept_cnt < block_length:
+                if useful_token_pred[:, accept_cnt - 1, 0].item() != draft_input_ids[:, accept_cnt].item():
+                    break
+                new_draft_input_ids = useful_token_pred[:, accept_cnt, 1:]
+                accept_cnt += 1
+            x[:, block_start : block_start + accept_cnt] = draft_input_ids[:, :accept_cnt]
+            # EoS early stopping
+            if eos_token_id is not None:
+                accepted = x[0, block_start : block_start + accept_cnt]
+                eos_positions = (accepted == eos_token_id).nonzero(as_tuple=True)[0]
+                if len(eos_positions) > 0:
+                    first_eos_rel = eos_positions[0].item()
+                    total_accept_token += first_eos_rel + 1
+                    output_end = prompt_len + total_accept_token
+                    return x[:, :output_end], nfe
+            x[:, block_start + accept_cnt : block_start + accept_cnt + block_length] = new_draft_input_ids
+            _crop_dynamic_cache(past_key_values, block_start + accept_cnt)
+            total_accept_token += accept_cnt
+            if total_accept_token >= max_new_tokens:
+                break
+        return x[:, : -(block_length * 2)], nfe
+__all__ = ["NemotronLabsDiffusionVLMModel", "NemotronLabsDiffusionVLMFlexAttention"]
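
The draft-and-verify bookkeeping in `self_spec_generate` and `sbd_inference_diffusion_quadratic` is easier to follow on plain lists. The sketch below is an illustration, not part of the commit: `accept_draft` and `draft_position_ids` are hypothetical helper names, and treating `pred[i][0]` as the verifier's prediction for the token after `draft[i]` is one reading of the loop above.

```python
from typing import List, Tuple


def draft_position_ids(clean_len: int, block_length: int) -> List[List[int]]:
    """Position ids for the quadratic draft layout (hypothetical helper).

    Row i holds one drafted token followed by block_length mask slots, and
    starts at position clean_len + i, matching per_block_position_ids above.
    """
    return [
        [clean_len + i + j for j in range(block_length + 1)]
        for i in range(block_length)
    ]


def accept_draft(draft: List[int], pred: List[List[int]]) -> Tuple[int, List[int]]:
    """Mirror the acceptance loop of self_spec_generate on plain lists.

    draft: block_length drafted token ids.
    pred:  block_length rows of block_length + 1 argmax predictions.
           pred[i][0] is read here as the prediction for the token after
           draft[i]; pred[i][1:] is the next draft if i + 1 tokens are kept.
    Returns (number of accepted tokens, next draft block).
    """
    block_length = len(draft)
    assert len(pred) == block_length
    assert all(len(row) == block_length + 1 for row in pred)

    accept_cnt = 1            # the first drafted token is always kept
    new_draft = pred[0][1:]
    while accept_cnt < block_length:
        # Stop at the first drafted token the verifier disagrees with.
        if pred[accept_cnt - 1][0] != draft[accept_cnt]:
            break
        new_draft = pred[accept_cnt][1:]
        accept_cnt += 1
    return accept_cnt, new_draft


if __name__ == "__main__":
    draft = [5, 6, 7]
    pred = [[6, 6, 8, 9], [9, 9, 1, 2], [3, 3, 4, 5]]
    print(accept_draft(draft, pred))
    print(draft_position_ids(clean_len=10, block_length=2))
```

Each verification pass therefore keeps at least one token and at most `block_length`, which is why the forward-pass count `nfe` can be well below `max_new_tokens`.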

special_tokens_map.json ADDED

	@@ -0,0 +1,33 @@

+{
+  "additional_special_tokens": [
+    "|<MASK>|"
+  ],
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
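
A quick way to inspect a special-token map like this one is to flatten it into role/string pairs. The sketch below is a minimal example using only the standard library; `SPECIAL_TOKENS_JSON` is an abbreviated inline copy of the file above (the `lstrip`/`rstrip`/`normalized` flags are omitted), and `summarize_special_tokens` is a hypothetical helper name. The file does not say which id the modeling code's `mask_token_id` refers to, so any link between that id and `|<MASK>|` is an assumption.

```python
import json

# Abbreviated inline copy of special_tokens_map.json (flags omitted).
SPECIAL_TOKENS_JSON = """
{
  "additional_special_tokens": ["|<MASK>|"],
  "bos_token": {"content": "<s>"},
  "eos_token": {"content": "<|im_end|>"},
  "pad_token": {"content": "<|im_end|>"},
  "unk_token": {"content": "<unk>"}
}
"""


def token_content(entry):
    """Entries are either plain strings or dicts with a "content" key."""
    return entry["content"] if isinstance(entry, dict) else entry


def summarize_special_tokens(raw: str) -> dict:
    """Flatten the map into {role: token string} plus the extra tokens."""
    data = json.loads(raw)
    summary = {
        key: token_content(value)
        for key, value in data.items()
        if key.endswith("_token")
    }
    summary["additional"] = [
        token_content(tok) for tok in data.get("additional_special_tokens", [])
    ]
    return summary


summary = summarize_special_tokens(SPECIAL_TOKENS_JSON)
shares_pad_and_eos = summary["pad_token"] == summary["eos_token"]

if __name__ == "__main__":
    print(summary)
    print("pad token equals eos token:", shares_pad_and_eos)
```

Because the pad and EOS roles share one string here, code that filters padding by token id alone would also drop genuine end-of-sequence tokens. Whether that matters depends on training and batching code not shown in this diff.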

tokenization_nemotron_labs_diffusion_vlm.py ADDED

	@@ -0,0 +1,46 @@

+"""
+Custom tokenizer for Nemotron-Diffusion-Exp-Ministral-8B-Instruct (final-template).
+Extends PreTrainedTokenizerFast with a `process_messages` method that
+handles image token expansion and pixel value preprocessing, analogous
+to MistralCommonBackend.apply_chat_template(return_dict=True).
+Usage:
+    tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
+    result = tokenizer.process_messages(messages)
+    # result["input_ids"]     – (1, seq_len) with expanded image tokens
+    # result["pixel_values"]  – (N, 3, H, W)  if images present
+    # result["image_sizes"]   – list of (H, W) tuples
+"""
+from typing import Any, Dict, List
+from transformers import PreTrainedTokenizerFast
+from .image_processing import process_messages as _process_messages
+class NemotronLabsDiffusionVLMTokenizerFast(PreTrainedTokenizerFast):
+    """PreTrainedTokenizerFast + image-aware process_messages()."""
+    def process_messages(
+        self,
+        messages: List[Dict[str, Any]],
+        **kwargs,
+    ) -> Dict[str, Any]:
+        """
+        Process chat messages with optional images.
+        Renders the chat template, expands image placeholders based on
+        actual image dimensions, preprocesses pixel values, and tokenizes.
+        Args:
+            messages: OpenAI-style list of message dicts.
+            **kwargs: forwarded to image_processing.process_messages
+                      (patch_size, spatial_merge_size, max_image_size,
+                       return_tensors, enable_thinking).
+        Returns:
+            dict with input_ids, and optionally pixel_values + image_sizes.
+        """
+        return _process_messages(self, messages, **kwargs)
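
The docstring's usage can be combined with `self_spec_generate` from the modeling file roughly as follows. This is a sketch under several assumptions, not a verified recipe: the content-part keys in `build_messages` follow the OpenAI chat format (the exact keys that `image_processing.process_messages` accepts are not shown in this diff), `process_messages` is assumed to return tensors by default, `AutoModel` is assumed to return the class that defines `self_spec_generate`, and device placement is omitted. `build_messages`, `check_generation_args`, and `run_demo` are hypothetical names.

```python
from typing import Any, Dict, List


def build_messages(image_url: str, question: str) -> List[Dict[str, Any]]:
    """One user turn with an image and a question.

    ASSUMPTION: OpenAI-style content parts; the keys the tokenizer's
    process_messages actually expects are not visible in this diff.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }
    ]


def check_generation_args(max_new_tokens: int, steps: int, block_length: int) -> int:
    """Mirror the divisibility checks at the top of self_spec_generate."""
    if max_new_tokens % block_length != 0:
        raise ValueError("max_new_tokens must be divisible by block_length")
    num_blocks = max_new_tokens // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be divisible by (max_new_tokens // block_length)")
    return num_blocks


def run_demo(repo: str, image_url: str, question: str):
    """Downloads the model; not exercised by the checks in this sketch."""
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    model = AutoModel.from_pretrained(repo, trust_remote_code=True, dtype="auto")

    result = tokenizer.process_messages(build_messages(image_url, question))
    prompt_ids = result["input_ids"]  # assumed to be a (1, seq_len) tensor

    check_generation_args(max_new_tokens=128, steps=128, block_length=16)
    out_ids, nfe = model.self_spec_generate(
        prompt_ids=prompt_ids,
        max_new_tokens=128,
        steps=128,
        block_length=16,
        eos_token_id=tokenizer.eos_token_id,
        pixel_values=result.get("pixel_values"),
        image_sizes=result.get("image_sizes"),
    )
    new_tokens = out_ids[0, prompt_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True), nfe
```

Validating the arguments up front surfaces the divisibility constraints before any weights are loaded; `self_spec_generate` also requires a batch size of 1.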

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2e04613060199ab156b35bf2334e381748ae41311e8785efb330bc66e16670d8
+size 17077689

tokenizer_config.json ADDED

The diff for this file is too large to render.