MingxuChai committed on
Commit
ed897f6
·
verified ·
1 Parent(s): c4411ba

Add files using upload-large-folder tool

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
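The added line routes `tokenizer.json` through Git LFS. As a rough illustration of how these attribute lines select files, here is a minimal Python sketch. It assumes simple base-name matching with `fnmatch`, which only approximates Git's real pattern rules (for example, it does not handle `**` or directory-anchored patterns), and the helper names `lfs_patterns` and `is_lfs_tracked` are hypothetical.

```python
from fnmatch import fnmatch

# The three context lines plus the newly added line from this hunk.
GITATTRIBUTES = """\
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
"""


def lfs_patterns(text):
    """Collect the patterns whose attribute list contains filter=lfs."""
    patterns = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def is_lfs_tracked(path, patterns):
    """Approximate check: match the file's base name against each pattern."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(name, pattern) for pattern in patterns)


patterns = lfs_patterns(GITATTRIBUTES)
print(is_lfs_tracked("tokenizer.json", patterns))
print(is_lfs_tracked("config.json", patterns))
```

For an authoritative answer on a real checkout, `git check-attr filter -- <path>` reports what Git itself resolves.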
README.md CHANGED
@@ -1,3 +1,84 @@
  ---
  license: apache-2.0
+ tags:
+ - diffusion
+ - vision-language
+ - qwen2.5-vl
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  ---
+
+
+ <div align="center">
+
+ <h1>DiffusionVL: Translating Any Autoregressive Models into <br> Diffusion Vision Language Models</h1>
+
+ **_SOTA dVLM Performance with <5% Data & 2.0× Inference Speedup!_**
+
+ [Lunbin Zeng](https://github.com/xiazhi1)<sup>1,\*</sup>, [Jingfeng Yao](https://github.com/JingfengYao)<sup>1,\*</sup>, [Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>, [Hongyuan Tao](https://github.com/Hongyuan-Tao)<sup>1</sup>, [Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>, [Xinggang Wang](https://xwcv.github.io)<sup>1, ✉️</sup>
+
+ <sup>1</sup>Huazhong University of Science and Technology
+
+ <sup>*</sup>equal contribution, <sup>✉️</sup>corresponding author, xgwang@hust.edu.cn
+
+ [![arXiv](https://img.shields.io/badge/arXiv-DiffusionVL-b31b1b.svg)](https://arxiv.org/abs/2512.15713) [![Hugging Face Paper](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Paper-red)](https://huggingface.co/papers/2512.15713) <a href="https://github.com/hustvl/DiffusionVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a> <a href="https://huggingface.co/collections/hustvl/diffusionvl"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face"></a>
+
+ </div>
+
+ ## 📰 News
+
+ - **[2025.12.25]** 🎄 We have completed our release plan ahead of schedule. **DiffusionVL is now fully open-sourced.** Merry Christmas to the community!
+ - **[2025.12.18]** 🎉 Our paper **DiffusionVL** is released on arXiv! We also release the DiffusionVL models translated from Qwen2.5VL on Hugging Face.
+
+ ## 🚀 Release Plan
+ - [x] Release paper
+ - [x] Release DiffusionVL model weights (translated from AR-VLMs)
+ - [x] Release DiffusionVL model weights (translated from AR-LMs)
+ - [x] Release evaluation code
+ - [x] Release training code
+
+ ## 📄 Introduction
+
+ The diffusion paradigm has emerged as a promising alternative to autoregressive (AR) models, offering the potential for efficient parallel decoding. However, existing diffusion vision language models (dVLMs) largely lag behind mainstream autoregressive vision language models in performance, primarily due to the capability limitations of their base diffusion language models.
+
+ DiffusionVL bridges this gap by answering a fundamental question: ***Can we directly translate any existing autoregressive models into powerful diffusion vision language models?*** We propose a diffusion finetuning framework that "translates" any pretrained AR model into a diffusion vision language model through a simple paradigm shift and modality shift. Unlike prior dVLMs restricted by fixed generation lengths, DiffusionVL introduces a novel block decoding strategy. This allows for arbitrary-length generation and KV-cache reuse. With this integrated design, despite training with less than 5% of the training data required by previous methods, DiffusionVL translated from AR-VLMs achieves state-of-the-art performance among existing dVLMs and delivers a 2.0× inference speedup.
+
+ ## ✨ Highlights
+
+ - **Universal Translation Framework:** Translate any AR model into a dVLM with a simple yet effective approach.
+
+ - **Superior Performance:** Achieve SOTA dVLM performance using <5% training data (738K vs 16.5M samples).
+
+ - **2.0× Faster Inference:** Block decoding strategy enables KV-cache reuse and 2.0× speedup over previous dVLMs.
+
+ <div align="center">
+ <img src="https://github.com/hustvl/DiffusionVL/raw/main/assets/benchmark.png" alt="Benchmark Image" width="800">
+ <img src="https://github.com/hustvl/DiffusionVL/raw/main/assets/framework.png" alt="Framework" width="800">
+ </div>
+
+ ## 🚀 Get Started
+
+ | Document | Description |
+ | :--- | :--- |
+ | [Installation](https://github.com/hustvl/DiffusionVL/raw/main/docs/INSTALLATION.md) | Environment setup, data and model preparation |
+ | [Training & Evaluation](https://github.com/hustvl/DiffusionVL/raw/main/docs/TRAINING_EVALUATION.md) | Train and evaluate DiffusionVL models |
+ | [Inference](https://github.com/hustvl/DiffusionVL/raw/main/docs/INFERENCE.md) | Quick inference with pre-trained models |
+
+
+ ## ❤️ Acknowledgements
+
+ This repo is mainly built on [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL), [LLaDA-V](https://github.com/ML-GSAI/LLaDA-V), [BD3LMs](https://github.com/kuleshov-group/bd3lms), [SDAR](https://github.com/JetAstra/SDAR), and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval). We thank the authors for their open-source contributions.
+
+ ## 📝 Citation
+ If you find our work useful, please cite our paper:
+ ```
+ @misc{zeng2025diffusionvltranslatingautoregressivemodels,
+ title={DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models},
+ author={Lunbin Zeng and Jingfeng Yao and Bencheng Liao and Hongyuan Tao and Wenyu Liu and Xinggang Wang},
+ year={2025},
+ eprint={2512.15713},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV},
+ url={https://arxiv.org/abs/2512.15713},
+ }
+ ```
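The README describes block decoding only at a high level (arbitrary-length generation, block by block, with KV-cache reuse). The toy Python sketch below illustrates that general idea and is not the authors' decoder: the `toy_denoiser` stand-in, the confidence-based unmasking rule, the `MASK`/`EOS` ids, and all function names are assumptions made for illustration.

```python
import random

MASK = -1  # hypothetical mask id for this toy; the real config stores mask_token_id
EOS = 0    # hypothetical end-of-sequence id


def toy_denoiser(context, block, rng):
    """Stand-in for a model call: propose a (token, confidence) pair per position."""
    return [(rng.randint(0, 9), rng.random()) for _ in block]


def decode_block(context, block_size, steps, rng):
    """Denoise one block: start fully masked, then unmask the most
    confident masked positions at each step until none remain."""
    block = [MASK] * block_size
    per_step = -(-block_size // steps)  # ceiling division
    while MASK in block:
        proposals = toy_denoiser(context, block, rng)
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = proposals[i][0]
    return block


def generate(prompt, block_size=4, steps=2, max_blocks=8, seed=0):
    """Append finished blocks to the context until EOS or the block budget.

    In a real model the finished context would be served from a KV cache,
    so only the current block is recomputed at each denoising step.
    """
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_blocks):
        block = decode_block(context, block_size, steps, rng)
        if EOS in block:
            context.extend(block[: block.index(EOS) + 1])
            break
        context.extend(block)
    return context


tokens = generate([5, 6, 7])
print(len(tokens), tokens)
```

Because each block is finalized before the next one starts, the output length is not fixed in advance, which is the property the README contrasts with fixed-length dVLMs.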
added_tokens.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "</tool_call>": 151658,
+   "<tool_call>": 151657,
+   "<|box_end|>": 151649,
+   "<|box_start|>": 151648,
+   "<|endoftext|>": 151643,
+   "<|file_sep|>": 151664,
+   "<|fim_middle|>": 151660,
+   "<|fim_pad|>": 151662,
+   "<|fim_prefix|>": 151659,
+   "<|fim_suffix|>": 151661,
+   "<|im_end|>": 151645,
+   "<|im_start|>": 151644,
+   "<|image_pad|>": 151655,
+   "<|object_ref_end|>": 151647,
+   "<|object_ref_start|>": 151646,
+   "<|quad_end|>": 151651,
+   "<|quad_start|>": 151650,
+   "<|repo_name|>": 151663,
+   "<|video_pad|>": 151656,
+   "<|vision_end|>": 151653,
+   "<|vision_pad|>": 151654,
+   "<|vision_start|>": 151652,
+   "<nl>": 151672,
+   "<fcel>": 151673,
+   "<ecel>": 151674,
+   "<lcel>": 151675,
+   "<ucel>": 151676,
+   "<xcel>": 151677,
+   "<ched>": 151678
+ }
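A quick sanity check on this mapping is to look for duplicate ids and for holes in the id range. The Python sketch below copies the 29 entries from the hunk above and reports the ids that are skipped. One observation it surfaces: `mask_token_id` is 151671 in `config.json`, and that id is not listed here, so the mask token is presumably defined elsewhere (for example in `tokenizer.json`); that is an assumption, not something this diff confirms.

```python
# Entries copied from added_tokens.json in this commit.
ADDED_TOKENS = {
    "</tool_call>": 151658, "<tool_call>": 151657,
    "<|box_end|>": 151649, "<|box_start|>": 151648,
    "<|endoftext|>": 151643, "<|file_sep|>": 151664,
    "<|fim_middle|>": 151660, "<|fim_pad|>": 151662,
    "<|fim_prefix|>": 151659, "<|fim_suffix|>": 151661,
    "<|im_end|>": 151645, "<|im_start|>": 151644,
    "<|image_pad|>": 151655, "<|object_ref_end|>": 151647,
    "<|object_ref_start|>": 151646, "<|quad_end|>": 151651,
    "<|quad_start|>": 151650, "<|repo_name|>": 151663,
    "<|video_pad|>": 151656, "<|vision_end|>": 151653,
    "<|vision_pad|>": 151654, "<|vision_start|>": 151652,
    "<nl>": 151672, "<fcel>": 151673, "<ecel>": 151674,
    "<lcel>": 151675, "<ucel>": 151676, "<xcel>": 151677,
    "<ched>": 151678,
}

MASK_TOKEN_ID = 151671  # value of "mask_token_id" in config.json

ids = sorted(ADDED_TOKENS.values())
has_duplicates = len(ids) != len(set(ids))

# Ids between the smallest and largest entry that no token claims.
missing_ids = [i for i in range(ids[0], ids[-1] + 1) if i not in set(ids)]

print("duplicates:", has_duplicates)
print("unlisted ids:", missing_ids)
print("mask id listed:", MASK_TOKEN_ID in ids)
```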
chat_template.jinja ADDED
@@ -0,0 +1,7 @@
+ {% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system
+ You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
+ {% endif %}<|im_start|>{{ message['role'] }}
+ {% if message['content'] is string %}{{ message['content'] }}<|im_end|>
+ {% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}<image>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}<video>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>
+ {% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
+ {% endif %}
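To make the template's behavior easier to follow, the sketch below mirrors its logic in plain Python as I read it: a default system turn is injected when the first message is not a system message, image and video parts become `<image>` and `<video>` placeholders, and an assistant header is appended on request. The function name `render_chat` is hypothetical, and the exact newline handling of the real template depends on the Jinja settings used to render it.

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chat(messages, add_generation_prompt=False):
    """Plain-Python mirror of the chat template (a sketch, not the Jinja renderer)."""
    out = []
    for index, message in enumerate(messages):
        # Inject the default system prompt if the conversation lacks one.
        if index == 0 and message["role"] != "system":
            out.append(f"<|im_start|>system\n{DEFAULT_SYSTEM}<|im_end|>\n")
        out.append(f"<|im_start|>{message['role']}\n")
        content = message["content"]
        if isinstance(content, str):
            out.append(content)
        else:
            for part in content:
                if part.get("type") == "image" or "image" in part or "image_url" in part:
                    out.append("<image>")
                elif part.get("type") == "video" or "video" in part:
                    out.append("<video>")
                elif "text" in part:
                    out.append(part["text"])
        out.append("<|im_end|>\n")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n")
    return "".join(out)


messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}
]
print(render_chat(messages, add_generation_prompt=True))
```

The template also increments `image_count` and `video_count`, but those counters never appear in the rendered text, so the sketch omits them.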
config.json ADDED
@@ -0,0 +1,242 @@
+ {
+   "add_faster_video": false,
+   "add_time_instruction": false,
+   "anneal_start_block_size": 1,
+   "architectures": [
+     "DiffusionVL_Qwen2_5_VL_ForConditionalGeneration"
+   ],
+   "attention_dropout": 0.0,
+   "bd3lm_antithetic_sampling": true,
+   "bd3lm_attn_backend": "sdpa",
+   "bd3lm_block_aligned_eos": true,
+   "bd3lm_block_size": 64,
+   "bd3lm_complementary_mask": false,
+   "bd3lm_cross_attn": true,
+   "bd3lm_ignore_bos": true,
+   "bd3lm_mask_prob": 0.5,
+   "bd3lm_noise_granularity": "block",
+   "bd3lm_noise_type": "loglinear",
+   "bd3lm_parameterization": "subs",
+   "bd3lm_resample": false,
+   "bd3lm_sampling_eps_max": 1.0,
+   "bd3lm_sampling_eps_min": 0.001,
+   "bd3lm_time_conditioning": false,
+   "bd3lm_token_shift_prediction": false,
+   "bd3lm_var_min": true,
+   "bos_token_id": 151643,
+   "enable_bd3lm": true,
+   "enable_block_size_annealing": false,
+   "enable_mtd": false,
+   "enable_noise_level_annealing": false,
+   "eos_token_id": 151645,
+   "faster_token_stride": 10,
+   "force_sample": false,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "image_aspect_ratio": "pad",
+   "image_crop_resolution": null,
+   "image_grid_pinpoints": null,
+   "image_split_resolution": null,
+   "image_token_id": null,
+   "initializer_range": 0.02,
+   "intermediate_size": 11008,
+   "layer_types": [
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention"
+   ],
+   "max_pixels": 262144,
+   "max_position_embeddings": 128000,
+   "max_window_layers": 70,
+   "min_pixels": 147456,
+   "mm_hidden_size": 1280,
+   "mm_newline_position": "grid",
+   "mm_patch_merge_type": "flat",
+   "mm_projector_lr": null,
+   "mm_projector_type": "qwen_merger",
+   "mm_resampler_type": null,
+   "mm_spatial_pool_mode": "bilinear",
+   "mm_spatial_pool_stride": null,
+   "mm_tunable_parts": "mm_vision_tower,mm_mlp_adapter,mm_language_model",
+   "mm_use_im_patch_token": false,
+   "mm_use_im_start_end": false,
+   "mm_vision_select_feature": "patch",
+   "mm_vision_select_layer": -2,
+   "mm_vision_tower": "/inspire/hdd/global_user/chaimingxu-240108540141/models/DiffusionVL-Qwen2.5VL-3B",
+   "mm_vision_tower_lr": 2e-06,
+   "model_max_length": 8192,
+   "model_type": "diffusionvl_qwenvl",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 36,
+   "num_key_value_heads": 2,
+   "pos_skipping_range": 4096,
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": {
+     "mrope_section": [
+       16,
+       24,
+       24
+     ],
+     "rope_type": "default",
+     "type": "default"
+   },
+   "rope_theta": 1000000.0,
+   "sliding_window": null,
+   "text_config": {
+     "architectures": [
+       "Qwen2_5_VLForConditionalGeneration"
+     ],
+     "attention_dropout": 0.0,
+     "bos_token_id": 151643,
+     "eos_token_id": 151645,
+     "hidden_act": "silu",
+     "hidden_size": 2048,
+     "image_token_id": null,
+     "initializer_range": 0.02,
+     "intermediate_size": 11008,
+     "layer_types": [
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention"
+     ],
+     "max_position_embeddings": 128000,
+     "max_window_layers": 70,
+     "model_type": "qwen2_5_vl_text",
+     "num_attention_heads": 16,
+     "num_hidden_layers": 36,
+     "num_key_value_heads": 2,
+     "rms_norm_eps": 1e-06,
+     "rope_scaling": {
+       "mrope_section": [
+         16,
+         24,
+         24
+       ],
+       "rope_type": "default",
+       "type": "default"
+     },
+     "rope_theta": 1000000.0,
+     "sliding_window": null,
+     "tie_word_embeddings": true,
+     "torch_dtype": "float32",
+     "use_cache": true,
+     "use_sliding_window": false,
+     "video_token_id": null,
+     "vision_end_token_id": 151653,
+     "vision_start_token_id": 151652,
+     "vision_token_id": 151654,
+     "vocab_size": 151936
+   },
+   "tie_word_embeddings": true,
+   "tokenizer_model_max_length": 8192,
+   "tokenizer_padding_side": "right",
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.55.0",
+   "use_cache": true,
+   "use_mm_proj": true,
+   "use_pos_skipping": false,
+   "use_sliding_window": false,
+   "video_token_id": null,
+   "vision_config": {
+     "depth": 32,
+     "fullatt_block_indexes": [
+       7,
+       15,
+       23,
+       31
+     ],
+     "hidden_act": "silu",
+     "hidden_size": 1280,
+     "in_channels": 3,
+     "in_chans": 3,
+     "initializer_range": 0.02,
+     "intermediate_size": 3420,
+     "model_type": "",
+     "num_heads": 16,
+     "out_hidden_size": 2048,
+     "patch_size": 14,
+     "spatial_merge_size": 2,
+     "spatial_patch_size": 14,
+     "temporal_patch_size": 2,
+     "tokens_per_second": 2,
+     "torch_dtype": "float32",
+     "window_size": 112
+   },
+   "vision_end_token_id": 151653,
+   "vision_start_token_id": 151652,
+   "vision_token_id": 151654,
+   "vision_tower_pretrained": null,
+   "vocab_size": 151936,
+   "mask_token_id": 151671,
+   "auto_map": {
+     "AutoConfig": "configuration_diffusionvl_qwen2_5_vl.DiffusionVL_Qwen2_5_VL_Config",
+     "AutoModelForCausalLM": "modeling_diffusionvl_qwen2_5_vl.DiffusionVL_Qwen2_5_VL_ForConditionalGeneration",
+     "AutoProcessor": "processing_diffusionvl_qwen2_5_vl.DiffusionVL_Qwen2_5_VL_Processor"
+   }
+ }
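Several quantities follow directly from the values in this `config.json`. The Python sketch below derives a few of them from a hand-copied subset of the fields. The head-dimension and grouped-query arithmetic is standard; the visual-token estimate assumes that each merged visual token covers a `(patch_size * spatial_merge_size)` pixel square, as in Qwen2.5-VL-style preprocessing, and that assumption is mine rather than something this diff states.

```python
# Subset of fields copied from config.json in this commit.
config = {
    "hidden_size": 2048,
    "num_attention_heads": 16,
    "num_key_value_heads": 2,
    "mrope_section": [16, 24, 24],
    "patch_size": 14,
    "spatial_merge_size": 2,
    "min_pixels": 147456,
    "max_pixels": 262144,
}

# Per-head width of the attention projections.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Grouped Query Attention: how many query heads share one key/value head.
queries_per_kv_head = config["num_attention_heads"] // config["num_key_value_heads"]

# The multimodal RoPE sections partition the rotary frequency pairs.
rope_pairs = sum(config["mrope_section"])

# Assumed pixel footprint of one merged visual token (see lead-in).
pixels_per_token = (config["patch_size"] * config["spatial_merge_size"]) ** 2
min_visual_tokens = config["min_pixels"] // pixels_per_token
max_visual_tokens = config["max_pixels"] // pixels_per_token

print(head_dim, queries_per_kv_head, rope_pairs)  # → 128 8 64
print(pixels_per_token, min_visual_tokens, max_visual_tokens)  # → 784 188 334
```

The three `mrope_section` entries sum to 64, which is half of the 128-wide head dimension, consistent with rotary embeddings acting on pairs of channels.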
configuration_diffusionvl_qwen2_5_vl.py ADDED
@@ -0,0 +1,284 @@
+ # coding=utf-8
+ # Copyright 2025 The HustVL Team and The HuggingFace Inc. team. All rights reserved.
+ #
+ # This code is based on Qwen2.5-VL, which is derived from EleutherAI's GPT-NeoX library
+ # and the GPT-NeoX and OPT implementations. It has been modified to create DiffusionVL.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ """DiffusionVL (Qwen2.5-VL based) model configuration."""
+
+ from typing import List, Optional, Union
+
+ from transformers.configuration_utils import PretrainedConfig
+
+
+ class DiffusionVL_Qwen2_5_VL_VisionConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`DiffusionVL_Qwen2_5_VL_VisionModel`].
+     It is used to instantiate the vision encoder according to the specified arguments.
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs.
+     Read the documentation from [`PretrainedConfig`] for more information.
+
+     Args:
+         depth (`int`, *optional*, defaults to 32):
+             Number of vision transformer layers.
+         hidden_size (`int`, *optional*, defaults to 1280):
+             Dimensionality of the encoder layers and the pooler layer.
+         hidden_act (`str`, *optional*, defaults to `"silu"`):
+             The non-linear activation function in the encoder.
+         intermediate_size (`int`, *optional*, defaults to 3420):
+             Dimensionality of the "intermediate" (i.e., feed-forward) layer.
+         num_heads (`int`, *optional*, defaults to 16):
+             Number of attention heads for each attention layer.
+         in_channels (`int`, *optional*, defaults to 3):
+             Number of input channels.
+         patch_size (`int`, *optional*, defaults to 14):
+             The size of each image patch.
+         spatial_merge_size (`int`, *optional*, defaults to 2):
+             The spatial merge size for patch merging.
+         temporal_patch_size (`int`, *optional*, defaults to 2):
+             The temporal patch size for video processing.
+         tokens_per_second (`int`, *optional*, defaults to 4):
+             Number of tokens per second for video processing.
+         window_size (`int`, *optional*, defaults to 112):
+             Window size for windowed attention.
+         out_hidden_size (`int`, *optional*, defaults to 3584):
+             Output hidden size after the vision encoder.
+         fullatt_block_indexes (`List[int]`, *optional*):
+             Indices of blocks that use full attention instead of windowed attention.
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing weight matrices.
+
+     Example:
+
+     ```python
+     >>> from configuration_diffusionvl_qwen2_5_vl import DiffusionVL_Qwen2_5_VL_VisionConfig
+
+     >>> # Initializing a DiffusionVL vision configuration
+     >>> configuration = DiffusionVL_Qwen2_5_VL_VisionConfig()
+     ```
+     """
+
+     model_type = "diffusionvl_qwen2_5_vl_vision"
+     base_config_key = "vision_config"
+
+     def __init__(
+         self,
+         depth: int = 32,
+         hidden_size: int = 1280,
+         hidden_act: str = "silu",
+         intermediate_size: int = 3420,
+         num_heads: int = 16,
+         in_channels: int = 3,
+         patch_size: int = 14,
+         spatial_merge_size: int = 2,
+         temporal_patch_size: int = 2,
+         tokens_per_second: int = 4,
+         window_size: int = 112,
+         out_hidden_size: int = 3584,
+         fullatt_block_indexes: Optional[List[int]] = None,
+         initializer_range: float = 0.02,
+         **kwargs,
+     ):
+         super().__init__(**kwargs)
+
+         self.depth = depth
+         self.hidden_size = hidden_size
+         self.hidden_act = hidden_act
+         self.intermediate_size = intermediate_size
+         self.num_heads = num_heads
+         self.in_channels = in_channels
+         self.patch_size = patch_size
+         self.spatial_merge_size = spatial_merge_size
+         self.temporal_patch_size = temporal_patch_size
+         self.tokens_per_second = tokens_per_second
+         self.window_size = window_size
+         self.out_hidden_size = out_hidden_size
+         self.fullatt_block_indexes = fullatt_block_indexes or [7, 15, 23, 31]
+         self.initializer_range = initializer_range
+
+
+ class DiffusionVL_Qwen2_5_VL_Config(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`DiffusionVL_Qwen2_5_VL_ForConditionalGeneration`].
+     It is used to instantiate a DiffusionVL model according to the specified arguments.
+
+     DiffusionVL extends Qwen2.5-VL architecture with BD3LM (Block Diffusion Language Model)
+     for diffusion-based text generation instead of autoregressive decoding.
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs.
+     Read the documentation from [`PretrainedConfig`] for more information.
+
+     Args:
+         vocab_size (`int`, *optional*, defaults to 152064):
+             Vocabulary size of the DiffusionVL model.
+         hidden_size (`int`, *optional*, defaults to 3584):
+             Dimension of the hidden representations.
+         intermediate_size (`int`, *optional*, defaults to 18944):
+             Dimension of the MLP representations.
+         num_hidden_layers (`int`, *optional*, defaults to 28):
+             Number of hidden layers in the Transformer encoder.
+         num_attention_heads (`int`, *optional*, defaults to 28):
+             Number of attention heads for each attention layer.
+         num_key_value_heads (`int`, *optional*, defaults to 4):
+             Number of key-value heads for Grouped Query Attention (GQA).
+         hidden_act (`str`, *optional*, defaults to `"silu"`):
+             The non-linear activation function in the decoder.
+         max_position_embeddings (`int`, *optional*, defaults to 128000):
+             The maximum sequence length that this model might ever be used with.
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing weight matrices.
+         rms_norm_eps (`float`, *optional*, defaults to 1e-6):
+             The epsilon used by the RMS normalization layers.
+         use_cache (`bool`, *optional*, defaults to `True`):
+             Whether to use the past key/values attentions.
+         tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+             Whether the model's input and output word embeddings should be tied.
+         attention_dropout (`float`, *optional*, defaults to 0.0):
+             The dropout ratio for the attention probabilities.
+         vision_config (`DiffusionVL_Qwen2_5_VL_VisionConfig`, *optional*):
+             The configuration for the vision encoder.
+         image_token_id (`int`, *optional*, defaults to 151655):
+             The token index for image placeholder.
+         video_token_id (`int`, *optional*, defaults to 151656):
+             The token index for video placeholder.
+         vision_start_token_id (`int`, *optional*, defaults to 151652):
+             The token index denoting start of vision input.
+         vision_end_token_id (`int`, *optional*, defaults to 151653):
+             The token index denoting end of vision input.
+         enable_bd3lm (`bool`, *optional*, defaults to `True`):
+             Whether to enable BD3LM diffusion-based generation.
+         bd3lm_block_size (`int`, *optional*, defaults to 8):
+             Block size for BD3LM generation.
+         bd3lm_cross_attn (`bool`, *optional*, defaults to `True`):
+             Whether to use cross-attention in BD3LM.
+         mask_token_id (`int`, *optional*, defaults to 151671):
+             The token index for mask token used in diffusion.
+         rope_theta (`float`, *optional*, defaults to 1000000.0):
+             The base period of the RoPE embeddings.
+         rope_scaling (`Dict`, *optional*):
+             Dictionary containing the scaling configuration for RoPE embeddings.
+
+     Example:
+
+     ```python
+     >>> from transformers import AutoModelForCausalLM
+     >>> from configuration_diffusionvl_qwen2_5_vl import DiffusionVL_Qwen2_5_VL_Config
+
+     >>> # Initializing a DiffusionVL configuration
+     >>> configuration = DiffusionVL_Qwen2_5_VL_Config()
+
+     >>> # Loading a pretrained model with this configuration
+     >>> model = AutoModelForCausalLM.from_pretrained(
+     ...     "path/to/model", config=configuration, trust_remote_code=True
+     ... )
+
+     >>> # Accessing the model configuration
+     >>> configuration = model.config
+     ```
+     """
+
+     model_type = "diffusionvl_qwenvl"
+     sub_configs = {"vision_config": DiffusionVL_Qwen2_5_VL_VisionConfig}
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     def __init__(
+         self,
+         vocab_size: int = 152064,
+         hidden_size: int = 3584,
+         intermediate_size: int = 18944,
+         num_hidden_layers: int = 28,
+         num_attention_heads: int = 28,
+         num_key_value_heads: int = 4,
+         hidden_act: str = "silu",
+         max_position_embeddings: int = 128000,
+         initializer_range: float = 0.02,
+         rms_norm_eps: float = 1e-6,
+         use_cache: bool = True,
+         tie_word_embeddings: bool = False,
+         attention_dropout: float = 0.0,
+         # Vision configuration
+         vision_config: Optional[Union[DiffusionVL_Qwen2_5_VL_VisionConfig, dict]] = None,
+         # Multimodal token IDs
+         image_token_id: int = 151655,
+         video_token_id: int = 151656,
+         vision_start_token_id: int = 151652,
+         vision_end_token_id: int = 151653,
+         # BD3LM diffusion parameters
+         enable_bd3lm: bool = True,
+         bd3lm_block_size: int = 8,
+         bd3lm_cross_attn: bool = True,
+         bd3lm_antithetic_sampling: bool = True,
+         bd3lm_sampling_eps_min: float = 1e-3,
+         bd3lm_sampling_eps_max: float = 1.0,
+         mask_token_id: int = 151671,
+         # RoPE parameters
+         rope_theta: float = 1000000.0,
+         rope_scaling: Optional[dict] = None,
+         **kwargs,
+     ):
+         # Remove text_config from kwargs to avoid GenerationConfig issues
+         # (text_config is only needed for train code, HF config uses flattened params)
+         kwargs.pop("text_config", None)
+
+         # Text model configuration
+         self.vocab_size = vocab_size
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.num_key_value_heads = num_key_value_heads
+         self.hidden_act = hidden_act
+         self.max_position_embeddings = max_position_embeddings
+         self.initializer_range = initializer_range
+         self.rms_norm_eps = rms_norm_eps
+         self.use_cache = use_cache
+         self.attention_dropout = attention_dropout
+         self.rope_theta = rope_theta
+         self.rope_scaling = rope_scaling or {
+             "mrope_section": [16, 24, 24],
+             "rope_type": "default",
+             "type": "default",
+         }
+
+         # Vision configuration
+         if vision_config is None:
+             self.vision_config = DiffusionVL_Qwen2_5_VL_VisionConfig()
+         elif isinstance(vision_config, dict):
+             self.vision_config = DiffusionVL_Qwen2_5_VL_VisionConfig(**vision_config)
+         elif isinstance(vision_config, DiffusionVL_Qwen2_5_VL_VisionConfig):
+             self.vision_config = vision_config
+         else:
+             self.vision_config = DiffusionVL_Qwen2_5_VL_VisionConfig()
+
+         # Multimodal token IDs
+         self.image_token_id = image_token_id
+         self.video_token_id = video_token_id
+         self.vision_start_token_id = vision_start_token_id
+         self.vision_end_token_id = vision_end_token_id
+
+         # BD3LM diffusion configuration
+         self.enable_bd3lm = enable_bd3lm
+         self.bd3lm_block_size = bd3lm_block_size
+         self.bd3lm_cross_attn = bd3lm_cross_attn
+         self.bd3lm_antithetic_sampling = bd3lm_antithetic_sampling
+         self.bd3lm_sampling_eps_min = bd3lm_sampling_eps_min
+         self.bd3lm_sampling_eps_max = bd3lm_sampling_eps_max
+         self.mask_token_id = mask_token_id
+
+         super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
+
+
+ __all__ = ["DiffusionVL_Qwen2_5_VL_Config", "DiffusionVL_Qwen2_5_VL_VisionConfig"]
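The `vision_config` handling in `DiffusionVL_Qwen2_5_VL_Config.__init__` accepts `None`, a plain dict (as stored in `config.json`), or an already-built config object. The dependency-free Python sketch below reproduces that coercion pattern with a hypothetical stand-in class, `VisionConfigSketch`, so it runs without `transformers`. It is not the real class, and unknown keys are simply collected in `extra` here, whereas the real class forwards them to `PretrainedConfig`.

```python
class VisionConfigSketch:
    """Hypothetical stand-in for DiffusionVL_Qwen2_5_VL_VisionConfig."""

    def __init__(self, depth=32, hidden_size=1280, out_hidden_size=3584,
                 fullatt_block_indexes=None, **extra):
        self.depth = depth
        self.hidden_size = hidden_size
        self.out_hidden_size = out_hidden_size
        # Same `or` fallback as the original: None or [] both select the default.
        self.fullatt_block_indexes = fullatt_block_indexes or [7, 15, 23, 31]
        self.extra = extra  # keys this sketch does not model, e.g. "in_chans"


def coerce_vision_config(vision_config):
    """Mirror the None / dict / instance / fallback branches of the original."""
    if isinstance(vision_config, dict):
        return VisionConfigSketch(**vision_config)
    if isinstance(vision_config, VisionConfigSketch):
        return vision_config
    return VisionConfigSketch()  # covers None and any unsupported type


default_cfg = coerce_vision_config(None)
from_json = coerce_vision_config({"out_hidden_size": 2048, "in_chans": 3})
print(default_cfg.out_hidden_size, from_json.out_hidden_size, from_json.extra)
```

Two behaviors are worth noting. The class default `out_hidden_size=3584` differs from the 2048 stored in this checkpoint's `config.json`, so the stored dict matters. The `or` fallback also means an explicitly empty `fullatt_block_indexes` list is replaced by the default indices.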
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4d81a671904cbb52bfbdea35c481f647861ac796d1c98e26b746fb68f11f6bef
+ size 4957560304
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:585c011ef5e6151ade29126c8fbeb317622f1533ddec9b1e1d3b5c3b6e9511e8
+ size 3174117400
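The two `.safetensors` entries in this commit are Git LFS pointer files rather than the weights themselves: each records a spec version, a SHA-256 object id, and the byte size of the real file. A minimal Python sketch for reading such pointers follows; the function name `parse_lfs_pointer` is hypothetical and the parser assumes the simple three-line `key value` layout shown above.

```python
POINTER_1 = """\
version https://git-lfs.github.com/spec/v1
oid sha256:4d81a671904cbb52bfbdea35c481f647861ac796d1c98e26b746fb68f11f6bef
size 4957560304
"""

POINTER_2 = """\
version https://git-lfs.github.com/spec/v1
oid sha256:585c011ef5e6151ade29126c8fbeb317622f1533ddec9b1e1d3b5c3b6e9511e8
size 3174117400
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line and break the oid into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


shards = [parse_lfs_pointer(POINTER_1), parse_lfs_pointer(POINTER_2)]
total_bytes = sum(shard["size"] for shard in shards)
print(total_bytes)  # → 8131677704
```

After downloading a shard, its integrity can be checked by comparing the file's SHA-256 (for example via `hashlib.sha256`) and its size against the `digest` and `size` fields.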
model.safetensors.index.json ADDED
@@ -0,0 +1,832 @@
+ {
+   "metadata": {
+     "total_parameters": 3754622976,
+     "total_size": 7509245960
+   },
+   "weight_map": {
+     "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
45
+ "model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
46
+ "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
47
+ "model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
48
+ "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
49
+ "model.layers.11.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
50
+ "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
51
+ "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
52
+ "model.layers.11.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
53
+ "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
54
+ "model.layers.11.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
55
+ "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
56
+ "model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
57
+ "model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
58
+ "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
59
+ "model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
60
+ "model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
61
+ "model.layers.12.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
62
+ "model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
63
+ "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
64
+ "model.layers.12.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
65
+ "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
66
+ "model.layers.12.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
67
+ "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
68
+ "model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
69
+ "model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
70
+ "model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
71
+ "model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
72
+ "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
73
+ "model.layers.13.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
74
+ "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
75
+ "model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
76
+ "model.layers.13.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
77
+ "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
78
+ "model.layers.13.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
79
+ "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
80
+ "model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
81
+ "model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
82
+ "model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
83
+ "model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
84
+ "model.layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
85
+ "model.layers.14.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
86
+ "model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
87
+ "model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
88
+ "model.layers.14.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
89
+ "model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
90
+ "model.layers.14.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
91
+ "model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
92
+ "model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
93
+ "model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
94
+ "model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
95
+ "model.layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
96
+ "model.layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
97
+ "model.layers.15.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
98
+ "model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
99
+ "model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
100
+ "model.layers.15.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
101
+ "model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
102
+ "model.layers.15.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
103
+ "model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
104
+ "model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
105
+ "model.layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
106
+ "model.layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
107
+ "model.layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
108
+ "model.layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
109
+ "model.layers.16.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
110
+ "model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
111
+ "model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
112
+ "model.layers.16.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
113
+ "model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
114
+ "model.layers.16.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
115
+ "model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
116
+ "model.layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
117
+ "model.layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
118
+ "model.layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
119
+ "model.layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
120
+ "model.layers.17.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
121
+ "model.layers.17.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
122
+ "model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
123
+ "model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
124
+ "model.layers.17.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
125
+ "model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
126
+ "model.layers.17.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
127
+ "model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
128
+ "model.layers.18.input_layernorm.weight": "model-00001-of-00002.safetensors",
129
+ "model.layers.18.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
130
+ "model.layers.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
131
+ "model.layers.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
132
+ "model.layers.18.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
133
+ "model.layers.18.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
134
+ "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
135
+ "model.layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
136
+ "model.layers.18.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
137
+ "model.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
138
+ "model.layers.18.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
139
+ "model.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
140
+ "model.layers.19.input_layernorm.weight": "model-00001-of-00002.safetensors",
141
+ "model.layers.19.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
142
+ "model.layers.19.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
143
+ "model.layers.19.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
144
+ "model.layers.19.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
145
+ "model.layers.19.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
146
+ "model.layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
147
+ "model.layers.19.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
148
+ "model.layers.19.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
149
+ "model.layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
150
+ "model.layers.19.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
151
+ "model.layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
152
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
153
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
154
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
155
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
156
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
157
+ "model.layers.2.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
158
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
159
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
160
+ "model.layers.2.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
161
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
162
+ "model.layers.2.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
163
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
164
+ "model.layers.20.input_layernorm.weight": "model-00001-of-00002.safetensors",
165
+ "model.layers.20.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
166
+ "model.layers.20.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
167
+ "model.layers.20.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
168
+ "model.layers.20.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
169
+ "model.layers.20.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
170
+ "model.layers.20.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
171
+ "model.layers.20.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
172
+ "model.layers.20.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
173
+ "model.layers.20.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
174
+ "model.layers.20.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
175
+ "model.layers.20.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
176
+ "model.layers.21.input_layernorm.weight": "model-00001-of-00002.safetensors",
177
+ "model.layers.21.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
178
+ "model.layers.21.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
179
+ "model.layers.21.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
180
+ "model.layers.21.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
181
+ "model.layers.21.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
182
+ "model.layers.21.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
183
+ "model.layers.21.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
184
+ "model.layers.21.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
185
+ "model.layers.21.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
186
+ "model.layers.21.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
187
+ "model.layers.21.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
188
+ "model.layers.22.input_layernorm.weight": "model-00001-of-00002.safetensors",
189
+ "model.layers.22.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
190
+ "model.layers.22.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
191
+ "model.layers.22.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
192
+ "model.layers.22.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
193
+ "model.layers.22.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
194
+ "model.layers.22.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
195
+ "model.layers.22.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
196
+ "model.layers.22.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
197
+ "model.layers.22.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
198
+ "model.layers.22.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
199
+ "model.layers.22.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
200
+ "model.layers.23.input_layernorm.weight": "model-00001-of-00002.safetensors",
201
+ "model.layers.23.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
202
+ "model.layers.23.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
203
+ "model.layers.23.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
204
+ "model.layers.23.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
205
+ "model.layers.23.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
206
+ "model.layers.23.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
207
+ "model.layers.23.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
208
+ "model.layers.23.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
209
+ "model.layers.23.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
210
+ "model.layers.23.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
211
+ "model.layers.23.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
212
+ "model.layers.24.input_layernorm.weight": "model-00001-of-00002.safetensors",
213
+ "model.layers.24.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
214
+ "model.layers.24.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
215
+ "model.layers.24.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
216
+ "model.layers.24.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
217
+ "model.layers.24.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
218
+ "model.layers.24.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
219
+ "model.layers.24.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
220
+ "model.layers.24.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
221
+ "model.layers.24.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
222
+ "model.layers.24.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
223
+ "model.layers.24.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
224
+ "model.layers.25.input_layernorm.weight": "model-00001-of-00002.safetensors",
225
+ "model.layers.25.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
226
+ "model.layers.25.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
227
+ "model.layers.25.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
228
+ "model.layers.25.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
229
+ "model.layers.25.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
230
+ "model.layers.25.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
231
+ "model.layers.25.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
232
+ "model.layers.25.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
233
+ "model.layers.25.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
234
+ "model.layers.25.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
235
+ "model.layers.25.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
236
+ "model.layers.26.input_layernorm.weight": "model-00001-of-00002.safetensors",
237
+ "model.layers.26.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
238
+ "model.layers.26.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
239
+ "model.layers.26.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
240
+ "model.layers.26.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
241
+ "model.layers.26.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
242
+ "model.layers.26.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
243
+ "model.layers.26.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
244
+ "model.layers.26.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
245
+ "model.layers.26.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
246
+ "model.layers.26.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
247
+ "model.layers.26.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
248
+ "model.layers.27.input_layernorm.weight": "model-00001-of-00002.safetensors",
249
+ "model.layers.27.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
250
+ "model.layers.27.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
251
+ "model.layers.27.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
252
+ "model.layers.27.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
253
+ "model.layers.27.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
254
+ "model.layers.27.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
255
+ "model.layers.27.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
256
+ "model.layers.27.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
257
+ "model.layers.27.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
258
+ "model.layers.27.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
259
+ "model.layers.27.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
260
+ "model.layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
261
+ "model.layers.28.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
262
+ "model.layers.28.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
263
+ "model.layers.28.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
264
+ "model.layers.28.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
265
+ "model.layers.28.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
266
+ "model.layers.28.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
267
+ "model.layers.28.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
268
+ "model.layers.28.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
269
+ "model.layers.28.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
270
+ "model.layers.28.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
271
+ "model.layers.28.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
272
+ "model.layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
273
+ "model.layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
274
+ "model.layers.29.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
275
+ "model.layers.29.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
276
+ "model.layers.29.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
277
+ "model.layers.29.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
278
+ "model.layers.29.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
279
+ "model.layers.29.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
280
+ "model.layers.29.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
281
+ "model.layers.29.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
282
+ "model.layers.29.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
283
+ "model.layers.29.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
284
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
285
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
286
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
287
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
288
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
289
+ "model.layers.3.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
290
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
291
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
292
+ "model.layers.3.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
293
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
294
+ "model.layers.3.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
295
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
296
+ "model.layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
297
+ "model.layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
298
+ "model.layers.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
299
+ "model.layers.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
300
+ "model.layers.30.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
301
+ "model.layers.30.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
302
+ "model.layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
303
+ "model.layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
304
+ "model.layers.30.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
305
+ "model.layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
306
+ "model.layers.30.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
307
+ "model.layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
308
+ "model.layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
309
+ "model.layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
310
+ "model.layers.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
311
+ "model.layers.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
312
+ "model.layers.31.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
313
+ "model.layers.31.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
314
+ "model.layers.31.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
315
+ "model.layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
316
+ "model.layers.31.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
317
+ "model.layers.31.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
318
+ "model.layers.31.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
319
+ "model.layers.31.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
320
+ "model.layers.32.input_layernorm.weight": "model-00002-of-00002.safetensors",
321
+ "model.layers.32.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
322
+ "model.layers.32.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
323
+ "model.layers.32.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
324
+ "model.layers.32.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
325
+ "model.layers.32.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
326
+ "model.layers.32.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
327
+ "model.layers.32.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
328
+ "model.layers.32.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
329
+ "model.layers.32.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
330
+ "model.layers.32.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
331
+ "model.layers.32.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
332
+ "model.layers.33.input_layernorm.weight": "model-00002-of-00002.safetensors",
333
+ "model.layers.33.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
334
+ "model.layers.33.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
335
+ "model.layers.33.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
336
+ "model.layers.33.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
337
+ "model.layers.33.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
338
+ "model.layers.33.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
339
+ "model.layers.33.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
340
+ "model.layers.33.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
341
+ "model.layers.33.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
342
+ "model.layers.33.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
343
+ "model.layers.33.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
344
+ "model.layers.34.input_layernorm.weight": "model-00002-of-00002.safetensors",
345
+ "model.layers.34.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
346
+ "model.layers.34.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
347
+ "model.layers.34.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
348
+ "model.layers.34.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
349
+ "model.layers.34.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
350
+ "model.layers.34.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
351
+ "model.layers.34.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
352
+ "model.layers.34.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
353
+ "model.layers.34.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
354
+ "model.layers.34.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
355
+ "model.layers.34.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
356
+ "model.layers.35.input_layernorm.weight": "model-00002-of-00002.safetensors",
357
+ "model.layers.35.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
358
+ "model.layers.35.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
359
+ "model.layers.35.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
360
+ "model.layers.35.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
361
+ "model.layers.35.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
362
+ "model.layers.35.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
363
+ "model.layers.35.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
364
+ "model.layers.35.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
365
+ "model.layers.35.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
366
+ "model.layers.35.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
367
+ "model.layers.35.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
368
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
369
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
370
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
371
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
372
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
373
+ "model.layers.4.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
374
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
375
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
376
+ "model.layers.4.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
377
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
378
+ "model.layers.4.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
379
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
380
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
381
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
382
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
383
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
384
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
385
+ "model.layers.5.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
386
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
387
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
388
+ "model.layers.5.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
389
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
390
+ "model.layers.5.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
391
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
392
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
393
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
394
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
395
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
396
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
397
+ "model.layers.6.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
398
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
399
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
400
+ "model.layers.6.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
401
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.mm_projector.merger.ln_q.weight": "model-00002-of-00002.safetensors",
+ "model.mm_projector.merger.mlp.0.bias": "model-00002-of-00002.safetensors",
+ "model.mm_projector.merger.mlp.0.weight": "model-00002-of-00002.safetensors",
+ "model.mm_projector.merger.mlp.2.bias": "model-00002-of-00002.safetensors",
+ "model.mm_projector.merger.mlp.2.weight": "model-00002-of-00002.safetensors",
+ "model.norm.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.0.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.0.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.0.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.0.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.0.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.0.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.0.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.0.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.0.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.0.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.0.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.0.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.1.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.1.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.1.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.1.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.1.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.1.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.1.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.1.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.1.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.1.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.1.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.1.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.10.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.10.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.10.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.10.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.10.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.10.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.10.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.10.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.10.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.10.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.10.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.10.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.11.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.11.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.11.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.11.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.11.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.11.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.11.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.11.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.11.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.11.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.11.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.11.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.12.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.12.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.12.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.12.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.12.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.12.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.12.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.12.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.12.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.12.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.12.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.12.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.13.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.13.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.13.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.13.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.13.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.13.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.13.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.13.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.13.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.13.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.13.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.13.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.14.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.14.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.14.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.14.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.14.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.14.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.14.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.14.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.14.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.14.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.14.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.14.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.15.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.15.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.15.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.15.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.15.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.15.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.15.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.15.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.15.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.15.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.15.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.15.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.16.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.16.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.16.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.16.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.16.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.16.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.16.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.16.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.16.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.16.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.16.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.16.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.17.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.17.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.17.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.17.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.17.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.17.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.17.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.17.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.17.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.17.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.17.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.17.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.18.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.18.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.18.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.18.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.18.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.18.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.18.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.18.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.18.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.18.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.18.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.18.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.19.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.19.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.19.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.19.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.19.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.19.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.19.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.19.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.19.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.19.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.19.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.19.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.2.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.2.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.2.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.2.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.2.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.2.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.2.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.2.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.2.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.2.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.2.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.2.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.20.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.20.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.20.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.20.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.20.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.20.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.20.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.20.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.20.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.20.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.20.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.21.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.21.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.21.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.21.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.21.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.21.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.21.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.21.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.21.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.22.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.22.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.22.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.22.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.22.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.22.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.22.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.22.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.22.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.23.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.23.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.23.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.23.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.23.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.23.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.23.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.23.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.23.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.24.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.24.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.24.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.24.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.24.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.24.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.24.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.24.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.24.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.25.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.25.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.25.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.25.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.25.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.25.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.25.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.25.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.25.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.26.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.26.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.26.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.26.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.26.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.26.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.26.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.26.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.26.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.26.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.26.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.27.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.27.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.27.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.27.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.27.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.27.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.27.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.27.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.27.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.27.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.27.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.28.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.28.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.28.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.28.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.28.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.28.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.28.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.28.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.28.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.28.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.28.norm1.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.28.norm2.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.29.attn.proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.29.attn.proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.29.attn.qkv.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.29.attn.qkv.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.29.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.29.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.29.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.vision_tower.vision_tower.blocks.29.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
719
+ "model.vision_tower.vision_tower.blocks.29.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
720
+ "model.vision_tower.vision_tower.blocks.29.norm1.weight": "model-00002-of-00002.safetensors",
721
+ "model.vision_tower.vision_tower.blocks.29.norm2.weight": "model-00002-of-00002.safetensors",
722
+ "model.vision_tower.vision_tower.blocks.3.attn.proj.bias": "model-00002-of-00002.safetensors",
723
+ "model.vision_tower.vision_tower.blocks.3.attn.proj.weight": "model-00002-of-00002.safetensors",
724
+ "model.vision_tower.vision_tower.blocks.3.attn.qkv.bias": "model-00002-of-00002.safetensors",
725
+ "model.vision_tower.vision_tower.blocks.3.attn.qkv.weight": "model-00002-of-00002.safetensors",
726
+ "model.vision_tower.vision_tower.blocks.3.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
727
+ "model.vision_tower.vision_tower.blocks.3.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
728
+ "model.vision_tower.vision_tower.blocks.3.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
729
+ "model.vision_tower.vision_tower.blocks.3.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
730
+ "model.vision_tower.vision_tower.blocks.3.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
731
+ "model.vision_tower.vision_tower.blocks.3.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
732
+ "model.vision_tower.vision_tower.blocks.3.norm1.weight": "model-00002-of-00002.safetensors",
733
+ "model.vision_tower.vision_tower.blocks.3.norm2.weight": "model-00002-of-00002.safetensors",
734
+ "model.vision_tower.vision_tower.blocks.30.attn.proj.bias": "model-00002-of-00002.safetensors",
735
+ "model.vision_tower.vision_tower.blocks.30.attn.proj.weight": "model-00002-of-00002.safetensors",
736
+ "model.vision_tower.vision_tower.blocks.30.attn.qkv.bias": "model-00002-of-00002.safetensors",
737
+ "model.vision_tower.vision_tower.blocks.30.attn.qkv.weight": "model-00002-of-00002.safetensors",
738
+ "model.vision_tower.vision_tower.blocks.30.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
739
+ "model.vision_tower.vision_tower.blocks.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
740
+ "model.vision_tower.vision_tower.blocks.30.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
741
+ "model.vision_tower.vision_tower.blocks.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
742
+ "model.vision_tower.vision_tower.blocks.30.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
743
+ "model.vision_tower.vision_tower.blocks.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
744
+ "model.vision_tower.vision_tower.blocks.30.norm1.weight": "model-00002-of-00002.safetensors",
745
+ "model.vision_tower.vision_tower.blocks.30.norm2.weight": "model-00002-of-00002.safetensors",
746
+ "model.vision_tower.vision_tower.blocks.31.attn.proj.bias": "model-00002-of-00002.safetensors",
747
+ "model.vision_tower.vision_tower.blocks.31.attn.proj.weight": "model-00002-of-00002.safetensors",
748
+ "model.vision_tower.vision_tower.blocks.31.attn.qkv.bias": "model-00002-of-00002.safetensors",
749
+ "model.vision_tower.vision_tower.blocks.31.attn.qkv.weight": "model-00002-of-00002.safetensors",
750
+ "model.vision_tower.vision_tower.blocks.31.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
751
+ "model.vision_tower.vision_tower.blocks.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
752
+ "model.vision_tower.vision_tower.blocks.31.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
753
+ "model.vision_tower.vision_tower.blocks.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
754
+ "model.vision_tower.vision_tower.blocks.31.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
755
+ "model.vision_tower.vision_tower.blocks.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
756
+ "model.vision_tower.vision_tower.blocks.31.norm1.weight": "model-00002-of-00002.safetensors",
757
+ "model.vision_tower.vision_tower.blocks.31.norm2.weight": "model-00002-of-00002.safetensors",
758
+ "model.vision_tower.vision_tower.blocks.4.attn.proj.bias": "model-00002-of-00002.safetensors",
759
+ "model.vision_tower.vision_tower.blocks.4.attn.proj.weight": "model-00002-of-00002.safetensors",
760
+ "model.vision_tower.vision_tower.blocks.4.attn.qkv.bias": "model-00002-of-00002.safetensors",
761
+ "model.vision_tower.vision_tower.blocks.4.attn.qkv.weight": "model-00002-of-00002.safetensors",
762
+ "model.vision_tower.vision_tower.blocks.4.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
763
+ "model.vision_tower.vision_tower.blocks.4.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
764
+ "model.vision_tower.vision_tower.blocks.4.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
765
+ "model.vision_tower.vision_tower.blocks.4.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
766
+ "model.vision_tower.vision_tower.blocks.4.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
767
+ "model.vision_tower.vision_tower.blocks.4.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
768
+ "model.vision_tower.vision_tower.blocks.4.norm1.weight": "model-00002-of-00002.safetensors",
769
+ "model.vision_tower.vision_tower.blocks.4.norm2.weight": "model-00002-of-00002.safetensors",
770
+ "model.vision_tower.vision_tower.blocks.5.attn.proj.bias": "model-00002-of-00002.safetensors",
771
+ "model.vision_tower.vision_tower.blocks.5.attn.proj.weight": "model-00002-of-00002.safetensors",
772
+ "model.vision_tower.vision_tower.blocks.5.attn.qkv.bias": "model-00002-of-00002.safetensors",
773
+ "model.vision_tower.vision_tower.blocks.5.attn.qkv.weight": "model-00002-of-00002.safetensors",
774
+ "model.vision_tower.vision_tower.blocks.5.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
775
+ "model.vision_tower.vision_tower.blocks.5.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
776
+ "model.vision_tower.vision_tower.blocks.5.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
777
+ "model.vision_tower.vision_tower.blocks.5.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
778
+ "model.vision_tower.vision_tower.blocks.5.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
779
+ "model.vision_tower.vision_tower.blocks.5.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
780
+ "model.vision_tower.vision_tower.blocks.5.norm1.weight": "model-00002-of-00002.safetensors",
781
+ "model.vision_tower.vision_tower.blocks.5.norm2.weight": "model-00002-of-00002.safetensors",
782
+ "model.vision_tower.vision_tower.blocks.6.attn.proj.bias": "model-00002-of-00002.safetensors",
783
+ "model.vision_tower.vision_tower.blocks.6.attn.proj.weight": "model-00002-of-00002.safetensors",
784
+ "model.vision_tower.vision_tower.blocks.6.attn.qkv.bias": "model-00002-of-00002.safetensors",
785
+ "model.vision_tower.vision_tower.blocks.6.attn.qkv.weight": "model-00002-of-00002.safetensors",
786
+ "model.vision_tower.vision_tower.blocks.6.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
787
+ "model.vision_tower.vision_tower.blocks.6.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
788
+ "model.vision_tower.vision_tower.blocks.6.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
789
+ "model.vision_tower.vision_tower.blocks.6.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
790
+ "model.vision_tower.vision_tower.blocks.6.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
791
+ "model.vision_tower.vision_tower.blocks.6.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
792
+ "model.vision_tower.vision_tower.blocks.6.norm1.weight": "model-00002-of-00002.safetensors",
793
+ "model.vision_tower.vision_tower.blocks.6.norm2.weight": "model-00002-of-00002.safetensors",
794
+ "model.vision_tower.vision_tower.blocks.7.attn.proj.bias": "model-00002-of-00002.safetensors",
795
+ "model.vision_tower.vision_tower.blocks.7.attn.proj.weight": "model-00002-of-00002.safetensors",
796
+ "model.vision_tower.vision_tower.blocks.7.attn.qkv.bias": "model-00002-of-00002.safetensors",
797
+ "model.vision_tower.vision_tower.blocks.7.attn.qkv.weight": "model-00002-of-00002.safetensors",
798
+ "model.vision_tower.vision_tower.blocks.7.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
799
+ "model.vision_tower.vision_tower.blocks.7.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
800
+ "model.vision_tower.vision_tower.blocks.7.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
801
+ "model.vision_tower.vision_tower.blocks.7.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
802
+ "model.vision_tower.vision_tower.blocks.7.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
803
+ "model.vision_tower.vision_tower.blocks.7.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
804
+ "model.vision_tower.vision_tower.blocks.7.norm1.weight": "model-00002-of-00002.safetensors",
805
+ "model.vision_tower.vision_tower.blocks.7.norm2.weight": "model-00002-of-00002.safetensors",
806
+ "model.vision_tower.vision_tower.blocks.8.attn.proj.bias": "model-00002-of-00002.safetensors",
807
+ "model.vision_tower.vision_tower.blocks.8.attn.proj.weight": "model-00002-of-00002.safetensors",
808
+ "model.vision_tower.vision_tower.blocks.8.attn.qkv.bias": "model-00002-of-00002.safetensors",
809
+ "model.vision_tower.vision_tower.blocks.8.attn.qkv.weight": "model-00002-of-00002.safetensors",
810
+ "model.vision_tower.vision_tower.blocks.8.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
811
+ "model.vision_tower.vision_tower.blocks.8.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
812
+ "model.vision_tower.vision_tower.blocks.8.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
813
+ "model.vision_tower.vision_tower.blocks.8.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
814
+ "model.vision_tower.vision_tower.blocks.8.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
815
+ "model.vision_tower.vision_tower.blocks.8.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
816
+ "model.vision_tower.vision_tower.blocks.8.norm1.weight": "model-00002-of-00002.safetensors",
817
+ "model.vision_tower.vision_tower.blocks.8.norm2.weight": "model-00002-of-00002.safetensors",
818
+ "model.vision_tower.vision_tower.blocks.9.attn.proj.bias": "model-00002-of-00002.safetensors",
819
+ "model.vision_tower.vision_tower.blocks.9.attn.proj.weight": "model-00002-of-00002.safetensors",
820
+ "model.vision_tower.vision_tower.blocks.9.attn.qkv.bias": "model-00002-of-00002.safetensors",
821
+ "model.vision_tower.vision_tower.blocks.9.attn.qkv.weight": "model-00002-of-00002.safetensors",
822
+ "model.vision_tower.vision_tower.blocks.9.mlp.down_proj.bias": "model-00002-of-00002.safetensors",
823
+ "model.vision_tower.vision_tower.blocks.9.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
824
+ "model.vision_tower.vision_tower.blocks.9.mlp.gate_proj.bias": "model-00002-of-00002.safetensors",
825
+ "model.vision_tower.vision_tower.blocks.9.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
826
+ "model.vision_tower.vision_tower.blocks.9.mlp.up_proj.bias": "model-00002-of-00002.safetensors",
827
+ "model.vision_tower.vision_tower.blocks.9.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
828
+ "model.vision_tower.vision_tower.blocks.9.norm1.weight": "model-00002-of-00002.safetensors",
829
+ "model.vision_tower.vision_tower.blocks.9.norm2.weight": "model-00002-of-00002.safetensors",
830
+ "model.vision_tower.vision_tower.patch_embed.proj.weight": "model-00002-of-00002.safetensors"
831
+ }
832
+ }
modeling_diffusionvl_qwen2_5_vl.py ADDED
@@ -0,0 +1,1971 @@
1
+ # coding=utf-8
2
+ # Copyright 2025 The HustVL Team and The HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on Qwen2.5-VL, which is derived from EleutherAI's GPT-NeoX library
5
+ # and the GPT-NeoX and OPT implementations. It has been modified to create DiffusionVL.
6
+ #
7
+ # Licensed under the Apache License, Version 2.0 (the "License");
8
+ # you may not use this file except in compliance with the License.
9
+ # You may obtain a copy of the License at
10
+ #
11
+ # http://www.apache.org/licenses/LICENSE-2.0
12
+ #
13
+ # Unless required by applicable law or agreed to in writing, software
14
+ # distributed under the License is distributed on an "AS IS" BASIS,
15
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16
+ # See the License for the specific language governing permissions and
17
+ # limitations under the License.
18
+
19
+ """DiffusionVL model implementation."""
20
+
21
+ import math
22
+ from dataclasses import dataclass
23
+ from typing import Callable, Dict, List, Optional, Tuple, Union
24
+
25
+ import torch
26
+ import torch.nn as nn
27
+ import torch.nn.functional as F
28
+
29
+ from transformers import PreTrainedModel
30
+ from transformers.activations import ACT2FN
31
+ from transformers.cache_utils import Cache, DynamicCache
32
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, ModelOutput
33
+ from transformers.utils import logging
34
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
35
+ from transformers.modeling_layers import GradientCheckpointingLayer
36
+ from transformers.integrations import use_kernel_forward_from_hub
37
+
38
+ from .configuration_diffusionvl_qwen2_5_vl import DiffusionVL_Qwen2_5_VL_Config, DiffusionVL_Qwen2_5_VL_VisionConfig
39
+
40
+ IMAGE_TOKEN_INDEX = -200
41
+
42
+ def rotate_half(x: torch.Tensor) -> torch.Tensor:
43
+ """
44
+ Rotates half the hidden dims of the input for rotary position embedding.
45
+
46
+ Args:
47
+ x: Input tensor of shape (..., head_dim).
48
+
49
+ Returns:
50
+ Rotated tensor of the same shape.
51
+ """
52
+ x1 = x[..., : x.shape[-1] // 2]
53
+ x2 = x[..., x.shape[-1] // 2 :]
54
+ return torch.cat((-x2, x1), dim=-1)
55
+
56
+
57
+ def apply_rotary_pos_emb_vision(
58
+ q: torch.Tensor,
59
+ k: torch.Tensor,
60
+ cos: torch.Tensor,
61
+ sin: torch.Tensor,
62
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
63
+ """
64
+ Apply rotary position embedding for vision encoder.
65
+
66
+ Args:
67
+ q: Query tensor.
68
+ k: Key tensor.
69
+ cos: Cosine part of rotary embedding.
70
+ sin: Sine part of rotary embedding.
71
+
72
+ Returns:
73
+ Tuple of (rotated_q, rotated_k).
74
+ """
75
+ orig_q_dtype = q.dtype
76
+ orig_k_dtype = k.dtype
77
+ q, k = q.float(), k.float()
78
+ cos, sin = cos.unsqueeze(-2).float(), sin.unsqueeze(-2).float()
79
+ q_embed = (q * cos) + (rotate_half(q) * sin)
80
+ k_embed = (k * cos) + (rotate_half(k) * sin)
81
+ return q_embed.to(orig_q_dtype), k_embed.to(orig_k_dtype)
82
+
83
+
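As a rough illustration of what `rotate_half` and the `q * cos + rotate_half(q) * sin` pattern above compute, here is a minimal pure-Python sketch on plain lists. The helper names `rotate_half_list` and `apply_rope_list` are hypothetical and not part of this file; the sketch assumes one angle per channel pair, duplicated across both halves of the vector.

```python
import math


def rotate_half_list(x):
    """Mirror of rotate_half on a plain list: (x1, x2) -> (-x2, x1)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def apply_rope_list(x, angles):
    """Rotate channel pairs (i, i + half) by angles[i].

    `angles` has len(x) // 2 entries; cos/sin are duplicated across both
    halves so they line up with the rotate_half layout (an assumption of
    this sketch, not something read from the model code).
    """
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    rotated = rotate_half_list(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rotated, cos, sin)]


q = [1.0, 2.0, 3.0, 4.0]
q_rot = apply_rope_list(q, [0.5, 1.0])
```

Each pair `(x[i], x[i + half])` undergoes a plain 2D rotation, so the squared norm of `q_rot` should match that of `q`.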
84
+ def apply_multimodal_rotary_pos_emb(
85
+ q: torch.Tensor,
86
+ k: torch.Tensor,
87
+ cos: torch.Tensor,
88
+ sin: torch.Tensor,
89
+ mrope_section: List[int],
90
+ unsqueeze_dim: int = 1,
91
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
92
+ """
93
+ Apply multimodal rotary position embedding (M-RoPE) for 3D position encoding.
94
+
95
+ Args:
96
+ q: Query tensor of shape (batch, heads, seq_len, head_dim).
97
+ k: Key tensor of shape (batch, heads, seq_len, head_dim).
98
+ cos: Cosine tensor of shape (3, batch, seq_len, head_dim).
99
+ sin: Sine tensor of shape (3, batch, seq_len, head_dim).
100
+ mrope_section: List of 3 ints defining section sizes [temporal, height, width].
101
+ For example, [16, 24, 24] for head_dim=128.
102
+ unsqueeze_dim: Dimension to unsqueeze for broadcasting.
103
+
104
+ Returns:
105
+ Tuple of (rotated_q, rotated_k) with M-RoPE applied.
106
+ """
107
+ # mrope_section is like [16, 24, 24] for head_dim=128 (the three sizes sum to head_dim // 2)
108
+ # Repeat the list (list repetition, not elementwise doubling) so the six chunks span the full head_dim
109
+ mrope_section = mrope_section * 2 # [16, 24, 24] -> [16, 24, 24, 16, 24, 24]
110
+
111
+ # Split cos/sin along head_dim, then select appropriate dimension (0, 1, 2) for each section
112
+ # cos/sin shape: (3, batch, seq_len, head_dim)
113
+ cos = torch.cat(
114
+ [m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1
115
+ ).unsqueeze(unsqueeze_dim)
116
+ sin = torch.cat(
117
+ [m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1
118
+ ).unsqueeze(unsqueeze_dim)
119
+
120
+ q_embed = (q * cos) + (rotate_half(q) * sin)
121
+ k_embed = (k * cos) + (rotate_half(k) * sin)
122
+ return q_embed, k_embed
123
+
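The chunk selection in `apply_multimodal_rotary_pos_emb` can be easier to follow with the tensors removed. The sketch below, in plain Python, only tracks which position axis (0 = temporal, 1 = height, 2 = width) supplies cos/sin for each channel. The function name `mrope_axis_map` is hypothetical. Note that `mrope_section * 2` on a Python list repeats the list rather than doubling its entries.

```python
def mrope_axis_map(mrope_section, head_dim):
    """Return (sections, axes) where axes[c] is the position axis used for channel c."""
    sections = list(mrope_section) * 2  # list repetition: 3 sizes -> 6 sizes
    if sum(sections) != head_dim:
        raise ValueError("section sizes must sum to head_dim // 2")
    axes = []
    for i, size in enumerate(sections):
        # Chunk i reads from axis i % 3, matching m[i % 3] in the split above.
        axes.extend([i % 3] * size)
    return sections, axes


sections, axes = mrope_axis_map([16, 24, 24], 128)
```

With `[16, 24, 24]` and `head_dim=128`, channels 0-15 and 64-79 follow the temporal axis, 16-39 and 80-103 the height axis, and 40-63 and 104-127 the width axis.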
124
+ @use_kernel_forward_from_hub("RMSNorm")
125
+ class DiffusionVL_Qwen2_5_VL_RMSNorm(nn.Module):
126
+ """RMSNorm implementation matching Qwen2RMSNorm from modeling_qwen2.py"""
127
+ def __init__(self, hidden_size, eps=1e-6):
128
+ super().__init__()
129
+ self.weight = nn.Parameter(torch.ones(hidden_size))
130
+ self.variance_epsilon = eps
131
+
132
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
133
+ input_dtype = hidden_states.dtype
134
+ hidden_states = hidden_states.to(torch.float32)
135
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
136
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
137
+ return self.weight * hidden_states.to(input_dtype)
138
+
139
+ def extra_repr(self):
140
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
141
+
142
+
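For readers who want the arithmetic of the RMSNorm `forward` above without the dtype handling, a minimal scalar sketch follows. The function `rms_norm_list` is a hypothetical stand-in that works on a single vector held in a Python list; it omits the float32 upcast and the cast back to the input dtype.

```python
import math


def rms_norm_list(x, weight, eps=1e-6):
    """Scale x by 1 / sqrt(mean(x^2) + eps), then multiply by the per-channel weight."""
    variance = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(variance + eps)
    return [w * v * scale for w, v in zip(weight, x)]


y = rms_norm_list([3.0, 4.0], [1.0, 1.0])
```

With unit weights the output has a mean square very close to 1, which is the property the layer relies on. Unlike LayerNorm, no mean is subtracted.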
143
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
144
+ """
145
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
146
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
147
+ """
148
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
149
+ if n_rep == 1:
150
+ return hidden_states
151
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
152
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
153
+
154
+
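The head ordering produced by `repeat_kv` matters for grouped-query attention: each key/value head is repeated consecutively, not tiled. A small sketch on a list of labels, using the hypothetical helper `repeat_kv_heads`, shows the resulting layout along the head dimension.

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Repeat each KV head n_rep times in place, like repeat_interleave along the head dim."""
    if n_rep == 1:
        return list(kv_heads)
    return [head for head in kv_heads for _ in range(n_rep)]


expanded = repeat_kv_heads(["kv0", "kv1"], 3)
```

Under this layout, query head `j` is paired with KV head `j // n_rep`.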
155
+ def eager_attention_forward(
156
+ module: nn.Module,
157
+ query: torch.Tensor,
158
+ key: torch.Tensor,
159
+ value: torch.Tensor,
160
+ attention_mask: Optional[torch.Tensor],
161
+ scaling: float,
162
+ dropout: float = 0.0,
163
+ **kwargs,
164
+ ):
165
+ """Eager attention implementation."""
166
+ key_states = repeat_kv(key, module.num_key_value_groups)
167
+ value_states = repeat_kv(value, module.num_key_value_groups)
168
+
169
+ attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
170
+ if attention_mask is not None:
171
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
172
+ attn_weights = attn_weights + causal_mask
173
+
174
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
175
+ attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
176
+ attn_output = torch.matmul(attn_weights, value_states)
177
+ attn_output = attn_output.transpose(1, 2).contiguous()
178
+
179
+ return attn_output, attn_weights
180
+
181
+
182
+ class DiffusionVL_Qwen2_5_VL_VisionMLP(nn.Module):
183
+ def __init__(self, config, bias: bool = False):
184
+ super().__init__()
185
+ self.hidden_size = config.hidden_size
186
+ self.intermediate_size = config.intermediate_size
187
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=bias)
188
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=bias)
189
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=bias)
190
+ self.act_fn = ACT2FN[config.hidden_act]
191
+
192
+ def forward(self, hidden_state):
193
+ return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))
194
+
195
+
196
+ class DiffusionVL_Qwen2_5_VL_VisionPatchEmbed(nn.Module):
197
+ def __init__(self, patch_size=14, temporal_patch_size=2, in_channels=3, embed_dim=1152):
198
+ super().__init__()
199
+ self.patch_size = patch_size
200
+ self.temporal_patch_size = temporal_patch_size
201
+ self.in_channels = in_channels
202
+ self.embed_dim = embed_dim
203
+ kernel_size = [temporal_patch_size, patch_size, patch_size]
204
+ self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=kernel_size, stride=kernel_size, bias=False)
205
+
206
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
207
+ target_dtype = self.proj.weight.dtype
208
+ hidden_states = hidden_states.view(
209
+ -1, self.in_channels, self.temporal_patch_size, self.patch_size, self.patch_size
210
+ )
211
+ hidden_states = self.proj(hidden_states.to(dtype=target_dtype)).view(-1, self.embed_dim)
212
+ return hidden_states
213
+
214
+
215
+ class DiffusionVL_Qwen2_5_VL_VisionRotaryEmbedding(nn.Module):
216
+ inv_freq: torch.Tensor
217
+
218
+ def __init__(self, dim: int, theta: float = 10000.0):
219
+ super().__init__()
220
+ inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))
221
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
222
+
223
+ def forward(self, seqlen: int) -> torch.Tensor:
224
+ seq = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
225
+ freqs = torch.outer(seq, self.inv_freq)
226
+ return freqs
227
+
228
+
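The frequency table built by the vision rotary embedding above is an outer product of positions and inverse frequencies. The sketch below reproduces that arithmetic with plain lists; `rotary_freqs` is a hypothetical name, and the default `theta` simply mirrors the constructor default shown above.

```python
def rotary_freqs(seqlen, dim, theta=10000.0):
    """Return a seqlen x (dim // 2) table with entry [pos][k] = pos / theta ** (2k / dim)."""
    inv_freq = [1.0 / (theta ** (i / dim)) for i in range(0, dim, 2)]
    return [[pos * f for f in inv_freq] for pos in range(seqlen)]


freqs = rotary_freqs(3, 4)
```

Row `pos` holds the rotation angles for that position. In the model these rows are later indexed by the height and width ids computed in `rot_pos_emb`.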
229
+ class DiffusionVL_Qwen2_5_VL_VisionPatchMerger(nn.Module):
230
+ def __init__(self, dim: int, context_dim: int, spatial_merge_size: int = 2):
231
+ super().__init__()
232
+ self.hidden_size = context_dim * (spatial_merge_size ** 2)
233
+ self.ln_q = DiffusionVL_Qwen2_5_VL_RMSNorm(context_dim, eps=1e-6)
234
+ self.mlp = nn.Sequential(
235
+ nn.Linear(self.hidden_size, self.hidden_size),
236
+ nn.GELU(),
237
+ nn.Linear(self.hidden_size, dim),
238
+ )
239
+
240
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
241
+ x = self.mlp(self.ln_q(x).view(-1, self.hidden_size))
242
+ return x
243
+
244
+
245
+ class DiffusionVL_Qwen2_5_VL_VisionAttention(nn.Module):
246
+ def __init__(self, config: DiffusionVL_Qwen2_5_VL_VisionConfig) -> None:
247
+ super().__init__()
248
+ self.dim = config.hidden_size
249
+ self.num_heads = config.num_heads
250
+ self.head_dim = self.dim // self.num_heads
251
+ self.num_key_value_groups = 1 # needed for eager attention
252
+ self.qkv = nn.Linear(self.dim, self.dim * 3, bias=True)
253
+ self.proj = nn.Linear(self.dim, self.dim)
254
+ self.scaling = self.head_dim**-0.5
255
+ self.config = config
256
+ self.attention_dropout = 0.0
257
+ self.is_causal = False
258
+
259
+ def forward(
260
+ self,
261
+ hidden_states: torch.Tensor,
262
+ cu_seqlens: torch.Tensor,
263
+ rotary_pos_emb: Optional[torch.Tensor] = None,
264
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
265
+ **kwargs,
266
+ ) -> torch.Tensor:
267
+ seq_length = hidden_states.shape[0]
268
+ query_states, key_states, value_states = (
269
+ self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
270
+ )
271
+ cos, sin = position_embeddings
272
+ query_states, key_states = apply_rotary_pos_emb_vision(query_states, key_states, cos, sin)
273
+
274
+ query_states = query_states.transpose(0, 1).unsqueeze(0)
275
+ key_states = key_states.transpose(0, 1).unsqueeze(0)
276
+ value_states = value_states.transpose(0, 1).unsqueeze(0)
277
+
278
+ attention_interface: Callable = eager_attention_forward
279
+ if getattr(self.config, "_attn_implementation", "eager") != "eager":
280
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
281
+
282
+ if getattr(self.config, "_attn_implementation", "eager") == "flash_attention_2":
283
+ # Flash Attention 2: Use cu_seqlens for variable length attention
284
+ max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max()
285
+ attn_output, _ = attention_interface(
286
+ self,
287
+ query_states,
288
+ key_states,
289
+ value_states,
290
+ attention_mask=None,
291
+ scaling=self.scaling,
292
+ dropout=0.0 if not self.training else self.attention_dropout,
293
+ cu_seq_lens_q=cu_seqlens,
294
+ cu_seq_lens_k=cu_seqlens,
295
+ max_length_q=max_seqlen,
296
+ max_length_k=max_seqlen,
297
+ is_causal=False,
298
+ **kwargs,
299
+ )
300
+ else:
301
+ # Other implementations: Process each chunk separately
302
+ lengths = cu_seqlens[1:] - cu_seqlens[:-1]
303
+ splits = [
304
+ torch.split(tensor, lengths.tolist(), dim=2) for tensor in (query_states, key_states, value_states)
305
+ ]
306
+
307
+ attn_outputs = [
308
+ attention_interface(
309
+ self,
310
+ q,
311
+ k,
312
+ v,
313
+ attention_mask=None,
314
+ scaling=self.scaling,
315
+ dropout=0.0 if not self.training else self.attention_dropout,
316
+ is_causal=False,
317
+ **kwargs,
318
+ )[0]
319
+ for q, k, v in zip(*splits)
320
+ ]
321
+ attn_output = torch.cat(attn_outputs, dim=1)
322
+
323
+ attn_output = attn_output.reshape(seq_length, -1).contiguous()
324
+ attn_output = self.proj(attn_output)
325
+ return attn_output
326
+
327
+
328
+ class DiffusionVL_Qwen2_5_VL_VisionBlock(GradientCheckpointingLayer):
329
+ def __init__(self, config, attn_implementation: str = "sdpa") -> None:
330
+ super().__init__()
331
+ self.norm1 = DiffusionVL_Qwen2_5_VL_RMSNorm(config.hidden_size, eps=1e-6)
332
+ self.norm2 = DiffusionVL_Qwen2_5_VL_RMSNorm(config.hidden_size, eps=1e-6)
333
+ self.attn = DiffusionVL_Qwen2_5_VL_VisionAttention(config=config)
334
+ self.mlp = DiffusionVL_Qwen2_5_VL_VisionMLP(config, bias=True)
335
+
336
+ def forward(
337
+ self,
338
+ hidden_states: torch.Tensor,
339
+ cu_seqlens: torch.Tensor,
340
+ rotary_pos_emb: Optional[torch.Tensor] = None,
341
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
342
+ **kwargs,
343
+ ) -> torch.Tensor:
344
+ hidden_states = hidden_states + self.attn(
345
+ self.norm1(hidden_states),
346
+ cu_seqlens=cu_seqlens,
347
+ rotary_pos_emb=rotary_pos_emb,
348
+ position_embeddings=position_embeddings,
349
+ **kwargs,
350
+ )
351
+ hidden_states = hidden_states + self.mlp(self.norm2(hidden_states))
352
+ return hidden_states
353
+
354
+
355
+ class DiffusionVL_Qwen2_5_VL_VisionPreTrainedModel(PreTrainedModel):
356
+ config_class = DiffusionVL_Qwen2_5_VL_VisionConfig
357
+ base_model_prefix = "model"
358
+ supports_gradient_checkpointing = True
359
+ _no_split_modules = ["DiffusionVL_Qwen2_5_VL_VisionBlock"]
360
+ _supports_flash_attn_2 = True
361
+ _supports_sdpa = True
362
+ _supports_attention_backend = True
363
+
364
+
365
+ class DiffusionVL_Qwen2_5_VL_VisionTransformer(DiffusionVL_Qwen2_5_VL_VisionPreTrainedModel):
366
+ config_class = DiffusionVL_Qwen2_5_VL_VisionConfig
367
+ _no_split_modules = ["DiffusionVL_Qwen2_5_VL_VisionBlock"]
368
+
369
+ def __init__(self, config: DiffusionVL_Qwen2_5_VL_VisionConfig, *inputs, **kwargs) -> None:
370
+ super().__init__(config, *inputs, **kwargs)
371
+ self.spatial_merge_size = config.spatial_merge_size
372
+ self.patch_size = config.patch_size
373
+ self.fullatt_block_indexes = config.fullatt_block_indexes
374
+ self.window_size = config.window_size
375
+ self.spatial_merge_unit = self.spatial_merge_size * self.spatial_merge_size
376
+
377
+ self.patch_embed = DiffusionVL_Qwen2_5_VL_VisionPatchEmbed(
378
+ patch_size=config.patch_size,
379
+ temporal_patch_size=config.temporal_patch_size,
380
+ in_channels=config.in_channels,
381
+ embed_dim=config.hidden_size,
382
+ )
383
+
384
+ head_dim = config.hidden_size // config.num_heads
385
+ self.rotary_pos_emb = DiffusionVL_Qwen2_5_VL_VisionRotaryEmbedding(head_dim // 2)
386
+
387
+ self.blocks = nn.ModuleList([DiffusionVL_Qwen2_5_VL_VisionBlock(config) for _ in range(config.depth)])
388
+ self.gradient_checkpointing = False
389
+
390
+ def rot_pos_emb(self, grid_thw: torch.Tensor) -> torch.Tensor:
391
+
392
+ pos_ids = []
393
+ for t, h, w in grid_thw:
394
+ hpos_ids = torch.arange(h).unsqueeze(1).expand(-1, w)
395
+ hpos_ids = hpos_ids.reshape(
396
+ h // self.spatial_merge_size,
397
+ self.spatial_merge_size,
398
+ w // self.spatial_merge_size,
399
+ self.spatial_merge_size,
400
+ )
401
+ hpos_ids = hpos_ids.permute(0, 2, 1, 3).flatten()
402
+
403
+ wpos_ids = torch.arange(w).unsqueeze(0).expand(h, -1)
404
+ wpos_ids = wpos_ids.reshape(
405
+ h // self.spatial_merge_size,
406
+ self.spatial_merge_size,
407
+ w // self.spatial_merge_size,
408
+ self.spatial_merge_size,
409
+ )
410
+ wpos_ids = wpos_ids.permute(0, 2, 1, 3).flatten()
411
+ pos_ids.append(torch.stack([hpos_ids, wpos_ids], dim=-1).repeat(t, 1))
412
+ pos_ids = torch.cat(pos_ids, dim=0)
413
+ max_grid_size = grid_thw[:, 1:].max()
414
+ rotary_pos_emb_full = self.rotary_pos_emb(max_grid_size)
415
+ rotary_pos_emb = rotary_pos_emb_full[pos_ids].flatten(1)
416
+ return rotary_pos_emb
417
+
418
+ def get_window_index(self, grid_thw: torch.Tensor):
419
+
420
+ window_index: list = []
421
+ cu_window_seqlens: list = [0]
422
+ window_index_id = 0
423
+ vit_merger_window_size = self.window_size // self.spatial_merge_size // self.patch_size
424
+
425
+ for grid_t, grid_h, grid_w in grid_thw:
426
+ llm_grid_h = grid_h // self.spatial_merge_size
427
+ llm_grid_w = grid_w // self.spatial_merge_size
428
+ index = torch.arange(grid_t * llm_grid_h * llm_grid_w).reshape(grid_t, llm_grid_h, llm_grid_w)
429
+ pad_h = vit_merger_window_size - llm_grid_h % vit_merger_window_size
430
+ pad_w = vit_merger_window_size - llm_grid_w % vit_merger_window_size
431
+ num_windows_h = (llm_grid_h + pad_h) // vit_merger_window_size
432
+ num_windows_w = (llm_grid_w + pad_w) // vit_merger_window_size
433
+ index_padded = F.pad(index, (0, pad_w, 0, pad_h), "constant", -100)
434
+ index_padded = index_padded.reshape(
435
+ grid_t,
436
+ num_windows_h,
437
+ vit_merger_window_size,
438
+ num_windows_w,
439
+ vit_merger_window_size,
440
+ )
441
+ index_padded = index_padded.permute(0, 1, 3, 2, 4).reshape(
442
+ grid_t,
443
+ num_windows_h * num_windows_w,
444
+ vit_merger_window_size,
445
+ vit_merger_window_size,
446
+ )
447
+ seqlens = (index_padded != -100).sum([2, 3]).reshape(-1)
448
+ index_padded = index_padded.reshape(-1)
449
+ index_new = index_padded[index_padded != -100]
450
+ window_index.append(index_new + window_index_id)
451
+ cu_seqlens_tmp = seqlens.cumsum(0) * self.spatial_merge_unit + cu_window_seqlens[-1]
452
+ cu_window_seqlens.extend(cu_seqlens_tmp.tolist())
453
+ window_index_id += (grid_t * llm_grid_h * llm_grid_w).item()
454
+ window_index = torch.cat(window_index, dim=0)
455
+ return window_index, cu_window_seqlens
456
+
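The reshapes in `get_window_index` are easier to follow on a small grid. Below is a minimal pure-Python sketch for a single frame, assuming a merged grid of `grid_h` x `grid_w` cells and a square window of `win` cells per side. `window_order` is a hypothetical helper written for illustration, not part of the model code; it omits the `spatial_merge_unit` scaling and the per-image index offset.

```python
def window_order(grid_h: int, grid_w: int, win: int):
    """Return (order, cu_seqlens) for one frame of a grid_h x grid_w grid."""
    # Same padding rule as the source: always pads by 1..win cells.
    pad_h = win - grid_h % win
    pad_w = win - grid_w % win
    num_windows_h = (grid_h + pad_h) // win
    num_windows_w = (grid_w + pad_w) // win

    order = []        # original cell indices, grouped window by window
    cu_seqlens = [0]  # cumulative number of real cells after each window
    for wh in range(num_windows_h):
        for ww in range(num_windows_w):
            count = 0
            for r in range(wh * win, (wh + 1) * win):
                for c in range(ww * win, (ww + 1) * win):
                    if r < grid_h and c < grid_w:  # skip padded cells
                        order.append(r * grid_w + c)
                        count += 1
            cu_seqlens.append(cu_seqlens[-1] + count)

    # Empty windows repeat the previous boundary; drop consecutive duplicates,
    # mirroring the torch.unique_consecutive call in forward().
    deduped = [cu_seqlens[0]]
    for v in cu_seqlens[1:]:
        if v != deduped[-1]:
            deduped.append(v)
    return order, deduped
```

For example, `window_order(4, 6, 4)` groups the first 16 cells into one full window and the remaining 8 cells into a partial window. Because the padding rule adds a full extra window when a side is already divisible by `win`, some windows can be entirely empty, which is why the duplicate boundaries need removing.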
457
+ def forward(self, hidden_states: torch.Tensor, grid_thw: torch.Tensor, **kwargs):
458
+
459
+ hidden_states = self.patch_embed(hidden_states)
460
+ rotary_pos_emb = self.rot_pos_emb(grid_thw)
461
+ window_index, cu_window_seqlens = self.get_window_index(grid_thw)
462
+ cu_window_seqlens = torch.tensor(
463
+ cu_window_seqlens,
464
+ device=hidden_states.device,
465
+ dtype=grid_thw.dtype if torch.jit.is_tracing() else torch.int32,
466
+ )
467
+ cu_window_seqlens = torch.unique_consecutive(cu_window_seqlens)
468
+
469
+ seq_len, _ = hidden_states.size()
470
+ hidden_states = hidden_states.reshape(seq_len // self.spatial_merge_unit, self.spatial_merge_unit, -1)
471
+ hidden_states = hidden_states[window_index, :, :]
472
+ hidden_states = hidden_states.reshape(seq_len, -1)
473
+ rotary_pos_emb = rotary_pos_emb.reshape(seq_len // self.spatial_merge_unit, self.spatial_merge_unit, -1)
474
+ rotary_pos_emb = rotary_pos_emb[window_index, :, :]
475
+ rotary_pos_emb = rotary_pos_emb.reshape(seq_len, -1)
476
+ emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)
477
+ position_embeddings = (emb.cos(), emb.sin())
478
+
479
+ cu_seqlens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]).cumsum(
480
+ dim=0,
481
+ # Select dtype based on the following factors:
482
+ # - FA2 requires that cu_seqlens_q must have dtype int32
483
+ # - torch.onnx.export requires that cu_seqlens_q must have same dtype as grid_thw
484
+ # See https://github.com/huggingface/transformers/pull/34852 for more information
485
+ dtype=grid_thw.dtype if torch.jit.is_tracing() else torch.int32,
486
+ )
487
+ cu_seqlens = F.pad(cu_seqlens, (1, 0), value=0)
488
+
489
+ for layer_num, blk in enumerate(self.blocks):
490
+ if layer_num in self.fullatt_block_indexes:
491
+ cu_seqlens_now = cu_seqlens
492
+ else:
493
+ cu_seqlens_now = cu_window_seqlens
494
+
495
+ hidden_states = blk(
496
+ hidden_states,
497
+ cu_seqlens=cu_seqlens_now,
498
+ position_embeddings=position_embeddings,
499
+ **kwargs,
500
+ )
501
+
502
+ # Return hidden_states AND window_index for MMProjector to apply merger and reverse shuffle
503
+ return hidden_states, window_index
504
+
505
+
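The `cu_seqlens` tensor used by the full-attention blocks marks where each frame starts and ends in the flattened patch sequence. A minimal sketch of the same computation in plain Python, assuming `grid_thw` is given as a list of `(t, h, w)` tuples; `full_attention_cu_seqlens` is a hypothetical name used only here.

```python
from itertools import accumulate


def full_attention_cu_seqlens(grid_thw):
    """grid_thw: list of (t, h, w) tuples, one per image or video."""
    per_frame = []
    for t, h, w in grid_thw:
        # Equivalent of repeat_interleave(h * w, t): one entry per frame.
        per_frame.extend([h * w] * t)
    # Cumulative sum, then a leading 0 (the F.pad(..., (1, 0)) step).
    return [0] + list(accumulate(per_frame))
```

With one 4x6 image and one two-frame 2x2 video, the boundaries are `[0, 24, 28, 32]`: attention in a full-attention block never crosses these boundaries.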
506
+ class DiffusionVL_Qwen2_5_VL_VisionTower(nn.Module):
507
+
508
+ def __init__(self, config: DiffusionVL_Qwen2_5_VL_VisionConfig):
509
+ super().__init__()
510
+ self.vision_tower = DiffusionVL_Qwen2_5_VL_VisionTransformer(config)
511
+ self.spatial_merge_size = config.spatial_merge_size
512
+
513
+ def forward(self, hidden_states: torch.Tensor, grid_thw: torch.Tensor = None):
514
+ """Returns (hidden_states, window_index) tuple for MMProjector."""
515
+ return self.vision_tower(hidden_states, grid_thw)
516
+
517
+
518
+ class DiffusionVL_Qwen2_5_VL_MMProjector(nn.Module):
519
+
520
+ def __init__(self, config: DiffusionVL_Qwen2_5_VL_VisionConfig):
521
+ super().__init__()
522
+ self.merger = DiffusionVL_Qwen2_5_VL_VisionPatchMerger(
523
+ dim=config.out_hidden_size,
524
+ context_dim=config.hidden_size,
525
+ spatial_merge_size=config.spatial_merge_size,
526
+ )
527
+
528
+ def forward(self, features_tuple):
529
+ """Forward pass with merger and window index reversal."""
530
+ if isinstance(features_tuple, tuple):
531
+ hidden_states, window_index = features_tuple
532
+ # Apply merger
533
+ projected_features = self.merger(hidden_states)
534
+ # Reverse the window shuffle to restore original spatial order
535
+ reverse_indices = torch.argsort(window_index)
536
+ final_features = projected_features[reverse_indices, :]
537
+ return final_features
538
+ else:
539
+ # Fallback for simple tensor input
540
+ return self.merger(features_tuple)
541
+
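The projector undoes the window shuffle by indexing with `torch.argsort(window_index)`, which is the inverse permutation. A small pure-Python sketch of that idea, using lists instead of tensors; `argsort` and `unshuffle` are illustrative helpers, not functions from the source.

```python
def argsort(perm):
    """Indices that would sort perm; for a permutation this is its inverse."""
    return sorted(range(len(perm)), key=perm.__getitem__)


def unshuffle(shuffled, window_index):
    """Restore the original row order after rows were gathered by window_index."""
    reverse_indices = argsort(window_index)
    return [shuffled[i] for i in reverse_indices]


rows = ["a", "b", "c", "d"]
window_index = [2, 0, 3, 1]
shuffled = [rows[i] for i in window_index]  # same as rows[window_index] in torch
restored = unshuffle(shuffled, window_index)
```

Here `restored` equals `rows` again, which is the property the projector relies on to return features in the original spatial order.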
542
+ class DiffusionVL_Qwen2_5_VL_RotaryEmbedding(nn.Module):
543
+
544
+ def __init__(self, config):
545
+ super().__init__()
546
+ self.config = config
547
+ dim = config.hidden_size // config.num_attention_heads
548
+ inv_freq = 1.0 / (config.rope_theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
549
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
550
+
551
+ def forward(self, x, position_ids):
552
+ """
553
+ Args:
554
+ x: Input tensor for dtype reference
555
+ position_ids: Position IDs with shape (3, batch_size, seq_length) for M-RoPE
556
+ or (batch_size, seq_length) for standard RoPE (will be converted to 3D)
557
+
558
+ Returns:
559
+ cos, sin: Tensors of shape (3, batch, seq_len, head_dim) for M-RoPE
560
+ """
561
+ # Always convert 2D position_ids to 3D for M-RoPE
562
+ if position_ids.ndim == 2:
563
+ # (batch, seq) -> (3, batch, seq)
564
+ position_ids = position_ids.unsqueeze(0).expand(3, -1, -1)
565
+
566
+ # Now position_ids should be 3D: (3, batch_size, seq_length)
567
+ if position_ids.ndim == 3 and position_ids.shape[0] == 3:
568
+ # M-RoPE: position_ids shape is (3, batch_size, seq_length)
569
+ # Expand inv_freq to (3, batch_size, head_dim//2, 1)
570
+ inv_freq_expanded = self.inv_freq[None, None, :, None].float().expand(
571
+ 3, position_ids.shape[1], -1, 1
572
+ )
573
+ # position_ids_expanded shape: (3, batch_size, 1, seq_length)
574
+ position_ids_expanded = position_ids[:, :, None, :].float()
575
+
576
+ device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
577
+ with torch.autocast(device_type=device_type, enabled=False):
578
+ # freqs shape: (3, batch_size, seq_length, head_dim//2)
579
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(2, 3)
580
+ # emb shape: (3, batch_size, seq_length, head_dim)
581
+ emb = torch.cat((freqs, freqs), dim=-1)
582
+ cos = emb.cos()
583
+ sin = emb.sin()
584
+
585
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
586
+ else:
587
+ # Standard 1D RoPE (fallback)
588
+ inv_freq_expanded = self.inv_freq[None, :, None].expand(position_ids.shape[0], -1, 1)
589
+ position_ids_expanded = position_ids[:, None, :].float()
590
+ freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
591
+ emb = torch.cat((freqs, freqs), dim=-1)
592
+ cos = emb.cos()
593
+ sin = emb.sin()
594
+ return cos.to(x.dtype), sin.to(x.dtype)
595
+
596
+
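The rotary embedding above reduces to a simple per-position formula. The sketch below computes the cos/sin vectors for one scalar position in plain Python. The default `theta=10000.0` is an assumption made for the example (the model reads `config.rope_theta`), and `rope_cos_sin` is a hypothetical helper name.

```python
import math


def rope_cos_sin(position: int, head_dim: int, theta: float = 10000.0):
    """Return (cos, sin) lists of length head_dim for a single position."""
    # inv_freq[k] = 1 / theta^(2k / head_dim), one value per rotated pair.
    inv_freq = [1.0 / (theta ** (i / head_dim)) for i in range(0, head_dim, 2)]
    freqs = [position * f for f in inv_freq]
    # Duplicate the half-dimension angles, as torch.cat((freqs, freqs), dim=-1) does.
    emb = freqs + freqs
    return [math.cos(a) for a in emb], [math.sin(a) for a in emb]
```

For M-RoPE the same computation runs three times, once per position axis, which is where the leading dimension of size 3 in the returned tensors comes from.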
597
+ class DiffusionVL_Qwen2_5_VL_MLP(nn.Module):
598
+ def __init__(self, config, bias: bool = False):
599
+ super().__init__()
600
+ self.hidden_size = config.hidden_size
601
+ self.intermediate_size = config.intermediate_size
602
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=bias)
603
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=bias)
604
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=bias)
605
+ self.act_fn = ACT2FN[config.hidden_act]
606
+
607
+ def forward(self, hidden_state):
608
+ return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))
609
+
610
+
611
+ class DiffusionVL_Qwen2_5_VL_Attention(nn.Module):
612
+ """Non-causal attention for diffusion-based generation with KV-cache support."""
613
+
614
+ def __init__(self, config, layer_idx):
615
+ super().__init__()
616
+ self.config = config
617
+ self.layer_idx = layer_idx
618
+ self.hidden_size = config.hidden_size
619
+ self.num_heads = config.num_attention_heads
620
+ self.head_dim = self.hidden_size // self.num_heads
621
+ self.num_key_value_heads = config.num_key_value_heads
622
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
623
+ self.scaling = self.head_dim ** -0.5
624
+
625
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True)
626
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
627
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
628
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
629
+
630
+ # Non-causal for diffusion
631
+ self.is_causal = False
632
+
633
+ def forward(
634
+ self,
635
+ hidden_states,
636
+ attention_mask=None,
637
+ position_ids=None,
638
+ past_key_values=None,
639
+ output_attentions=False,
640
+ use_cache=False,
641
+ cache_position=None,
642
+ position_embeddings=None,
643
+ store_kv=False,
644
+ **kwargs,
645
+ ):
646
+ bsz, q_len, _ = hidden_states.size()
647
+
648
+ query_states = self.q_proj(hidden_states)
649
+ key_states = self.k_proj(hidden_states)
650
+ value_states = self.v_proj(hidden_states)
651
+
652
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
653
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
654
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
655
+
656
+ if position_embeddings is not None:
657
+ cos, sin = position_embeddings
658
+ query_states, key_states = apply_multimodal_rotary_pos_emb(
659
+ query_states, key_states, cos, sin,
660
+ self.config.rope_scaling.get("mrope_section", [16, 24, 24])
661
+ )
662
+
663
+ # KV cache handling with store_kv support
664
+ if past_key_values is not None and use_cache:
665
+ cache_kwargs = {"cache_position": cache_position}
666
+ if store_kv:
667
+ # Store current KV to cache (for prefill or final step)
668
+ key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
669
+ else:
670
+ # Use cached KV but don't update (for diffusion steps within a block)
671
+ cached_key = past_key_values.key_cache[self.layer_idx] if self.layer_idx < len(past_key_values.key_cache) else None
672
+ cached_value = past_key_values.value_cache[self.layer_idx] if self.layer_idx < len(past_key_values.value_cache) else None
673
+ if cached_key is not None and cached_value is not None:
674
+ key_states = torch.cat([cached_key, key_states], dim=2)
675
+ value_states = torch.cat([cached_value, value_states], dim=2)
676
+
677
+ # Repeat KV for GQA
678
+ key_states = key_states.repeat_interleave(self.num_key_value_groups, dim=1)
679
+ value_states = value_states.repeat_interleave(self.num_key_value_groups, dim=1)
680
+
681
+ # Handle dict-format attention_mask (for BD3LM compatibility)
682
+ if attention_mask is not None:
683
+ if isinstance(attention_mask, dict):
684
+ # Use full_attention mask for all layers (simplified)
685
+ attn_mask = attention_mask.get("full_attention", None)
686
+ else:
687
+ attn_mask = attention_mask
688
+ else:
689
+ attn_mask = None
690
+
691
+ if attn_mask is not None:
692
+ attn_output = F.scaled_dot_product_attention(
693
+ query_states,
694
+ key_states,
695
+ value_states,
696
+ attn_mask=attn_mask,
697
+ dropout_p=0.0,
698
+ is_causal=False,
699
+ scale=self.scaling,
700
+ )
701
+ else:
702
+ attn_output = F.scaled_dot_product_attention(
703
+ query_states,
704
+ key_states,
705
+ value_states,
706
+ dropout_p=0.0,
707
+ is_causal=False,
708
+ scale=self.scaling,
709
+ )
710
+
711
+ attn_output = attn_output.transpose(1, 2).reshape(bsz, q_len, -1)
712
+ attn_output = self.o_proj(attn_output)
713
+
714
+ return attn_output, None
715
+
716
+
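The `store_kv` flag controls whether the keys and values of the current tokens are committed to the cache or only used for this one call. A toy sketch of that behaviour for one layer, using Python lists in place of tensors; `ToyLayerCache` is a hypothetical stand-in and not the `transformers` cache API.

```python
class ToyLayerCache:
    """Keeps committed keys for one layer; values would be handled the same way."""

    def __init__(self):
        self.keys = []

    def visible_keys(self, new_keys, store_kv: bool):
        # Attention always sees the cached keys followed by the new ones.
        visible = self.keys + list(new_keys)
        if store_kv:
            # Prefill or final step of a block: commit the new keys.
            self.keys = visible
        # Otherwise (intermediate denoising steps) the cache stays untouched.
        return visible


cache = ToyLayerCache()
prefill = cache.visible_keys(["p0", "p1"], store_kv=True)
draft = cache.visible_keys(["m0", "m1"], store_kv=False)
after_draft = list(cache.keys)
final = cache.visible_keys(["t0", "t1"], store_kv=True)
```

The draft step sees four keys but leaves only the two prompt keys cached, so the still-masked tokens of a block never pollute the cache until the block is finalised.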
717
+ class DiffusionVL_Qwen2_5_VL_DecoderLayer(nn.Module):
718
+ def __init__(self, config, layer_idx):
719
+ super().__init__()
720
+ self.hidden_size = config.hidden_size
721
+ self.self_attn = DiffusionVL_Qwen2_5_VL_Attention(config, layer_idx)
722
+ self.mlp = DiffusionVL_Qwen2_5_VL_MLP(config)
723
+ self.input_layernorm = DiffusionVL_Qwen2_5_VL_RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
724
+ self.post_attention_layernorm = DiffusionVL_Qwen2_5_VL_RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
725
+
726
+ def forward(
727
+ self,
728
+ hidden_states,
729
+ attention_mask=None,
730
+ position_ids=None,
731
+ past_key_values=None,
732
+ output_attentions=False,
733
+ use_cache=False,
734
+ cache_position=None,
735
+ position_embeddings=None,
736
+ store_kv=False,
737
+ **kwargs,
738
+ ):
739
+ residual = hidden_states
740
+ hidden_states = self.input_layernorm(hidden_states)
741
+
742
+ hidden_states, attn_weights = self.self_attn(
743
+ hidden_states=hidden_states,
744
+ attention_mask=attention_mask,
745
+ position_ids=position_ids,
746
+ past_key_values=past_key_values,
747
+ output_attentions=output_attentions,
748
+ use_cache=use_cache,
749
+ cache_position=cache_position,
750
+ position_embeddings=position_embeddings,
751
+ store_kv=store_kv,
752
+ **kwargs,
753
+ )
754
+ hidden_states = residual + hidden_states
755
+
756
+ residual = hidden_states
757
+ hidden_states = self.post_attention_layernorm(hidden_states)
758
+ hidden_states = self.mlp(hidden_states)
759
+ hidden_states = residual + hidden_states
760
+
761
+ return hidden_states, attn_weights
762
+
763
+ class DiffusionVL_Qwen2_5_VL_PreTrainedModel(PreTrainedModel):
764
+
765
+ config_class = DiffusionVL_Qwen2_5_VL_Config
766
+ base_model_prefix = "model"
767
+ input_modalities = ["image", "video", "text"]
768
+ supports_gradient_checkpointing = True
769
+ _no_split_modules = ["DiffusionVL_Qwen2_5_VL_DecoderLayer", "DiffusionVL_Qwen2_5_VL_VisionBlock"]
770
+ _skip_keys_device_placement = "past_key_values"
771
+ _supports_flash_attn = True
772
+ _supports_sdpa = True
773
+
774
+ _can_compile_fullgraph = True
775
+ _supports_attention_backend = True
776
+
777
+
778
+ class DiffusionVL_Qwen2_5_VL_Model(DiffusionVL_Qwen2_5_VL_PreTrainedModel):
779
+
780
+ def __init__(self, config: DiffusionVL_Qwen2_5_VL_Config):
781
+ super().__init__(config)
782
+ self.config = config
783
+
784
+ # Vision components (matching weight keys)
785
+ self.vision_tower = DiffusionVL_Qwen2_5_VL_VisionTower(config.vision_config)
786
+ self.mm_projector = DiffusionVL_Qwen2_5_VL_MMProjector(config.vision_config)
787
+
788
+ # Text components
789
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
790
+ self.layers = nn.ModuleList([
791
+ DiffusionVL_Qwen2_5_VL_DecoderLayer(config, layer_idx)
792
+ for layer_idx in range(config.num_hidden_layers)
793
+ ])
794
+ self.norm = DiffusionVL_Qwen2_5_VL_RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
795
+ self.rotary_emb = DiffusionVL_Qwen2_5_VL_RotaryEmbedding(config)
796
+
797
+ # BD3LM block size
798
+ self.bd3lm_block_size = config.bd3lm_block_size
799
+
800
+ self.post_init()
801
+
802
+ def get_input_embeddings(self):
803
+ return self.embed_tokens
804
+
805
+ def set_input_embeddings(self, value):
806
+ self.embed_tokens = value
807
+
808
+ def get_image_features(self, pixel_values: torch.FloatTensor, image_grid_thw: Optional[torch.LongTensor] = None):
809
+ """
810
+ Encodes images into continuous embeddings through vision tower and mm_projector.
811
+
812
+ Args:
813
+ pixel_values: Image tensor
814
+ image_grid_thw: Grid dimensions (temporal, height, width) for each image
815
+
816
+ Returns:
817
+ Image embeddings ready to be merged with text embeddings
818
+ """
819
+ pixel_values = pixel_values.to(dtype=self.vision_tower.vision_tower.patch_embed.proj.weight.dtype)
820
+ hidden_states = self.vision_tower(pixel_values, image_grid_thw)
821
+ image_embeds = self.mm_projector(hidden_states)
822
+ return image_embeds
823
+
824
+ def forward(
825
+ self,
826
+ input_ids=None,
827
+ attention_mask=None,
828
+ position_ids=None,
829
+ past_key_values=None,
830
+ inputs_embeds=None,
831
+ use_cache=None,
832
+ output_attentions=None,
833
+ output_hidden_states=None,
834
+ return_dict=None,
835
+ cache_position=None,
836
+ store_kv=False,
837
+ pixel_values=None,
838
+ image_grid_thw=None,
839
+ **kwargs,
840
+ ):
841
+ """Forward pass with optional vision input processing."""
842
+ output_attentions = output_attentions or False
843
+ output_hidden_states = output_hidden_states or False
844
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
845
+ return_dict = return_dict if return_dict is not None else True
846
+
847
+ IMAGE_TOKEN_INDEX = -200
848
+
849
+ if inputs_embeds is None:
850
+ inputs_embeds = self.embed_tokens(input_ids)
851
+
852
+ if pixel_values is not None and image_grid_thw is not None:
853
+ # Get image features
854
+ image_features = self.get_image_features(pixel_values, image_grid_thw)
855
+
856
+ # Split features per image
857
+ spatial_merge_size = self.vision_tower.spatial_merge_size
858
+ split_sizes = (image_grid_thw.prod(dim=1) // (spatial_merge_size ** 2)).tolist()
859
+ image_features_list = list(torch.split(image_features, split_sizes))
860
+
861
+ # Replace IMAGE_TOKEN positions with image features
862
+ batch_size = input_ids.shape[0] if input_ids is not None else inputs_embeds.shape[0]
863
+ new_inputs_embeds_list = []
864
+
865
+ for batch_idx in range(batch_size):
866
+ cur_input_ids = input_ids[batch_idx] if input_ids is not None else None
867
+ cur_embeds = inputs_embeds[batch_idx]
868
+
869
+ if cur_input_ids is None or (cur_input_ids == IMAGE_TOKEN_INDEX).sum() == 0:
870
+ new_inputs_embeds_list.append(cur_embeds)
871
+ continue
872
+
873
+ # Find IMAGE_TOKEN positions
874
+ image_positions = torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist()
875
+ image_token_indices = [-1] + image_positions + [len(cur_input_ids)]
876
+
877
+ # Split embeddings and interleave with image features
878
+ cur_new_embeds = []
879
+ cur_image_idx = 0
880
+
881
+ for i in range(len(image_token_indices) - 1):
882
+ start = image_token_indices[i] + 1
883
+ end = image_token_indices[i + 1]
884
+
885
+ # Add text segment
886
+ if start < end:
887
+ cur_new_embeds.append(cur_embeds[start:end])
888
+
889
+ # Add image features (before the next segment, except after last)
890
+ if i < len(image_positions) and cur_image_idx < len(image_features_list):
891
+ cur_new_embeds.append(image_features_list[cur_image_idx].to(cur_embeds.dtype))
892
+ cur_image_idx += 1
893
+
894
+ if cur_new_embeds:
895
+ new_inputs_embeds_list.append(torch.cat(cur_new_embeds, dim=0))
896
+ else:
897
+ new_inputs_embeds_list.append(cur_embeds)
898
+
899
+ # Pad and stack
900
+ max_len = max(x.shape[0] for x in new_inputs_embeds_list)
901
+ hidden_size = new_inputs_embeds_list[0].shape[-1]
902
+ inputs_embeds = torch.zeros(
903
+ batch_size, max_len, hidden_size,
904
+ dtype=new_inputs_embeds_list[0].dtype,
905
+ device=new_inputs_embeds_list[0].device
906
+ )
907
+ for i, embed in enumerate(new_inputs_embeds_list):
908
+ inputs_embeds[i, :embed.shape[0]] = embed
909
+
910
+ batch_size, seq_length = inputs_embeds.shape[:2]
911
+
912
+ if cache_position is None:
913
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
914
+ cache_position = torch.arange(past_seen_tokens, past_seen_tokens + seq_length, device=inputs_embeds.device)
915
+
916
+ if position_ids is None:
917
+ # position_ids will be converted to 3D for M-RoPE in rotary_emb
918
+ position_ids = cache_position.unsqueeze(0)
919
+
920
+ # Position embeddings
921
+ position_embeddings = self.rotary_emb(inputs_embeds, position_ids)
922
+
923
+ hidden_states = inputs_embeds
924
+ all_hidden_states = () if output_hidden_states else None
925
+ all_attentions = () if output_attentions else None
926
+
927
+ for layer in self.layers:
928
+ if output_hidden_states:
929
+ all_hidden_states += (hidden_states,)
930
+
931
+ hidden_states, attn_weights = layer(
932
+ hidden_states,
933
+ attention_mask=attention_mask,
934
+ position_ids=position_ids,
935
+ past_key_values=past_key_values,
936
+ output_attentions=output_attentions,
937
+ use_cache=use_cache,
938
+ cache_position=cache_position,
939
+ position_embeddings=position_embeddings,
940
+ store_kv=store_kv,
941
+ )
942
+
943
+ if output_attentions:
944
+ all_attentions += (attn_weights,)
945
+
946
+ hidden_states = self.norm(hidden_states)
947
+
948
+ if output_hidden_states:
949
+ all_hidden_states += (hidden_states,)
950
+
951
+ return BaseModelOutputWithPast(
952
+ last_hidden_state=hidden_states,
953
+ past_key_values=past_key_values,
954
+ hidden_states=all_hidden_states,
955
+ attentions=all_attentions,
956
+ )
957
+
958
+
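The placeholder replacement in `forward` can be summarised as: keep every text embedding, and swap each `IMAGE_TOKEN_INDEX` position for the full block of features of the next image. A minimal sketch for a single batch row, with strings standing in for embedding vectors; `splice_image_features` is a hypothetical helper, and it models one row only because the source resets `cur_image_idx` to 0 for every row.

```python
IMAGE_TOKEN_INDEX = -200


def splice_image_features(input_ids, text_embeds, image_features_list):
    """Replace each image placeholder with that image's feature rows."""
    out = []
    image_idx = 0
    for token_id, embed in zip(input_ids, text_embeds):
        if token_id == IMAGE_TOKEN_INDEX:
            # The placeholder embedding itself is dropped; if no features
            # remain, nothing is inserted in its place.
            if image_idx < len(image_features_list):
                out.extend(image_features_list[image_idx])
                image_idx += 1
        else:
            out.append(embed)
    return out


merged = splice_image_features(
    [1, IMAGE_TOKEN_INDEX, 2],
    ["e1", "unused", "e2"],
    [["f0", "f1", "f2"]],
)
```

The merged row is longer than the input row whenever an image contributes more than one feature, which is why the rows are zero-padded to a common length before stacking.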
959
+ class DiffusionVL_Qwen2_5_VL_ForConditionalGeneration(DiffusionVL_Qwen2_5_VL_PreTrainedModel):
960
+ r"""
961
+ DiffusionVL Model with a language modeling head for diffusion-based generation.
962
+
963
+ This model uses block diffusion instead of autoregressive
964
+ generation. The `generate()` method implements the diffusion denoising process.
965
+
966
+ """
967
+
968
+ # Weight tying keys - used when tie_word_embeddings=True
969
+ _tied_weights_keys = ["lm_head.weight"]
970
+
971
+ def __init__(self, config: DiffusionVL_Qwen2_5_VL_Config):
972
+ super().__init__(config)
973
+ self.model = DiffusionVL_Qwen2_5_VL_Model(config)
974
+ self.vocab_size = config.vocab_size
975
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
976
+
977
+ # Diffusion parameters
978
+ self.mask_token_id = config.mask_token_id
979
+ self.block_size = config.bd3lm_block_size
980
+
981
+ self.post_init()
982
+
983
+ def get_model(self):
984
+ return self.model
985
+
986
+ def get_input_embeddings(self):
987
+ return self.model.embed_tokens
988
+
989
+ def set_input_embeddings(self, value):
990
+ self.model.embed_tokens = value
991
+
992
+ def tie_weights(self):
993
+ """Tie weights if config.tie_word_embeddings is True (3B model)."""
994
+ if getattr(self.config, "tie_word_embeddings", False):
995
+ # Call parent's tie_weights to tie lm_head with embed_tokens
996
+ super().tie_weights()
997
+ # else: do nothing, keep separate lm_head weights (7B model)
998
+
999
+ def get_output_embeddings(self):
1000
+ return self.lm_head
1001
+
1002
+ def set_output_embeddings(self, new_embeddings):
1003
+ self.lm_head = new_embeddings
1004
+
1005
+ def forward(
1006
+ self,
1007
+ input_ids=None,
1008
+ attention_mask=None,
1009
+ position_ids=None,
1010
+ past_key_values=None,
1011
+ inputs_embeds=None,
1012
+ labels=None,
1013
+ use_cache=None,
1014
+ output_attentions=None,
1015
+ output_hidden_states=None,
1016
+ return_dict=None,
1017
+ pixel_values=None,
1018
+ image_grid_thw=None,
1019
+ **kwargs,
1020
+ ):
1021
+ return_dict = return_dict if return_dict is not None else True
1022
+
1023
+ # Handle vision inputs if provided
1024
+ if pixel_values is not None and inputs_embeds is None:
1025
+ # Get vision features and merge with text
1026
+ vision_features = self.model.vision_tower(pixel_values, image_grid_thw)
1027
+ inputs_embeds = self._merge_vision_text(input_ids, vision_features)
1028
+ input_ids = None
1029
+
1030
+ outputs = self.model(
1031
+ input_ids=input_ids,
1032
+ attention_mask=attention_mask,
1033
+ position_ids=position_ids,
1034
+ past_key_values=past_key_values,
1035
+ inputs_embeds=inputs_embeds,
1036
+ use_cache=use_cache,
1037
+ output_attentions=output_attentions,
1038
+ output_hidden_states=output_hidden_states,
1039
+ return_dict=True,
1040
+ )
1041
+
1042
+ hidden_states = outputs.last_hidden_state
1043
+ logits = self.lm_head(hidden_states)
1044
+
1045
+ loss = None
1046
+ if labels is not None:
1047
+ shift_logits = logits[..., :-1, :].contiguous()
1048
+ shift_labels = labels[..., 1:].contiguous()
1049
+ loss = F.cross_entropy(
1050
+ shift_logits.view(-1, self.vocab_size),
1051
+ shift_labels.view(-1),
1052
+ ignore_index=-100,
1053
+ )
1054
+
1055
+ return CausalLMOutputWithPast(
1056
+ loss=loss,
1057
+ logits=logits,
1058
+ past_key_values=outputs.past_key_values,
1059
+ hidden_states=outputs.hidden_states,
1060
+ attentions=outputs.attentions,
1061
+ )
1062
+
1063
+ def _merge_vision_text(self, input_ids, vision_features):
1064
+ """Merge vision features with text embeddings."""
1065
+ text_embeds = self.model.embed_tokens(input_ids)
1066
+ # Simple placeholder - full implementation would properly insert vision tokens
1067
+ return text_embeds
1068
+
1069
+ @torch.no_grad()
1070
+ def generate(
1071
+ self,
1072
+ inputs: Optional[torch.Tensor] = None,
1073
+ images: Optional[torch.Tensor] = None,
1074
+ image_sizes: Optional[torch.Tensor] = None,
1075
+ image_grid_thws: Optional[torch.Tensor] = None,
1076
+ modalities: Optional[List] = None,
1077
+ gen_length: int = 256,
1078
+ steps: int = 16,
1079
+ temperature: float = 0.0,
1080
+ confidence_threshold: float = 0.8,
1081
+ **kwargs,
1082
+ ):
1083
+ """
1084
+ Diffusion-based generation using the BD3LM algorithm.
1085
+
1086
+ Follows the same logic as DiffusionVLQwenVLForCausalLM.generate():
1087
+ 1. If images provided, call prepare_inputs_labels_for_multimodal
1088
+ 2. Otherwise, just embed the input tokens
1089
+ 3. Call generate_with_bd3lm
1090
+
1091
+ Args:
1092
+ inputs: Input token IDs (prompt) [batch_size, seq_len]
1093
+ images: Image tensor (pixel_values) for vision inputs
1094
+ image_sizes: Image sizes
1095
+ image_grid_thws: Grid dimensions for vision inputs (num_images, 3)
1096
+ modalities: List of modalities (e.g., ["image"])
1097
+ gen_length: Number of tokens to generate
1098
+ steps: Number of diffusion steps per block
1099
+ temperature: Sampling temperature (0 for greedy)
1100
+
1101
+ Returns:
1102
+ Generated token IDs
1103
+ """
1104
+ if modalities is None:
1105
+ modalities = ["image"]
1106
+
1107
+ if images is not None:
1108
+ inputs_embeds = self.prepare_inputs_labels_for_multimodal(
1109
+ input_ids=inputs,
1110
+ images=images,
1111
+ image_grid_thws=image_grid_thws,
1112
+ )
1113
+ else:
1114
+ inputs_embeds = self.get_input_embeddings()(inputs)
1115
+
1116
+ # Call the BD3LM generation
1117
+ return self.generate_with_bd3lm(
1118
+ inputs_embeds=inputs_embeds,
1119
+ gen_length=gen_length,
1120
+ steps=steps,
1121
+ temperature=temperature,
1122
+ confidence_threshold=confidence_threshold,
1123
+ **kwargs,
1124
+ )
1125
+
1126
+ def prepare_inputs_labels_for_multimodal(
1127
+ self,
1128
+ input_ids: torch.Tensor,
1129
+ images: torch.Tensor,
1130
+ image_grid_thws: Optional[torch.Tensor] = None,
1131
+ ) -> torch.Tensor:
1132
+ """
1133
+ Prepare inputs_embeds by merging text embeddings with image features.
1134
+
1135
+ Uses LLaVA format: IMAGE_TOKEN_INDEX (-200) as placeholder.
1136
+
1137
+ Args:
1138
+ input_ids: Input token IDs with IMAGE_TOKEN_INDEX (-200) as image placeholders
1139
+ images: Pixel values tensor
1140
+ image_grid_thws: Grid dimensions for each image
1141
+
1142
+ Returns:
1143
+ inputs_embeds: Merged text + image embeddings
1144
+ """
1145
+ IMAGE_TOKEN_INDEX = -200
1146
+
1147
+ device = input_ids.device
1148
+ batch_size = input_ids.shape[0]
1149
+
1150
+ # Convert image_grid_thws to tensor if needed
1151
+ if image_grid_thws is not None:
1152
+ if not isinstance(image_grid_thws, torch.Tensor):
1153
+ image_grid_thw = torch.tensor(image_grid_thws, device=device)
1154
+ else:
1155
+ image_grid_thw = image_grid_thws.to(device)
1156
+ else:
1157
+ raise ValueError("image_grid_thws is required for vision processing")
1158
+
1159
+ # 1. Get image features through vision tower + mm_projector
1160
+ image_features = self.model.get_image_features(images, image_grid_thw)
1161
+
1162
+ # 2. Split features per image based on grid_thw
1163
+ spatial_merge_size = self.model.vision_tower.spatial_merge_size
1164
+ split_sizes = (image_grid_thw.prod(dim=1) // (spatial_merge_size ** 2)).tolist()
1165
+ image_features_list = list(torch.split(image_features, split_sizes))
1166
+
1167
+ # 3. Build new input embeddings (LLaVA format)
1168
+ new_input_embeds_list = []
1169
+
1170
+ for batch_idx in range(batch_size):
1171
+ cur_input_ids = input_ids[batch_idx]
1172
+ num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum().item()
1173
+
1174
+ if num_images == 0:
1175
+ # No image tokens, just embed text
1176
+ cur_input_embeds = self.get_input_embeddings()(cur_input_ids)
1177
+ new_input_embeds_list.append(cur_input_embeds)
1178
+ continue
1179
+
1180
+ # LLaVA format: IMAGE_TOKEN_INDEX (-200) as placeholder
1181
+ image_token_indices = [-1] + torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist() + [len(cur_input_ids)]
1182
+
1183
+ cur_input_ids_noim = []
1184
+ for idx in range(len(image_token_indices) - 1):
1185
+ start = image_token_indices[idx] + 1
1186
+ end = image_token_indices[idx + 1]
1187
+ if start < end:
1188
+ cur_input_ids_noim.append(cur_input_ids[start:end])
1189
+
1190
+ if cur_input_ids_noim:
1191
+ cur_input_embeds_noim = self.get_input_embeddings()(torch.cat(cur_input_ids_noim))
1192
+ split_sizes_text = [x.shape[0] for x in cur_input_ids_noim]
1193
+ cur_input_embeds_noim_split = list(torch.split(cur_input_embeds_noim, split_sizes_text))
1194
+ else:
1195
+ cur_input_embeds_noim_split = []
1196
+
1197
+ cur_new_input_embeds = []
1198
+ cur_image_idx = 0
1199
+
1200
+ for idx in range(num_images + 1):
1201
+ if idx < len(cur_input_embeds_noim_split):
1202
+ cur_new_input_embeds.append(cur_input_embeds_noim_split[idx])
1203
+ if idx < num_images and cur_image_idx < len(image_features_list):
1204
+ cur_image_features = image_features_list[cur_image_idx]
1205
+ target_dtype = cur_input_embeds_noim_split[0].dtype if cur_input_embeds_noim_split else images.dtype
1206
+ cur_new_input_embeds.append(cur_image_features.to(target_dtype))
1207
+ cur_image_idx += 1
1208
+
1209
+ if cur_new_input_embeds:
1210
+ # Ensure all tensors are on the same device before cat (multi-GPU support)
1211
+ target_device = cur_new_input_embeds[0].device
1212
+ cur_new_input_embeds = [t.to(target_device) for t in cur_new_input_embeds]
1213
+ cur_new_input_embeds = torch.cat(cur_new_input_embeds, dim=0)
1214
+ else:
1215
+ cur_new_input_embeds = self.get_input_embeddings()(cur_input_ids)
1216
+
1217
+ new_input_embeds_list.append(cur_new_input_embeds)
1218
+
1219
+ # 4. Pad to same length and stack
1220
+ max_len = max(x.shape[0] for x in new_input_embeds_list)
1221
+ hidden_size = new_input_embeds_list[0].shape[-1]
1222
+ dtype = new_input_embeds_list[0].dtype
1223
+
1224
+ inputs_embeds = torch.zeros(batch_size, max_len, hidden_size, dtype=dtype, device=device)
1225
+ for i, embed in enumerate(new_input_embeds_list):
1226
+ inputs_embeds[i, :embed.shape[0]] = embed.to(device)
1227
+
1228
+ return inputs_embeds
1229
+ # Runs correctly
1230
+ # @torch.no_grad()
1231
+ # def generate_with_bd3lm(
1232
+ # self,
1233
+ # inputs_embeds: torch.FloatTensor,
1234
+ # gen_length: int = 256,
1235
+ # steps: int = 8, # kept for API compatibility; not used in dynamic-commit decoding
1236
+ # temperature: float = 0.0,
1237
+ # top_k: int = 0,
1238
+ # top_p: float = 1.0,
1239
+ # remasking_strategy: str = "dynamic_commit", # kept for API compatibility
1240
+ # confidence_threshold: float = 0.85,
1241
+ # min_commit_tokens: int = 1,
1242
+ # **kwargs,
1243
+ # ):
1244
+ # """
1245
+ # Dynamic-commit causal generation with delayed/fused KV-cache update.
1246
+
1247
+ # Difference from the naive version:
1248
+ # Naive:
1249
+ # 1. predict mask window, store_kv=False
1250
+ # 2. commit tokens
1251
+ # 3. forward committed tokens again, store_kv=True
1252
+
1253
+ # This version:
1254
+ # 1. predict first mask window, store_kv=False
1255
+ # 2. commit tokens, but do not immediately cache them
1256
+ # 3. next round: forward [pending committed tokens + new mask window]
1257
+ # 4. store_kv=True, then crop cache to remove the new mask window
1258
+ # 5. logits are taken only from the new mask window positions
1259
+
1260
+ # This fuses:
1261
+ # previous clean-cache forward + next mask-prediction forward
1262
+ # into one model call.
1263
+
1264
+ # Important:
1265
+ # This requires the returned DynamicCache to support crop(max_length).
1266
+ # If not, use the fallback helper below or add crop support to your cache.
1267
+ # """
1268
+ # from transformers.cache_utils import DynamicCache
1269
+
1270
+ # device = inputs_embeds.device
1271
+ # dtype = inputs_embeds.dtype
1272
+ # batch_size = inputs_embeds.shape[0]
1273
+ # prompt_len = inputs_embeds.shape[1]
1274
+
1275
+ # if gen_length <= 0:
1276
+ # return torch.empty((batch_size, 0), dtype=torch.long, device=device)
1277
+
1278
+ # block_size = getattr(self, "block_size", None)
1279
+ # if block_size is None:
1280
+ # block_size = getattr(getattr(self, "model", None), "bd3lm_block_size", None)
1281
+ # if block_size is None:
1282
+ # raise ValueError("Cannot find block_size from self.block_size or self.model.bd3lm_block_size.")
1283
+
1284
+ # block_size = steps
1285
+
1286
+ # mask_id = getattr(self, "mask_token_id", 151671)
1287
+ # eos_token_id = kwargs.get("eos_token_id", 151645)
1288
+
1289
+ # embed_layer = self.get_input_embeddings()
1290
+ # embed_device = embed_layer.weight.device
1291
+
1292
+ # max_total_len = prompt_len + gen_length + block_size
1293
+
1294
+ # position_ids_base = torch.arange(
1295
+ # max_total_len,
1296
+ # device=device,
1297
+ # dtype=torch.long,
1298
+ # ).unsqueeze(0)
1299
+
1300
+ # # Cache mask embedding once.
1301
+ # mask_id_tensor = torch.tensor([[mask_id]], dtype=torch.long, device=embed_device)
1302
+ # mask_embed_vec = embed_layer(mask_id_tensor)
1303
+ # if mask_embed_vec.device != device:
1304
+ # mask_embed_vec = mask_embed_vec.to(device=device, non_blocking=True)
1305
+
1306
+ # # def crop_past_key_values(past_key_values, max_length: int):
1307
+ # # """
1308
+ # # Keep only the first max_length cached tokens.
1309
+
1310
+ # # This is essential because in the fused step we temporarily cache:
1311
+ # # [pending committed tokens + current mask tokens]
1312
+
1313
+ # # But after the forward, only pending committed tokens should remain cached.
1314
+ # # """
1315
+ # # if hasattr(past_key_values, "crop"):
1316
+ # # past_key_values.crop(max_length)
1317
+ # # return past_key_values
1318
+
1319
+ # # # Fallback for older DynamicCache-like implementations.
1320
+ # # # Standard HF cache shape is usually [B, num_heads, seq_len, head_dim].
1321
+ # # if hasattr(past_key_values, "key_cache") and hasattr(past_key_values, "value_cache"):
1322
+ # # for layer_idx in range(len(past_key_values.key_cache)):
1323
+ # # if past_key_values.key_cache[layer_idx] is not None:
1324
+ # # past_key_values.key_cache[layer_idx] = past_key_values.key_cache[layer_idx][..., :max_length, :]
1325
+ # # if past_key_values.value_cache[layer_idx] is not None:
1326
+ # # past_key_values.value_cache[layer_idx] = past_key_values.value_cache[layer_idx][..., :max_length, :]
1327
+ # # return past_key_values
1328
+
1329
+ # # raise RuntimeError(
1330
+ # # "past_key_values does not support crop(max_length), and no key_cache/value_cache fallback was found."
1331
+ # # )
1332
+ # def crop_past_key_values(past_key_values, max_length: int):
1333
+ # before_len = None
1334
+ # if hasattr(past_key_values, "get_seq_length"):
1335
+ # before_len = past_key_values.get_seq_length()
1336
+
1337
+ # if hasattr(past_key_values, "crop"):
1338
+ # print(
1339
+ # f"[CACHE CROP] using DynamicCache.crop | before={before_len}, target={max_length}",
1340
+ # flush=True,
1341
+ # )
1342
+ # past_key_values.crop(max_length)
1343
+
1344
+ # after_len = None
1345
+ # if hasattr(past_key_values, "get_seq_length"):
1346
+ # after_len = past_key_values.get_seq_length()
1347
+
1348
+ # print(
1349
+ # f"[CACHE CROP] after={after_len}",
1350
+ # flush=True,
1351
+ # )
1352
+
1353
+ # if after_len is not None and after_len != max_length:
1354
+ # raise RuntimeError(
1355
+ # f"DynamicCache.crop failed: target={max_length}, after={after_len}"
1356
+ # )
1357
+
1358
+ # return past_key_values
1359
+
1360
+ # print(
1361
+ # f"[CACHE CROP] DynamicCache.crop not found; using fallback | before={before_len}, target={max_length}",
1362
+ # flush=True,
1363
+ # )
1364
+
1365
+ # if hasattr(past_key_values, "key_cache") and hasattr(past_key_values, "value_cache"):
1366
+ # for layer_idx in range(len(past_key_values.key_cache)):
1367
+ # if past_key_values.key_cache[layer_idx] is not None:
1368
+ # past_key_values.key_cache[layer_idx] = past_key_values.key_cache[layer_idx][..., :max_length, :]
1369
+ # if past_key_values.value_cache[layer_idx] is not None:
1370
+ # past_key_values.value_cache[layer_idx] = past_key_values.value_cache[layer_idx][..., :max_length, :]
1371
+
1372
+ # after_len = None
1373
+ # if hasattr(past_key_values, "get_seq_length"):
1374
+ # after_len = past_key_values.get_seq_length()
1375
+
1376
+ # print(
1377
+ # f"[CACHE CROP] fallback after={after_len}",
1378
+ # flush=True,
1379
+ # )
1380
+
1381
+ # if after_len is not None and after_len != max_length:
1382
+ # raise RuntimeError(
1383
+ # f"Fallback crop failed: target={max_length}, after={after_len}"
1384
+ # )
1385
+
1386
+ # return past_key_values
1387
+
1388
+ # raise RuntimeError(
1389
+ # "past_key_values does not support crop(max_length), and no key_cache/value_cache fallback was found."
1390
+ # )
1391
+
1392
+ # def make_causal_mask(q_len: int, kv_len: int, past_len: int):
1393
+ # """
1394
+ # Query absolute positions:
1395
+ # past_len ... past_len + q_len - 1
1396
+
1397
+ # Key absolute positions:
1398
+ # 0 ... kv_len - 1
1399
+
1400
+ # A query can attend to keys with absolute position <= itself.
1401
+ # """
1402
+ # q_abs = torch.arange(past_len, past_len + q_len, device=device)[:, None]
1403
+ # k_abs = torch.arange(kv_len, device=device)[None, :]
1404
+
1405
+ # allowed = k_abs <= q_abs
1406
+
1407
+ # attn_mask = torch.zeros((q_len, kv_len), dtype=dtype, device=device)
1408
+ # attn_mask.masked_fill_(~allowed, torch.finfo(dtype).min)
1409
+ # attn_mask = attn_mask.unsqueeze(0).unsqueeze(1)
1410
+ # attn_mask = attn_mask.expand(batch_size, 1, q_len, kv_len)
1411
+ # return attn_mask
1412
+
1413
+ # def model_forward_with_cache(cur_embeds, past_key_values, past_len: int, store_kv: bool):
1414
+ # q_len = cur_embeds.shape[1]
1415
+ # kv_len = past_len + q_len
1416
+
1417
+ # attn_mask = make_causal_mask(
1418
+ # q_len=q_len,
1419
+ # kv_len=kv_len,
1420
+ # past_len=past_len,
1421
+ # )
1422
+
1423
+ # model_mask = {
1424
+ # "full_attention": attn_mask,
1425
+ # "sliding_attention": attn_mask,
1426
+ # }
1427
+
1428
+ # position_ids = position_ids_base[:, past_len:past_len + q_len]
1429
+ # if batch_size > 1:
1430
+ # position_ids = position_ids.expand(batch_size, -1)
1431
+
1432
+ # outputs = self.model(
1433
+ # inputs_embeds=cur_embeds,
1434
+ # attention_mask=model_mask,
1435
+ # position_ids=position_ids,
1436
+ # past_key_values=past_key_values,
1437
+ # use_cache=True,
1438
+ # store_kv=store_kv,
1439
+ # )
1440
+ # return outputs
1441
+
1442
+ # def embed_token_ids(token_ids: torch.LongTensor):
1443
+ # if token_ids.device != embed_device:
1444
+ # token_ids_for_embed = token_ids.to(embed_device, non_blocking=True)
1445
+ # else:
1446
+ # token_ids_for_embed = token_ids
1447
+
1448
+ # token_embeds = embed_layer(token_ids_for_embed)
1449
+
1450
+ # if token_embeds.device != device:
1451
+ # token_embeds = token_embeds.to(device=device, non_blocking=True)
1452
+
1453
+ # return token_embeds
1454
+
1455
+ # def choose_commit_from_logits(logits, window_len: int):
1456
+ # """
1457
+ # Same commit rule as your original function.
1458
+ # """
1459
+ # cand_ids, cand_probs = self._sample_tokens(
1460
+ # logits,
1461
+ # temperature,
1462
+ # top_k,
1463
+ # top_p,
1464
+ # )
1465
+
1466
+ # cand_ids = cand_ids.to(device)
1467
+ # cand_probs = cand_probs.to(device)
1468
+
1469
+ # high_conf = cand_probs >= confidence_threshold
1470
+
1471
+ # prefix_ok = torch.cumprod(high_conf.to(torch.int64), dim=1).bool()
1472
+ # commit_counts = prefix_ok.sum(dim=1)
1473
+
1474
+ # min_commit = max(1, min(int(min_commit_tokens), window_len))
1475
+ # commit_counts = torch.clamp(commit_counts, min=min_commit, max=window_len)
1476
+
1477
+ # # This still causes one GPU -> CPU sync.
1478
+ # # It is hard to avoid because Python control flow needs commit_count.
1479
+ # commit_count = int(commit_counts.min().item())
1480
+ # commit_count = max(1, min(commit_count, window_len))
1481
+
1482
+ # commit_ids = cand_ids[:, :commit_count]
1483
+ # return commit_ids, commit_count
1484
+
1485
+ # # ------------------------------------------------------------------
1486
+ # # 1. Prefill prompt into KV-cache.
1487
+ # # ------------------------------------------------------------------
1488
+ # past_key_values = DynamicCache()
1489
+ # cache_len = 0
1490
+
1491
+ # if prompt_len > 0:
1492
+ # prefill_outputs = model_forward_with_cache(
1493
+ # cur_embeds=inputs_embeds,
1494
+ # past_key_values=past_key_values,
1495
+ # past_len=0,
1496
+ # store_kv=True,
1497
+ # )
1498
+ # past_key_values = prefill_outputs.past_key_values
1499
+ # cache_len = prompt_len
1500
+
1501
+ # # ------------------------------------------------------------------
1502
+ # # 2. Dynamic-commit generation with one-step delayed cache.
1503
+ # # ------------------------------------------------------------------
1504
+ # generated_ids_chunks = []
1505
+ # generated_len = 0
1506
+
1507
+ # # pending_commit_ids are already generated and returned, but not yet written
1508
+ # # into KV-cache. They will be cached together with the next mask-window forward.
1509
+ # pending_commit_ids = None
1510
+ # pending_commit_len = 0
1511
+
1512
+ # while generated_len < gen_length:
1513
+ # remaining = gen_length - generated_len
1514
+ # window_len = min(block_size, remaining)
1515
+
1516
+ # mask_embeds = mask_embed_vec.expand(batch_size, window_len, -1)
1517
+
1518
+ # if pending_commit_ids is None:
1519
+ # # First round:
1520
+ # # No previous committed tokens are waiting to be cached.
1521
+ # # Just predict the current mask window.
1522
+ # outputs = model_forward_with_cache(
1523
+ # cur_embeds=mask_embeds,
1524
+ # past_key_values=past_key_values,
1525
+ # past_len=cache_len,
1526
+ # store_kv=False,
1527
+ # )
1528
+
1529
+ # mask_hidden_states = outputs.last_hidden_state
1530
+
1531
+ # else:
1532
+ # # Fused round:
1533
+ # # Input = [previous committed real tokens] + [new mask window]
1534
+ # #
1535
+ # # We set store_kv=True so the previous committed tokens get cached.
1536
+ # # This will also temporarily cache the mask tokens, so we crop them away
1537
+ # # immediately after the forward.
1538
+ # pending_embeds = embed_token_ids(pending_commit_ids)
1539
+
1540
+ # cur_embeds = torch.cat(
1541
+ # [pending_embeds, mask_embeds],
1542
+ # dim=1,
1543
+ # )
1544
+
1545
+ # outputs = model_forward_with_cache(
1546
+ # cur_embeds=cur_embeds,
1547
+ # past_key_values=past_key_values,
1548
+ # past_len=cache_len,
1549
+ # store_kv=True,
1550
+ # )
1551
+
1552
+ # # The model output contains:
1553
+ # # [:, :pending_commit_len] -> hidden states for committed tokens
1554
+ # # [:, pending_commit_len:] -> hidden states for current mask window
1555
+ # mask_hidden_states = outputs.last_hidden_state[:, pending_commit_len:, :]
1556
+
1557
+ # past_key_values = outputs.past_key_values
1558
+
1559
+ # # Critical step:
1560
+ # # Remove the newly appended mask tokens from KV-cache.
1561
+ # # Keep only:
1562
+ # # old cache + pending committed tokens
1563
+ # cache_len = cache_len + pending_commit_len
1564
+ # past_key_values = crop_past_key_values(past_key_values, cache_len)
1565
+
1566
+ # pending_commit_ids = None
1567
+ # pending_commit_len = 0
1568
+
1569
+ # # Predict tokens only for the current mask window.
1570
+ # logits = self.lm_head(mask_hidden_states).float()
1571
+
1572
+ # commit_ids, commit_count = choose_commit_from_logits(
1573
+ # logits=logits,
1574
+ # window_len=window_len,
1575
+ # )
1576
+
1577
+ # generated_ids_chunks.append(commit_ids)
1578
+ # generated_len += commit_count
1579
+
1580
+ # # Do not immediately cache commit_ids.
1581
+ # # Cache them in the next iteration by concatenating them before the next mask window.
1582
+ # pending_commit_ids = commit_ids
1583
+ # pending_commit_len = commit_count
1584
+
1585
+ # # Same EOS behavior as your original version.
1586
+ # # This syncs GPU -> CPU. For pure speed benchmarking, pass eos_token_id=None.
1587
+ # if eos_token_id is not None:
1588
+ # if bool((commit_ids == eos_token_id).any().item()):
1589
+ # break
1590
+
1591
+ # # If generation ends while some commit_ids are still pending, no need to cache them,
1592
+ # # because we are done and only return token IDs.
1593
+
1594
+ # if len(generated_ids_chunks) == 0:
1595
+ # return torch.empty((batch_size, 0), dtype=torch.long, device=device)
1596
+
1597
+ # generated_ids = torch.cat(generated_ids_chunks, dim=1)
1598
+ # return generated_ids[:, :gen_length]
1599
+ @torch.inference_mode()
1600
+ def generate_with_bd3lm(
1601
+ self,
1602
+ inputs_embeds: torch.FloatTensor,
1603
+ gen_length: int = 256,
1604
+ steps: int = 8,
1605
+ temperature: float = 0.0,
1606
+ top_k: int = 0,
1607
+ top_p: float = 1.0,
1608
+ remasking_strategy: str = "dynamic_commit",
1609
+ confidence_threshold: float = 0.85,
1610
+ min_commit_tokens: int = 1,
1611
+ **kwargs,
1612
+ ):
1613
+ """
1614
+ Optimized dynamic-commit causal generation with delayed/fused KV-cache update.
1615
+
1616
+ Main speed-oriented changes:
1617
+ 1. Disable crop debug prints and cache-length checks.
1618
+ 2. Make EOS checking switchable, since it costs a per-step GPU -> CPU sync (currently hard-coded on via check_eos = True).
1619
+ 3. Remove the redundant block_size lookup, since the original function ultimately overwrote it with steps.
1620
+ 4. Use a faster greedy confidence path when temperature=0, top_k=0, top_p=1.
1621
+ 5. Build causal masks with only the current upper-triangular block masked.
1622
+ """
1623
+ from transformers.cache_utils import DynamicCache
1624
+
1625
+ device = inputs_embeds.device
1626
+ dtype = inputs_embeds.dtype
1627
+ batch_size = inputs_embeds.shape[0]
1628
+ prompt_len = inputs_embeds.shape[1]
1629
+
1630
+ if gen_length <= 0:
1631
+ return torch.empty((batch_size, 0), dtype=torch.long, device=device)
1632
+
1633
+ # In the original function, block_size was ultimately overwritten by steps.
1634
+ # Keep this behavior, but allow kwargs["block_size"] to override it if needed.
1635
+ block_size = int(kwargs.pop("block_size", steps))
1636
+ if block_size <= 0:
1637
+ raise ValueError(f"block_size must be positive, got {block_size}.")
1638
+
1639
+ mask_id = getattr(self, "mask_token_id", 151671)
1640
+
1641
+ # EOS checking causes GPU -> CPU synchronization.
1642
+ # Currently forced on below; restore the commented-out kwargs lookup to make it opt-in for speed.
1643
+ # check_eos = bool(kwargs.pop("check_eos", False))
1644
+ check_eos = True
1645
+ eos_token_id = kwargs.get("eos_token_id", 151645)
1646
+
1647
+ embed_layer = self.get_input_embeddings()
1648
+ embed_device = embed_layer.weight.device
1649
+
1650
+ max_total_len = prompt_len + gen_length + 2 * block_size + 8
1651
+ position_ids_base = torch.arange(
1652
+ max_total_len,
1653
+ device=device,
1654
+ dtype=torch.long,
1655
+ ).unsqueeze(0)
1656
+
1657
+ # Cache mask embedding once.
1658
+ mask_id_tensor = torch.tensor([[mask_id]], dtype=torch.long, device=embed_device)
1659
+ mask_embed_vec = embed_layer(mask_id_tensor)
1660
+ if mask_embed_vec.device != device:
1661
+ mask_embed_vec = mask_embed_vec.to(device=device, non_blocking=True)
1662
+
1663
+ neg_inf = torch.finfo(dtype).min
1664
+ future_mask_cache = {}
1665
+
1666
+ def crop_past_key_values(past_key_values, max_length: int):
1667
+ """
1668
+ Keep only the first max_length cached tokens.
1669
+
1670
+ This removes temporary mask-token KV entries after the fused forward.
1671
+ Debug prints and expensive validation checks are removed.
1672
+ """
1673
+ if hasattr(past_key_values, "crop"):
1674
+ past_key_values.crop(max_length)
1675
+ return past_key_values
1676
+
1677
+ if hasattr(past_key_values, "key_cache") and hasattr(past_key_values, "value_cache"):
1678
+ for layer_idx in range(len(past_key_values.key_cache)):
1679
+ if past_key_values.key_cache[layer_idx] is not None:
1680
+ past_key_values.key_cache[layer_idx] = (
1681
+ past_key_values.key_cache[layer_idx][..., :max_length, :]
1682
+ )
1683
+ if past_key_values.value_cache[layer_idx] is not None:
1684
+ past_key_values.value_cache[layer_idx] = (
1685
+ past_key_values.value_cache[layer_idx][..., :max_length, :]
1686
+ )
1687
+ return past_key_values
1688
+
1689
+ raise RuntimeError(
1690
+ "past_key_values does not support crop(max_length), "
1691
+ "and no key_cache/value_cache fallback was found."
1692
+ )
1693
+
1694
+ def get_future_mask(q_len: int):
1695
+ """
1696
+ Boolean upper-triangular mask for the current query block only.
1697
+ Cached by q_len to avoid rebuilding it repeatedly.
1698
+ """
1699
+ mask = future_mask_cache.get(q_len, None)
1700
+ if mask is None:
1701
+ mask = torch.ones((q_len, q_len), dtype=torch.bool, device=device).triu(1)
1702
+ future_mask_cache[q_len] = mask
1703
+ return mask
1704
+
1705
+ def make_causal_mask(q_len: int, kv_len: int, past_len: int):
1706
+ """
1707
+ Causal additive mask.
1708
+
1709
+ Past KV positions are always visible.
1710
+ Only the current q_len x q_len block needs upper-triangular masking.
1711
+ """
1712
+ attn_mask = torch.zeros((q_len, kv_len), dtype=dtype, device=device)
1713
+
1714
+ if q_len > 1:
1715
+ future_mask = get_future_mask(q_len)
1716
+ attn_mask[:, past_len:past_len + q_len].masked_fill_(future_mask, neg_inf)
1717
+
1718
+ attn_mask = attn_mask.unsqueeze(0).unsqueeze(1)
1719
+ return attn_mask.expand(batch_size, 1, q_len, kv_len)
1720
+
1721
+ def model_forward_with_cache(cur_embeds, past_key_values, past_len: int, store_kv: bool):
1722
+ q_len = cur_embeds.shape[1]
1723
+ kv_len = past_len + q_len
1724
+
1725
+ attn_mask = make_causal_mask(
1726
+ q_len=q_len,
1727
+ kv_len=kv_len,
1728
+ past_len=past_len,
1729
+ )
1730
+
1731
+ model_mask = {
1732
+ "full_attention": attn_mask,
1733
+ "sliding_attention": attn_mask,
1734
+ }
1735
+
1736
+ position_ids = position_ids_base[:, past_len:past_len + q_len]
1737
+ if batch_size > 1:
1738
+ position_ids = position_ids.expand(batch_size, -1)
1739
+
1740
+ return self.model(
1741
+ inputs_embeds=cur_embeds,
1742
+ attention_mask=model_mask,
1743
+ position_ids=position_ids,
1744
+ past_key_values=past_key_values,
1745
+ use_cache=True,
1746
+ store_kv=store_kv,
1747
+ )
1748
+
1749
+ def embed_token_ids(token_ids: torch.LongTensor):
1750
+ if token_ids.device != embed_device:
1751
+ token_ids = token_ids.to(embed_device, non_blocking=True)
1752
+
1753
+ token_embeds = embed_layer(token_ids)
1754
+
1755
+ if token_embeds.device != device:
1756
+ token_embeds = token_embeds.to(device=device, non_blocking=True)
1757
+
1758
+ return token_embeds
1759
+
1760
+ greedy_decode = (
1761
+ (temperature is None or float(temperature) == 0.0)
1762
+ and int(top_k) == 0
1763
+ and float(top_p) >= 1.0
1764
+ )
1765
+
1766
+ def choose_commit_from_hidden(mask_hidden_states, window_len: int):
1767
+ """
1768
+ Predict tokens and decide how many prefix tokens to commit.
1769
+
1770
+ For greedy decoding, avoid materializing full softmax probabilities.
1771
+ The selected token probability is computed as:
1772
+
1773
+ p_max = exp(max_logit - logsumexp(logits))
1774
+
1775
+ This is enough for confidence-threshold commit.
1776
+ """
1777
+ logits = self.lm_head(mask_hidden_states)
1778
+
1779
+ # Confidence computation is more stable in fp32.
1780
+ if logits.dtype != torch.float32:
1781
+ logits_for_prob = logits.float()
1782
+ else:
1783
+ logits_for_prob = logits
1784
+
1785
+ if greedy_decode:
1786
+ max_logits, cand_ids = torch.max(logits_for_prob, dim=-1)
1787
+ cand_probs = torch.exp(max_logits - torch.logsumexp(logits_for_prob, dim=-1))
1788
+ else:
1789
+ cand_ids, cand_probs = self._sample_tokens(
1790
+ logits_for_prob,
1791
+ temperature,
1792
+ top_k,
1793
+ top_p,
1794
+ )
1795
+
1796
+ if cand_ids.device != device:
1797
+ cand_ids = cand_ids.to(device, non_blocking=True)
1798
+ if cand_probs.device != device:
1799
+ cand_probs = cand_probs.to(device, non_blocking=True)
1800
+
1801
+ high_conf = cand_probs >= confidence_threshold
1802
+
1803
+ # Prefix must be continuously high-confidence.
1804
+ prefix_ok = torch.cumprod(high_conf.to(torch.int64), dim=1).bool()
1805
+ commit_counts = prefix_ok.sum(dim=1)
1806
+
1807
+ min_commit = max(1, min(int(min_commit_tokens), window_len))
1808
+ commit_counts = torch.clamp(commit_counts, min=min_commit, max=window_len)
1809
+
1810
+ # One GPU -> CPU sync remains necessary because Python controls the loop length.
1811
+ commit_count = int(commit_counts.min().item())
1812
+ commit_count = max(1, min(commit_count, window_len))
1813
+
1814
+ commit_ids = cand_ids[:, :commit_count]
1815
+ return commit_ids, commit_count
1816
+
1817
+ # ------------------------------------------------------------------
1818
+ # 1. Prefill prompt into KV cache.
1819
+ # ------------------------------------------------------------------
1820
+ past_key_values = DynamicCache()
1821
+ cache_len = 0
1822
+
1823
+ if prompt_len > 0:
1824
+ prefill_outputs = model_forward_with_cache(
1825
+ cur_embeds=inputs_embeds,
1826
+ past_key_values=past_key_values,
1827
+ past_len=0,
1828
+ store_kv=True,
1829
+ )
1830
+ past_key_values = prefill_outputs.past_key_values
1831
+ cache_len = prompt_len
1832
+
1833
+ # ------------------------------------------------------------------
1834
+ # 2. Dynamic-commit generation with one-step delayed cache.
1835
+ # ------------------------------------------------------------------
1836
+ generated_ids_chunks = []
1837
+ generated_len = 0
1838
+
1839
+ pending_commit_ids = None
1840
+ pending_commit_len = 0
1841
+
1842
+ while generated_len < gen_length:
1843
+ remaining = gen_length - generated_len
1844
+ window_len = min(block_size, remaining)
1845
+
1846
+ mask_embeds = mask_embed_vec.expand(batch_size, window_len, -1)
1847
+
1848
+ if pending_commit_ids is None:
1849
+ outputs = model_forward_with_cache(
1850
+ cur_embeds=mask_embeds,
1851
+ past_key_values=past_key_values,
1852
+ past_len=cache_len,
1853
+ store_kv=False,
1854
+ )
1855
+
1856
+ mask_hidden_states = outputs.last_hidden_state
1857
+
1858
+ else:
1859
+ pending_embeds = embed_token_ids(pending_commit_ids)
1860
+
1861
+ cur_embeds = torch.cat(
1862
+ [pending_embeds, mask_embeds],
1863
+ dim=1,
1864
+ )
1865
+
1866
+ outputs = model_forward_with_cache(
1867
+ cur_embeds=cur_embeds,
1868
+ past_key_values=past_key_values,
1869
+ past_len=cache_len,
1870
+ store_kv=True,
1871
+ )
1872
+
1873
+ mask_hidden_states = outputs.last_hidden_state[:, pending_commit_len:, :]
1874
+
1875
+ past_key_values = outputs.past_key_values
1876
+
1877
+ cache_len = cache_len + pending_commit_len
1878
+ past_key_values = crop_past_key_values(past_key_values, cache_len)
1879
+
1880
+ pending_commit_ids = None
1881
+ pending_commit_len = 0
1882
+
1883
+ commit_ids, commit_count = choose_commit_from_hidden(
1884
+ mask_hidden_states=mask_hidden_states,
1885
+ window_len=window_len,
1886
+ )
1887
+
1888
+ generated_ids_chunks.append(commit_ids)
1889
+ generated_len += commit_count
1890
+
1891
+ pending_commit_ids = commit_ids
1892
+ pending_commit_len = commit_count
1893
+
1894
+ # This check introduces a GPU -> CPU synchronization; check_eos is currently hard-coded to True.
1895
+ if check_eos and eos_token_id is not None:
1896
+ if bool((commit_ids == eos_token_id).any().item()):
1897
+ break
1898
+
1899
+ if len(generated_ids_chunks) == 0:
1900
+ return torch.empty((batch_size, 0), dtype=torch.long, device=device)
1901
+
1902
+ generated_ids = torch.cat(generated_ids_chunks, dim=1)
1903
+ return generated_ids[:, :gen_length]
1904
+
1905
+
1906
+
1907
+ def _sample_tokens(self, logits, temperature=0.0, top_k=0, top_p=1.0):
1908
+ """Sample tokens with temperature, top-k, and top-p."""
1909
+ batch_size = logits.shape[0]
1910
+ seq_len = logits.shape[1]
1911
+ vocab_size = logits.shape[-1]
1912
+
1913
+ logits_2d = logits.reshape(-1, vocab_size)
1914
+
1915
+ if temperature == 0:
1916
+ # Greedy sampling
1917
+ tokens = torch.argmax(logits_2d, dim=-1, keepdim=True)
1918
+ probs = F.softmax(logits_2d, dim=-1)
1919
+ token_probs = torch.gather(probs, -1, tokens)
1920
+ else:
1921
+ # Apply temperature
1922
+ logits_scaled = logits_2d / temperature
1923
+
1924
+ # Apply top-k
1925
+ if top_k > 0:
1926
+ values, _ = torch.topk(logits_scaled, top_k)
1927
+ min_values = values[:, -1:]
1928
+ logits_scaled = torch.where(logits_scaled < min_values, float('-inf'), logits_scaled)
1929
+
1930
+ # Apply top-p
1931
+ if top_p < 1.0:
1932
+ sorted_logits, sorted_indices = torch.sort(logits_scaled, descending=True)
1933
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
1934
+ sorted_mask = cumulative_probs > top_p
1935
+ sorted_mask[:, 1:] = sorted_mask[:, :-1].clone()
1936
+ sorted_mask[:, 0] = False
1937
+ mask_indices = torch.scatter(
1938
+ torch.zeros_like(logits_scaled, dtype=torch.bool),
1939
+ -1, sorted_indices, sorted_mask
1940
+ )
1941
+ logits_scaled = logits_scaled.masked_fill(mask_indices, float('-inf'))
1942
+
1943
+ probs = F.softmax(logits_scaled, dim=-1)
1944
+ tokens = torch.multinomial(probs, num_samples=1)
1945
+ token_probs = torch.gather(probs, -1, tokens)
1946
+
1947
+ return tokens.view(batch_size, seq_len), token_probs.view(batch_size, seq_len)
1948
+
1949
+ def _get_num_transfer_tokens(self, block_length, steps):
1950
+ """Calculate how many tokens to unmask at each step."""
1951
+ if steps == 0:
1952
+ return torch.zeros(1, dtype=torch.int64)
1953
+ base = block_length // steps
1954
+ remainder = block_length % steps
1955
+ num_transfer = torch.zeros(steps + 1, dtype=torch.int64) + base
1956
+ num_transfer[:remainder] += 1
1957
+ return num_transfer
1958
+
1959
+ from transformers import AutoConfig, AutoModelForCausalLM
1960
+
1961
+ AutoConfig.register("diffusionvl_qwenvl", DiffusionVL_Qwen2_5_VL_Config)
1962
+ AutoModelForCausalLM.register(DiffusionVL_Qwen2_5_VL_Config, DiffusionVL_Qwen2_5_VL_ForConditionalGeneration)
1963
+
1964
+
1965
+ __all__ = [
1966
+ "DiffusionVL_Qwen2_5_VL_Config",
1967
+ "DiffusionVL_Qwen2_5_VL_VisionConfig",
1968
+ "DiffusionVL_Qwen2_5_VL_PreTrainedModel",
1969
+ "DiffusionVL_Qwen2_5_VL_Model",
1970
+ "DiffusionVL_Qwen2_5_VL_ForConditionalGeneration",
1971
+ ]
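The confidence-threshold commit rule used by `choose_commit_from_hidden` in `generate_with_bd3lm` can be illustrated without tensors. The following is a minimal pure-Python sketch, not the repository's implementation: the helper names `softmax_max_prob`, `prefix_commit_count`, and `batch_commit_count` are hypothetical and chosen only for illustration. It assumes each row holds the per-position confidence of the greedy token in the current mask window, and that the window is non-empty.

```python
import math


def softmax_max_prob(logits):
    """Probability of the argmax token, computed as exp(max_logit - logsumexp(logits)).

    This mirrors the greedy shortcut in the diff: the full softmax is never materialized.
    """
    m = max(logits)
    # Numerically stable logsumexp: subtract the max before exponentiating.
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return math.exp(m - lse)


def prefix_commit_count(probs, threshold=0.85, min_commit=1):
    """Length of the leading run with prob >= threshold, clamped to [min_commit, len(probs)]."""
    count = 0
    for p in probs:
        if p < threshold:
            break  # the committed prefix must be continuously high-confidence
        count += 1
    low = max(1, min(min_commit, len(probs)))
    return max(low, min(count, len(probs)))


def batch_commit_count(batch_probs, threshold=0.85, min_commit=1):
    """All rows advance together, so the batch commits the smallest per-row count."""
    return min(prefix_commit_count(row, threshold, min_commit) for row in batch_probs)
```

Under these assumptions, a window with confidences `[0.9, 0.95, 0.5, 0.99]` commits its first two tokens, because the third position breaks the high-confidence prefix even though the fourth is confident again. Taking the minimum across the batch keeps every row at the same cache length, at the cost of committing fewer tokens for rows that were more confident.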
preprocessor_config.json ADDED
@@ -0,0 +1,23 @@
1
+ {
2
+ "min_pixels": 3136,
3
+ "max_pixels": 12845056,
4
+ "patch_size": 14,
5
+ "temporal_patch_size": 2,
6
+ "merge_size": 2,
7
+ "image_mean": [
8
+ 0.48145466,
9
+ 0.4578275,
10
+ 0.40821073
11
+ ],
12
+ "image_std": [
13
+ 0.26862954,
14
+ 0.26130258,
15
+ 0.27577711
16
+ ],
17
+ "image_processor_type": "Qwen2VLImageProcessor",
18
+ "use_fast": false,
19
+ "processor_class": "DiffusionVL_Qwen2_5_VL_Processor",
20
+ "auto_map": {
21
+ "AutoProcessor": "processing_diffusionvl_qwen2_5_vl.DiffusionVL_Qwen2_5_VL_Processor"
22
+ }
23
+ }
processing_diffusionvl_qwen2_5_vl.py ADDED
@@ -0,0 +1,316 @@
1
+ # coding=utf-8
2
+ # Copyright 2025 The HustVL Team and The HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on Qwen2.5-VL, which is derived from EleutherAI's GPT-NeoX library
5
+ # and the GPT-NeoX and OPT implementations. It has been modified to create DiffusionVL.
6
+ #
7
+ # Licensed under the Apache License, Version 2.0 (the "License");
8
+ # you may not use this file except in compliance with the License.
9
+ # You may obtain a copy of the License at
10
+ #
11
+ # http://www.apache.org/licenses/LICENSE-2.0
12
+ #
13
+ # Unless required by applicable law or agreed to in writing, software
14
+ # distributed under the License is distributed on an "AS IS" BASIS,
15
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16
+ # See the License for the specific language governing permissions and
17
+ # limitations under the License.
18
+ """
19
+ DiffusionVL Processor - Combines image processor and tokenizer.
20
+ """
21
+
22
+ import re
23
+ from typing import List, Optional, Union
24
+
25
+ import torch
26
+
27
+ from transformers.feature_extraction_utils import BatchFeature
28
+ from transformers.image_utils import ImageInput
29
+ from transformers.processing_utils import ProcessingKwargs, ProcessorMixin, Unpack
30
+ from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
31
+ from transformers.video_utils import VideoInput
32
+
33
+
34
+ IMAGE_TOKEN_INDEX = -200
35
+ DEFAULT_IMAGE_TOKEN = "<image>"
36
+
37
+
38
+ class DiffusionVL_Qwen2_5_VL_ProcessorKwargs(ProcessingKwargs, total=False):
39
+ """Keyword arguments for DiffusionVL_Qwen2_5_VL_Processor."""
40
+
41
+ _defaults = {
42
+ "text_kwargs": {
43
+ "padding": False,
44
+ },
45
+ }
46
+
47
+
+ def tokenizer_image_token(
+     prompt: str,
+     tokenizer,
+     image_token_index: int = IMAGE_TOKEN_INDEX,
+     return_tensors: Optional[str] = None,
+ ) -> Union[List[int], torch.Tensor]:
+     """
+     Tokenize text with image placeholders, replacing <image> with IMAGE_TOKEN_INDEX.
+
+     This implementation matches the training code (llava/mm_utils.py::tokenizer_image_token).
+
+     Args:
+         prompt: Input text containing <image> placeholders.
+         tokenizer: The tokenizer to use for encoding text.
+         image_token_index: The token index to use for image placeholders.
+         return_tensors: If "pt", return a PyTorch tensor.
+
+     Returns:
+         List of token IDs or a PyTorch tensor.
+     """
+     # Tokenize each chunk (matching training code behavior)
+     prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split(DEFAULT_IMAGE_TOKEN)]
+
+     def insert_separator(X, sep):
+         return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]
+
+     input_ids = []
+     offset = 0
+
+     # Handle BOS token if present (matching training code)
+     if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
+         offset = 1
+         input_ids.append(prompt_chunks[0][0])
+
+     for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
+         input_ids.extend(x[offset:])
+
+     if return_tensors is not None:
+         if return_tensors == "pt":
+             return torch.tensor(input_ids, dtype=torch.long)
+         raise ValueError(f"Unsupported tensor type: {return_tensors}")
+     return input_ids
+
+
+ class DiffusionVL_Qwen2_5_VL_Processor(ProcessorMixin):
+     r"""
+     Constructs a DiffusionVL processor which wraps an image processor and a tokenizer into a single processor.
+
+     [`DiffusionVL_Qwen2_5_VL_Processor`] offers all the functionalities of [`Qwen2VLImageProcessor`] and [`Qwen2TokenizerFast`].
+     See the [`~DiffusionVL_Qwen2_5_VL_Processor.__call__`] and [`~DiffusionVL_Qwen2_5_VL_Processor.decode`] for more information.
+
+     This processor uses LLaVA-style image token handling:
+     - `<image>` in text is replaced with `IMAGE_TOKEN_INDEX` (-200) in input_ids
+     - The model's `prepare_inputs_labels_for_multimodal` replaces -200 with actual image features
+
+     Args:
+         image_processor ([`Qwen2VLImageProcessor`], *optional*):
+             The image processor is a required input.
+         tokenizer ([`Qwen2TokenizerFast`], *optional*):
+             The tokenizer is a required input.
+         chat_template (`str`, *optional*):
+             A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
+
+     Example:
+
+     ```python
+     >>> from transformers import AutoProcessor
+     >>> from PIL import Image
+
+     >>> processor = AutoProcessor.from_pretrained("path/to/model", trust_remote_code=True)
+
+     >>> # Prepare text with image placeholder
+     >>> messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}]
+     >>> text = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+     >>> # Process image and text
+     >>> image = Image.open("image.jpg")
+     >>> inputs = processor(text=[text], images=[image], return_tensors="pt")
+     ```
+     """
+
+     attributes = ["image_processor", "tokenizer"]
+     image_processor_class = "Qwen2VLImageProcessor"
+     tokenizer_class = ("Qwen2Tokenizer", "Qwen2TokenizerFast")
+
+     def __init__(
+         self,
+         image_processor=None,
+         tokenizer=None,
+         chat_template: Optional[str] = None,
+         **kwargs,
+     ):
+         self.image_token = DEFAULT_IMAGE_TOKEN
+         self.image_token_index = IMAGE_TOKEN_INDEX
+
+         super().__init__(image_processor, tokenizer, chat_template=chat_template)
+
+     def __call__(
+         self,
+         images: Optional[ImageInput] = None,
+         text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
+         videos: Optional[VideoInput] = None,
+         **kwargs: Unpack[DiffusionVL_Qwen2_5_VL_ProcessorKwargs],
+     ) -> BatchFeature:
+         """
+         Main method to prepare for the model one or several sequences and image(s).
+
+         If `text` is not `None`, each sequence is split on `<image>` and the chunks are encoded with the tokenizer
+         (see `tokenizer_image_token`); the batch is then right-padded to its longest sequence. To prepare the vision
+         inputs, this method forwards the `images` and `images_kwargs` arguments to Qwen2VLImageProcessor's
+         [`~Qwen2VLImageProcessor.__call__`] if `images` is not `None`.
+
+         The text should contain `<image>` placeholders where images should be inserted.
+         These will be replaced with `IMAGE_TOKEN_INDEX` (-200) in the output input_ids.
+
+         Args:
+             images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, *optional*):
+                 The image or batch of images to be prepared. Each image can be a PIL image, NumPy array, or PyTorch
+                 tensor. Both channels-first and channels-last formats are supported.
+             text (`str`, `List[str]`, *optional*):
+                 The sequence or batch of sequences to be encoded. Each sequence should be a string containing
+                 `<image>` placeholders where images will be inserted.
+             videos (`np.ndarray`, `torch.Tensor`, `List[np.ndarray]`, *optional*):
+                 The video or batch of videos to be prepared. Currently ignored by this method.
+             return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                 If set, will return tensors of a particular framework. Acceptable values are:
+                 - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                 - `'np'`: Return NumPy `np.ndarray` objects.
+
+         Returns:
+             [`BatchFeature`]: A [`BatchFeature`] with the following fields:
+
+             - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
+             - **attention_mask** -- List of indices specifying which tokens should be attended to by the model.
+               Returned when `text` is not `None`.
+             - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
+             - **image_grid_thw** -- List of image 3D grid dimensions. Returned when `images` is not `None`.
+         """
+         output_kwargs = self._merge_kwargs(
+             DiffusionVL_Qwen2_5_VL_ProcessorKwargs,
+             tokenizer_init_kwargs=self.tokenizer.init_kwargs,
+             **kwargs,
+         )
+
+         # Process images
+         image_inputs = {}
+         if images is not None:
+             image_inputs = self.image_processor(
+                 images=images, **output_kwargs.get("images_kwargs", {})
+             )
+
+         # Handle text input
+         if text is None:
+             return BatchFeature(data=image_inputs)
+
+         if not isinstance(text, list):
+             text = [text]
+
+         # Tokenize with LLaVA-style image token handling
+         return_tensors = output_kwargs.get("text_kwargs", {}).pop("return_tensors", None)
+
+         all_input_ids = []
+         for t in text:
+             input_ids = tokenizer_image_token(
+                 t, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors=None
+             )
+             all_input_ids.append(input_ids)
+
+         # Right-pad every sequence to the longest one in the batch
+         max_len = max(len(ids) for ids in all_input_ids)
+         padded_input_ids = []
+         attention_masks = []
+
+         # Fall back to ID 0 when the tokenizer defines no pad token
+         pad_token_id = (
+             self.tokenizer.pad_token_id if self.tokenizer.pad_token_id is not None else 0
+         )
+
+         for ids in all_input_ids:
+             padding_length = max_len - len(ids)
+             padded_ids = ids + [pad_token_id] * padding_length
+             mask = [1] * len(ids) + [0] * padding_length
+             padded_input_ids.append(padded_ids)
+             attention_masks.append(mask)
+
+         text_inputs = {
+             "input_ids": padded_input_ids,
+             "attention_mask": attention_masks,
+         }
+
+         return BatchFeature(data={**text_inputs, **image_inputs}, tensor_type=return_tensors)
+
+     def build_conversation_input_ids(
+         self,
+         messages: List[dict],
+         images: Optional[List] = None,
+         add_generation_prompt: bool = True,
+     ) -> dict:
+         """
+         Build a LLaVA-style prompt string from conversation messages.
+
+         This method converts a list of messages into a prompt string with `<image>` placeholders.
+         Uses LLaVA-style chat template format (trained format). Despite its name, it does not tokenize;
+         pass the returned text to `__call__` to obtain input_ids.
+
+         Args:
+             messages: List of message dicts with 'role' and 'content' keys.
+                 Content can be a string or a list of dicts with 'type' key ('text' or 'image').
+             images: Optional list of images. Currently unused.
+             add_generation_prompt: Whether to add generation prompt at the end.
+
+         Returns:
+             dict with 'text' key containing the prompt string with `<image>` placeholders.
+         """
+         # Build LLaVA-style prompt directly
+         # Format: <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\nPrompt<|im_end|>\n<|im_start|>assistant\n
+
+         text_parts = []
+
+         for message in messages:
+             role = message.get("role", "user")
+             content = message.get("content", "")
+
+             text_parts.append(f"<|im_start|>{role}\n")
+
+             # Handle content - can be string or list of content items
+             if isinstance(content, str):
+                 text_parts.append(content)
+             elif isinstance(content, list):
+                 for item in content:
+                     if isinstance(item, dict):
+                         if item.get("type") == "image":
+                             text_parts.append(DEFAULT_IMAGE_TOKEN)
+                         elif item.get("type") == "text":
+                             text_parts.append(item.get("text", ""))
+                     else:
+                         text_parts.append(str(item))
+
+             text_parts.append("<|im_end|>\n")
+
+         if add_generation_prompt:
+             text_parts.append("<|im_start|>assistant\n")
+
+         text = "".join(text_parts)
+         return {"text": text}
+
+     def batch_decode(self, *args, **kwargs):
+         """
+         Decode a batch of token IDs to text.
+
+         This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedTokenizer.batch_decode`].
+         Please refer to the docstring of this method for more information.
+         """
+         return self.tokenizer.batch_decode(*args, **kwargs)
+
+     def decode(self, *args, **kwargs):
+         """
+         Decode token IDs to text.
+
+         This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedTokenizer.decode`].
+         Please refer to the docstring of this method for more information.
+         """
+         return self.tokenizer.decode(*args, **kwargs)
+
+     @property
+     def model_input_names(self) -> List[str]:
+         """Return the list of model input names."""
+         tokenizer_names = self.tokenizer.model_input_names
+         image_processor_names = self.image_processor.model_input_names
+         return list(dict.fromkeys(tokenizer_names + image_processor_names))
+
+
+ __all__ = ["DiffusionVL_Qwen2_5_VL_Processor", "tokenizer_image_token"]
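The chunk-and-interleave step in `tokenizer_image_token` can be tried without loading a real tokenizer. The sketch below is only an illustration: `toy_encode` and `interleave_image_tokens` are hypothetical names, the toy encoder maps each word to its length, and the BOS branch is omitted on the assumption that the tokenizer defines no BOS token.

```python
from typing import Callable, List

IMAGE_TOKEN_INDEX = -200  # same sentinel value as in the processor file
DEFAULT_IMAGE_TOKEN = "<image>"


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one ID per word, equal to the word length."""
    return [len(word) for word in text.split()]


def interleave_image_tokens(prompt: str, encode: Callable[[str], List[int]]) -> List[int]:
    """Encode the text around each <image> and join the chunks with the sentinel."""
    chunks = [encode(chunk) for chunk in prompt.split(DEFAULT_IMAGE_TOKEN)]
    input_ids: List[int] = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            # One sentinel between every pair of neighbouring text chunks.
            input_ids.append(IMAGE_TOKEN_INDEX)
        input_ids.extend(chunk)
    return input_ids


ids = interleave_image_tokens("look at <image> and describe it", toy_encode)
print(ids)  # [4, 2, -200, 3, 8, 2]
```

Each `<image>` placeholder yields exactly one `-200` entry, so the number of sentinels in `input_ids` should equal the number of images passed alongside the text.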
special_tokens_map.json ADDED
@@ -0,0 +1,38 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>",
+     "<nl>",
+     "<fcel>",
+     "<ecel>",
+     "<lcel>",
+     "<ucel>",
+     "<xcel>",
+     "<ched>"
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
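Because `<|im_end|>` doubles as the EOS token and `<|im_start|>` opens every turn, a prompt or a decoded transcript can be split back into turns using these two markers alone. The following is a minimal sketch under that assumption; `parse_turns` is a hypothetical helper and is not part of the commit.

```python
from typing import List, Tuple

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"  # also the EOS token in special_tokens_map.json


def parse_turns(text: str) -> List[Tuple[str, str]]:
    """Split a ChatML-style string into (role, content) pairs."""
    turns: List[Tuple[str, str]] = []
    # Everything before the first <|im_start|> is ignored.
    for block in text.split(IM_START)[1:]:
        # Drop the end marker and anything after it (e.g. the trailing newline).
        body = block.split(IM_END, 1)[0]
        # The role is the first line; the remainder is the turn content.
        role, _, content = body.partition("\n")
        turns.append((role, content))
    return turns


transcript = (
    "<|im_start|>user\n<image>Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\nA cat.<|im_end|>\n"
)
print(parse_turns(transcript))
# [('user', '<image>Describe this image.'), ('assistant', 'A cat.')]
```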
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
+ size 11421896
tokenizer_config.json ADDED
@@ -0,0 +1,263 @@
+ {
+   "add_bos_token": false,
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151646": {
+       "content": "<|object_ref_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151647": {
+       "content": "<|object_ref_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151648": {
+       "content": "<|box_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151649": {
+       "content": "<|box_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151650": {
+       "content": "<|quad_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151651": {
+       "content": "<|quad_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151652": {
+       "content": "<|vision_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151653": {
+       "content": "<|vision_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151654": {
+       "content": "<|vision_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151655": {
+       "content": "<|image_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151656": {
+       "content": "<|video_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151657": {
+       "content": "<tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151658": {
+       "content": "</tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151659": {
+       "content": "<|fim_prefix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151660": {
+       "content": "<|fim_middle|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151661": {
+       "content": "<|fim_suffix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151662": {
+       "content": "<|fim_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151663": {
+       "content": "<|repo_name|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151672": {
+       "content": "<nl>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151673": {
+       "content": "<fcel>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151674": {
+       "content": "<ecel>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151675": {
+       "content": "<lcel>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151676": {
+       "content": "<ucel>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151677": {
+       "content": "<xcel>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151678": {
+       "content": "<ched>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>",
+     "<nl>",
+     "<fcel>",
+     "<ecel>",
+     "<lcel>",
+     "<ucel>",
+     "<xcel>",
+     "<ched>"
+   ],
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "model_max_length": 8192,
+   "pad_token": "<|endoftext|>",
+   "padding_side": "right",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }
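With `"padding_side": "right"` and `<|endoftext|>` (ID 151643 in `added_tokens_decoder`) as the pad token, a batch of unequal-length sequences would be padded at the end and masked accordingly. The sketch below shows that behaviour in plain Python; `pad_right` is a hypothetical helper written for illustration, not a function from this repository or from `transformers`.

```python
from typing import List, Tuple

PAD_TOKEN_ID = 151643  # <|endoftext|> according to added_tokens_decoder


def pad_right(
    batch: List[List[int]], pad_id: int = PAD_TOKEN_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the batch maximum and build attention masks."""
    max_len = max((len(ids) for ids in batch), default=0)
    padded: List[List[int]] = []
    masks: List[List[int]] = []
    for ids in batch:
        gap = max_len - len(ids)
        padded.append(ids + [pad_id] * gap)
        # 1 marks a real token, 0 marks padding the model should not attend to.
        masks.append([1] * len(ids) + [0] * gap)
    return padded, masks


input_ids, attention_mask = pad_right([[1, 2, 3], [4]])
print(input_ids)       # [[1, 2, 3], [4, 151643, 151643]]
print(attention_mask)  # [[1, 1, 1], [1, 0, 0]]
```

The image sentinel `-200` is an ordinary list element here, so it passes through padding unchanged and stays unmasked.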
vocab.json ADDED
The diff for this file is too large to render. See raw diff