DLLM-Agent committed on
Commit 8973686 · verified · 1 Parent(s): 295d019

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ images/image.png filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,34 @@
+ OPENPANGU MODEL LICENSE AGREEMENT VERSION 1.0
+
+ This OPENPANGU MODEL LICENSE AGREEMENT VERSION 1.0 (the "Agreement") is a legal agreement between You and Huawei Technologies Co., Ltd. ("Huawei", "We" or "Us"), and it governs Your reproduction, use, modification, and distribution of openPangu as made available by Huawei under this Agreement.
+
+ By using, reproducing, modifying, distributing, performing or displaying any portion or element of openPangu, or otherwise accepting the terms of this Agreement, You agree to be bound by this Agreement.
+
+ 1. Definitions.
+ 1.1. “openPangu” or “Model” means the openPangu large language models and software, including trained model weights, parameters (including optimizer states), and the accompanying source code and scripts released under this Agreement.
+ 1.2. “Derivative Model” means all (1) modifications to the Model, (2) works based on the Model, and (3) any other derivative works of the Model. For clarity, information or content resulting from operating or otherwise using the Model is not a Derivative Model.
+ 1.3. “You” or “Your” means an individual or Legal Entity exercising permissions granted by this Agreement and/or using the Model for any purpose.
+ 1.4. “Third Party” or “Third Parties” means individuals or legal entities that are not under common control with Us or You.
+
+ 2. License Grant. Subject to Your full compliance with the terms and conditions of this Agreement, We hereby grant to You a perpetual, worldwide, non-exclusive, non-transferable, no-charge, royalty-free license (except as stated in Section 3) to use, reproduce, modify, and distribute the Model.
+
+ 3. Conditions for License Grant. You represent and warrant that You will not access, download, install, run, deploy, integrate, modify, or otherwise use the Model, directly or indirectly, within the European Union.
+
+ 4. Redistribution.
+ 4.1. If You distribute the Model or a Derivative Model, You shall retain in Your distribution (1) a copy of this Agreement, and (2) all copyright notices and other notices of origin included in the Model that are applicable to Your distribution.
+ 4.2. Further, if You distribute or make available to Third Parties a product or service (including another AI model) based on the Model, You are required to (1) display the acknowledgement “Powered by openPangu” and (2) include the trademark notice “openPangu is a trademark of Huawei Technologies Co., Ltd.” on related webpages, user manuals, product documentation, or other advertising materials mentioning features of the Model.
+ 4.3. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for the use, reproduction, or distribution of Your modifications, or for any Derivative Model made by You as a whole, provided Your use, reproduction, and distribution of the Model otherwise comply with the terms and conditions of this Agreement.
+
+ 5. Ownership. We do not claim ownership of any information or content generated using the Model or a Derivative Model made by You. You are solely responsible for evaluating the accuracy and appropriateness of such information or content for Your use case.
+
+ 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of Huawei, except as required to comply with Section 4.2.
+
+ 7. Indemnity. You will indemnify and hold harmless Huawei from and against any claim by any third party arising out of or related to Your use or distribution of the Model or a Derivative Model made by You (e.g., a violation of Section 3). For the avoidance of doubt, “third party” in this clause includes supervisory authorities.
+
+ 8. THE MODEL IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE, NONINFRINGEMENT, ACCURACY, OR THE ABSENCE OF LATENT OR OTHER DEFECTS OR ERRORS, WHETHER OR NOT DISCOVERABLE, ALL TO THE GREATEST EXTENT PERMISSIBLE UNDER APPLICABLE LAW.
+
+ 9. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO, ANY DIRECT, INDIRECT, SPECIAL, OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OF OR INABILITY TO USE THE MODEL, IN WHOLE OR IN PART, HOWEVER CAUSED AND REGARDLESS OF THE LEGAL THEORY ON WHICH IT IS BASED, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
+
+ END OF THE TERMS AND CONDITIONS
README.md CHANGED
@@ -1,3 +1,172 @@
- ---
- license: unknown
- ---
+ # Open-Source Pangu openPangu-7B-Diffusion-DeepDiver
+
+ [中文](README_CN.md) | English
+
+ ## 1. Introduction
+
+ openPangu-7B-Diffusion-DeepDiver is a 7B-parameter language model based on block diffusion large language models (Diffusion LLMs), specifically trained and fine-tuned for multi-agent scenarios (including tool invocation, information retrieval, and multi-step decision-making). Its underlying architecture and inference pipeline follow the design of [openPangu-R-7B-Diffusion](https://ai.gitcode.com/ascend-tribe/openPangu-R-7B-Diffusion/blob/main/README.md) (including block-wise denoising and bidirectional attention within blocks), so single-pass generation and parallel decoding keep the same structure and interfaces.
+
+ For complete evaluations and training details, please refer to the technical report *“DLLM Agent: See Farther, Run Faster”* ([arXiv:2602.07451v2](https://arxiv.org/html/2602.07451v2)).
+
+ - openPangu-7B-Diffusion-DeepDiver: agent model with a context length of 32k.
+
+ ### Key Features
+ - Uses the same DLLM architecture and iterative inference pipeline as openPangu-R-7B-Diffusion.
+ - Trained and fine-tuned with data and objectives tailored for in-depth agent research, making it more robust in multi-round tool invocation and planning tasks.
+ - Introduces context-clean corruption and span-aware attention alignment to reduce noise propagation in multi-round agent dialogues and improve the reliability of tool-invocation formats.
+
+ #### Inference
+
+ ![Context_Causal_Block_Diffusion_LLM](images/Context_Causal_Block_Diffusion_LLM.png)
+
+ openPangu-7B-Diffusion-DeepDiver adopts **context-causal block diffusion decoding**, performing diffusion decoding block by block. Within each block, full attention is applied, while causal attention is used over the preceding context. Once all tokens in a block are decoded, the entire block is stored in the historical KV cache with causal masking, and decoding proceeds to the first token of the next block (a simplified sketch follows the list below).
+
+ - Supports variable-length inference and KV caching.
+ - Flexible context length, not limited by the block size.
+ - Supports both autoregressive and block diffusion decoding.
+ - Uses confidence-threshold sampling, achieving up to 2.5× throughput improvement over standard autoregressive decoding.
+ - Similar to Fast dLLMv2, small blocks can be configured within each block to trade throughput against quality, with optimal results typically at small-block sizes of 4 or 8.
+
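+ A minimal sketch of one block's denoising loop, assuming a model whose logit at position *i−1* predicts token *i* and a non-empty prompt; KV-cache handling and small blocks are omitted here (the full implementation is in `inference/generation_utils.py`):
+
+ ```python
+ import torch
+
+ def denoise_block(model, x, start, end, mask_token_id, threshold=0.9):
+     """Iteratively fill one block of x; earlier blocks are treated as fixed context."""
+     while (x[:, start:end] == mask_token_id).any():
+         logits = model(x).logits[:, start - 1:end - 1]   # logit at i-1 predicts token i
+         probs = torch.softmax(logits, dim=-1)
+         conf, cand = probs.max(dim=-1)
+         masked = x[:, start:end] == mask_token_id
+         conf = conf.masked_fill(~masked, -torch.inf)
+         commit = masked & (conf > threshold)             # confidence-threshold sampling
+         best = conf.argmax(dim=1)                        # most confident masked position per row
+         commit[torch.arange(x.size(0)), best] = True     # always commit at least one token
+         commit &= masked                                 # never overwrite decoded tokens
+         x[:, start:end][commit] = cand[commit]
+     return x
+ ```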
+ ##### Inference in Agent Workflows
+
+ Integrated into the [DeepDiver v2 Agent](https://ai.gitcode.com/ascend-tribe/openPangu-Embedded-7B-DeepDiver/tree/main/deepdiver_v2) workflow, the model applies DLLM’s iterative denoising inference strategy for each round of tool-call generation.
+
+ DeepDiver v2 is a planner-centered MAS (Multi-Agent System) architecture that coordinates multiple executors.
+
+ For detailed information, refer to its [technical report](https://ai.gitcode.com/ascend-tribe/openPangu-Embedded-7B-DeepDiver).
+
+ #### Training
+
+ ![Training overview](images/image.png)
+
+ ##### Training Corpus
+
+ The model is trained on 11k agent trajectories that were specially collected or synthesized (including planner → seeker multi-agent interactions, real tool calls, and tool-return traces). These data are meant to teach the model to generate semantically consistent and format-compliant tool-invocation instructions over multi-round interactions. See the “Agent-oriented Fine-tuning” section of the technical report for details.
+
+ ##### Supervision Method
+
+ Cross-entropy losses for the diffusion and autoregressive objectives are jointly optimized during training, which stabilizes training and preserves reliable left-to-right generation.
+
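+ A runnable illustration of such a joint objective (the mixing weight `lam` and the exact target layout are assumptions for illustration; the report does not spell them out here):
+
+ ```python
+ import torch.nn.functional as F
+
+ def joint_loss(diff_logits, diff_targets, ar_logits, ar_targets, lam=1.0):
+     # Both terms are standard token-level cross-entropy; positions that should
+     # not be supervised carry the ignore_index label (-100).
+     l_diff = F.cross_entropy(diff_logits.transpose(1, 2), diff_targets, ignore_index=-100)
+     l_ar = F.cross_entropy(ar_logits.transpose(1, 2), ar_targets, ignore_index=-100)
+     return l_diff + lam * l_ar
+ ```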
+ ##### Masking and Attention Alignment
+
+ To address the information contamination that diffusion causes when multi-round contexts and tool outputs are mixed, training applies context-clean corruption, which masks out irrelevant context segments, and span-aware attention alignment within the generation spans. Experiments on agent datasets show that both techniques improve the final information-retrieval scores.
+
+ ## 2. Model Architecture
+
+ |                                | openPangu-7B-Diffusion-DeepDiver |
+ | :----------------------------: | :-------------------------: |
+ | **Architecture** | Dense |
+ | **Parameters (Non-Embedding)** | 7B |
+ | **Number of Layers** | 34 |
+ | **Intermediate (FFN) Dimension** | 12800 |
+ | **Attention Mechanism** | GQA |
+ | **Number of Attention Heads** | 32 for Q, 8 for KV |
+ | **Vocabulary Size** | 153k |
+ | **Context Length** | 32k |
+ | **Continued Training Tokens** | 700B |
+
+ ## 3. Evaluation Results
+
+ Table 1. Comparison results on a 110-question subset of BrowseComp-zh.
+
+ | Method | Accuracy (%) | Tool Calls | Agent Rounds | Tool Failure Rate |
+ | ---------------------------------- | -----------: | ---------: | -----------: | ----------------: |
+ | AR Agent (autoregressive backbone) | 15.5 | 7.5 | 14.8 | 1.9% |
+ | DLLM Agent (diffusion backbone) | 15.5 | 6.7 | 13.0 | 6.4% |
+
+ Although final accuracy on this subset is comparable to the AR baseline, the DLLM agent needs fewer tool calls and sub-agent rounds and reduces average end-to-end latency by about 30%. However, it also shows a higher tool failure rate, indicating that it is still less stable than AR models.
+
+ ## 4. Deployment and Usage
+
+ ### 4.1 Environment Setup
+
+ ##### Hardware Requirements
+
+ Atlas 800T A2 (64GB). For drivers and firmware, see
+ [[Atlas 800T A2](https://www.hiascend.com/hardware/firmware-drivers/community?product=4&model=26&cann=8.2.RC1.alpha003&driver=Ascend+HDK+25.0.RC1)].
+
+ ##### Software Environment
+
+ - OS: Linux (openEuler ≥ 24.03 recommended)
+ - CANN == 8.1.RC1. See [[CANN Install]](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=Ubuntu&Software=cannToolKit)
+ - python == 3.10
+ - torch == 2.6.0
+ - torch-npu == 2.6.0
+ - transformers == 4.53.2
+
+ The above configuration has been verified; higher versions may also work. Please submit an issue if you run into problems.
+
+ ### 4.2 Inference Examples
+
+ Below is a simple example of using openPangu-7B-Diffusion-DeepDiver with the `transformers` framework and the DeepDiver v2 Agent framework.
+
+ #### Loading and Running
+
+ Before running, modify `generate.py` to specify the model path.
+
+ ```bash
+ cd inference
+ python generate.py
+ ```
+
+ For optimal throughput, set the sampling parameters to `alg="confidence_threshold", threshold=0.9, num_small_blocks=1`, and choose an appropriate batch size for your hardware, as in the example below.
+
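+ Building on `inference/generate.py` from this repository, the throughput-oriented call looks like this (the surrounding model and tokenizer setup is unchanged from that script):
+
+ ```python
+ output = model.diffusion_generate(
+     input_ids,
+     attention_mask=attention_mask,
+     block_length=32,
+     max_new_tokens=128,
+     temperature=0.0,
+     alg="confidence_threshold",   # throughput-oriented sampling
+     threshold=0.9,
+     num_small_blocks=1,
+     mask_token_id=45830,
+     eos_token_id=tokenizer.eos_token_id,
+ )
+ ```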
+ #### Service Deployment
+
+ Download the [lightweight service script](https://github.com/LinWeizheDragon/dllm-agent/blob/main/launch_server.py), place it in the model directory, and start the service:
+
+ ```bash
+ python launch_server.py --load /path/to/model --port 9999
+ ```
+
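+ A quick smoke test against the running service, assuming it exposes an OpenAI-compatible `/v1/chat/completions` route (an assumption; check `launch_server.py` for the actual endpoint and payload schema):
+
+ ```python
+ import requests
+
+ # Hypothetical request shape; adjust to whatever launch_server.py actually serves.
+ resp = requests.post(
+     "http://127.0.0.1:9999/v1/chat/completions",
+     json={
+         "model": "local-diffusion-llm",
+         "messages": [{"role": "user", "content": "hello"}],
+     },
+     timeout=60,
+ )
+ print(resp.json())
+ ```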
+ #### DeepDiver v2
+
+ Download the [DeepDiver v2 package](https://ai.gitcode.com/ascend-tribe/openPangu-Embedded-7B-DeepDiver/tree/main/deepdiver_v2) (no model weights required) and install it following the official documentation.
+
+ Copy `env.template` to `config/.env`, set `MODEL_REQUEST_URL` to the model service URL, and set `MODEL_NAME` to match the deployed model name (default: `local-diffusion-llm`); an illustrative `.env` follows.
+
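+ A sketch of the resulting `config/.env` (the URL is illustrative; follow `env.template` for the authoritative set of keys):
+
+ ```bash
+ # config/.env -- illustrative values only
+ MODEL_REQUEST_URL=http://127.0.0.1:9999
+ MODEL_NAME=local-diffusion-llm
+ ```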
+ Start the MCP service:
+
+ ```bash
+ python src/tools/mcp_server_standard.py
+ ```
+
+ Send a query to DeepDiver v2:
+
+ ```bash
+ python cli/demo.py -q "今天北京的天气怎么样?"
+ ```
+
+ For more usage details, refer to the official repository.
+
+ Currently, openPangu-7B-Diffusion-DeepDiver has only been trained and tested within the DeepDiver v2 framework. It has not been adapted to other agent frameworks or tasks, and performance on other setups is not guaranteed.
+
+ ## 5. License
+
+ When using the model or its outputs, please cite the technical report:
+ “DLLM Agent: See Farther, Run Faster” (arXiv:2602.07451v2).
+
+ Unless otherwise specified, openPangu-7B-Diffusion-DeepDiver is licensed under the OPENPANGU MODEL LICENSE AGREEMENT VERSION 1.0, which aims to promote the development of AI technologies. See the [LICENSE](LICENSE) file in the repository root for details.
+
+ ## 6. Disclaimer
+
+ Due to inherent technical limitations and the nature of AI-generated content, Huawei makes no guarantees regarding the following:
+
+ * The generated outputs may contain defects, inaccuracies, or inappropriate content, and do not represent Huawei’s views.
+ * The model is not guaranteed to be 100% accurate, reliable, complete, timely, secure, error-free, uninterrupted, or stable.
+ * The outputs do not constitute advice or decisions and do not guarantee authenticity, completeness, accuracy, legality, or usefulness. They cannot replace professional advice in medical, legal, or other domains. Users must make independent judgments, and Huawei assumes no responsibility.
+
+ ## 7. Feedback
+
+ For suggestions or feedback, please submit an issue or contact: [openPangu@huawei.com](mailto:openPangu@huawei.com).
+
+ ## 8. Citation
+
+ ```bibtex
+ @article{zhen2026dllm,
+   title={DLLM Agent: See Farther, Run Faster},
+   author={Zhen, Huiling and Lin, Weizhe and Liu, Renxi and Han, Kai and Li, Yiming and Tian, Yuchuan and Chen, Hanting and Li, Xiaoguang and Li, Xiaosong and Chen, Chen and others},
+   journal={arXiv preprint arXiv:2602.07451},
+   year={2026}
+ }
+ ```
README_CN.md ADDED
@@ -0,0 +1,162 @@
+ # Open-Source Pangu openPangu-7B-Diffusion-DeepDiver
+
+ 中文 | [English](README.md)
+
+ ## 1. Introduction
+
+ Open-Source Pangu openPangu-7B-Diffusion-DeepDiver is a 7B language model based on block diffusion large language models (Diffusion LLMs), specifically trained and fine-tuned for multi-agent scenarios (including tool invocation, information retrieval, and multi-step decision-making). Its underlying architecture and inference pipeline follow the design of [openPangu-R-7B-Diffusion](https://ai.gitcode.com/ascend-tribe/openPangu-R-7B-Diffusion/blob/main/README.md) (including block-wise denoising and bidirectional attention within blocks), so single-pass generation and parallel decoding keep the same structure and interface experience.
+
+ For the model's complete evaluations and training details, please refer to the technical report "DLLM Agent: See Farther, Run Faster" ([arXiv:2602.07451v2](https://arxiv.org/html/2602.07451v2)).
+
+ - Open-Source Pangu openPangu-7B-Diffusion-DeepDiver: agent model with a context length of 32k.
+
+ ### Key Features
+ - Uses the same DLLM architecture and iterative inference pipeline as openPangu-R-7B-Diffusion.
+ - Training data and fine-tuning objectives built specifically for in-depth agent research scenarios, making the model more robust in multi-round tool invocation and planning tasks.
+ - Introduces the context-clean corruption and span-aware attention alignment training strategies to reduce diffusion noise propagation in multi-round agent dialogues and improve the reliability of tool-invocation formats.
+
+ #### Inference
+ ![Context_Causal_Block_Diffusion_LLM](images/Context_Causal_Block_Diffusion_LLM.png)
+
+ Open-Source Pangu openPangu-7B-Diffusion-DeepDiver adopts **context-causal block diffusion decoding**, performing diffusion decoding block by block. During decoding, full attention is used within the block and causal attention over the preceding context. Once all tokens in a block are decoded, the whole block is stored in the context KV cache under a causal attention mask, while the first token of the next block is decoded in the same pass (a simplified sketch follows the list below).
+
+ - Supports variable-length inference and KV caching.
+ - Flexible context length, not limited by the block size.
+ - Supports both autoregressive and block diffusion decoding.
+ - Uses confidence-threshold sampling, with up to 2.5× higher throughput than standard autoregressive decoding.
+ - Similar to Fast dLLMv2, small blocks can be configured within each block to trade throughput against quality, typically performing best with small-block lengths of 4 or 8.
+
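+ A minimal sketch of one block's denoising loop, assuming a model whose logit at position *i−1* predicts token *i* and a non-empty prompt; KV-cache handling and small blocks are omitted here (the full implementation is in `inference/generation_utils.py`):
+
+ ```python
+ import torch
+
+ def denoise_block(model, x, start, end, mask_token_id, threshold=0.9):
+     """Iteratively fill one block of x; earlier blocks are treated as fixed context."""
+     while (x[:, start:end] == mask_token_id).any():
+         logits = model(x).logits[:, start - 1:end - 1]   # logit at i-1 predicts token i
+         probs = torch.softmax(logits, dim=-1)
+         conf, cand = probs.max(dim=-1)
+         masked = x[:, start:end] == mask_token_id
+         conf = conf.masked_fill(~masked, -torch.inf)
+         commit = masked & (conf > threshold)             # confidence-threshold sampling
+         best = conf.argmax(dim=1)                        # most confident masked position per row
+         commit[torch.arange(x.size(0)), best] = True     # always commit at least one token
+         commit &= masked                                 # never overwrite decoded tokens
+         x[:, start:end][commit] = cand[commit]
+     return x
+ ```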
+ ##### Inference in Agent Workflows
+ Integrated into the [DeepDiver v2 Agent](https://ai.gitcode.com/ascend-tribe/openPangu-Embedded-7B-DeepDiver/tree/main/deepdiver_v2) workflow, the model uses DLLM's iterative denoising inference when generating tool-call content in each round.
+ DeepDiver v2 is a MAS (Multi-Agent System) architecture centered on a Planner that coordinates multiple Executors.
+ For a detailed description of DeepDiver v2, see its [technical report](https://ai.gitcode.com/ascend-tribe/openPangu-Embedded-7B-DeepDiver).
+
+ #### Training
+
+ ![Training overview](images/image.png)
+
+ ##### Training Corpus
+
+ Training uses 11k agent trajectories that were specially collected or synthesized (including planner → seeker multi-agent interactions, real tool calls, and tool-return traces). These data are meant to teach the model to generate semantically consistent and format-compliant tool-invocation instructions in multi-round interactions. See the discussion of "Agent-oriented Fine-tuning" in the technical report for details.
+
+ ##### Supervision Method
+
+ Training jointly optimizes the cross-entropy losses of the diffusion and autoregressive objectives, which keeps training stable and preserves the model's ability to generate reliably from left to right.
+
+ ##### Masking and Attention Alignment
+
+ To address the information contamination that diffusion causes when multi-round dialogue context is mixed with the model's tool-call outputs, training masks out irrelevant context segments (context-clean corruption) and aligns the attention of generated tokens within their generation spans (span-aware attention alignment). Tests on agent datasets show that both modifications improve the final information-retrieval scores.
+
+ ## 2. Model Architecture
+
+ |                                | openPangu-7B-Diffusion-DeepDiver |
+ | :----------------------------: | :-------------------------: |
+ | **Architecture** | Dense |
+ | **Parameters (Non-Embedding)** | 7B |
+ | **Number of Layers** | 34 |
+ | **Intermediate (FFN) Dimension** | 12800 |
+ | **Attention Mechanism** | GQA |
+ | **Number of Attention Heads** | 32 for Q, 8 for KV |
+ | **Vocabulary Size** | 153k |
+ | **Context Length** | 32k |
+ | **Continued Training Tokens** | 700B |
+
+ ## 3. Evaluation Results
+
+ Table 1. Model comparison on a 110-question subset of BrowseComp-zh.
+
+ | Method | Accuracy (%) | Tool Calls | Agent Rounds | Tool Failure Rate |
+ | ---------------------------------- | -----------: | ---------: | ---------: | ------------------: |
+ | AR Agent (autoregressive backbone) | 15.5 | 7.5 | 14.8 | 1.9% |
+ | DLLM Agent (diffusion backbone) | 15.5 | 6.7 | 13.0 | 6.4% |
+
+ Although final accuracy on this subset is comparable to the AR baseline, the DLLM agent is more economical in tool calls and sub-agent rounds and shows about a 30% average end-to-end latency reduction; however, it also shows a higher tool failure rate, indicating that it is still not as stable as AR models.
+
+ ## 4. Deployment and Usage
+
+ ### 4.1 Environment Setup
+
+ ##### Hardware Requirements
+
+ Atlas 800T A2 (64GB). For drivers and firmware packages, see [[Atlas 800T A2](https://www.hiascend.com/hardware/firmware-drivers/community?product=4&model=26&cann=8.2.RC1.alpha003&driver=Ascend+HDK+25.0.RC1)].
+
+ ##### Software Environment
+
+ - OS: Linux (openEuler >= 24.03 recommended)
+ - CANN == 8.1.RC1; for installation preparation and steps, see [[CANN Install]](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=Ubuntu&Software=cannToolKit)
+ - python == 3.10
+ - torch == 2.6.0
+ - torch-npu == 2.6.0
+ - transformers == 4.53.2
+
+ The above configuration has been verified; higher versions should work in principle. Please submit an issue if you have questions.
+
+ ### 4.2 Inference Examples
+
+ Below is a simple example of running Open-Source Pangu openPangu-7B-Diffusion-DeepDiver on the `transformers` framework together with the DeepDiver v2 Agent framework:
+
+ #### Loading and Running
+
+ Before running, modify `generate.py` to add the model path.
+
+ ```bash
+ cd inference
+ python generate.py
+ ```
+
+ Unlike the benchmark settings, for optimal throughput the sampling parameters should be set to `alg="confidence_threshold", threshold=0.9, num_small_blocks=1`, with an appropriate batch size chosen for the device, as in the example below.
+
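+ The throughput-oriented call, based on `inference/generate.py` from this repository (model and tokenizer setup unchanged from that script):
+
+ ```python
+ output = model.diffusion_generate(
+     input_ids,
+     attention_mask=attention_mask,
+     block_length=32,
+     max_new_tokens=128,
+     temperature=0.0,
+     alg="confidence_threshold",   # throughput-oriented sampling
+     threshold=0.9,
+     num_small_blocks=1,
+     mask_token_id=45830,
+     eos_token_id=tokenizer.eos_token_id,
+ )
+ ```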
+ #### Service Deployment
+ Download the [lightweight serving script](https://github.com/LinWeizheDragon/dllm-agent/blob/main/launch_server.py), place it in the model folder, and start a simple service with:
+
+ ```bash
+ python launch_server.py --load /path/to/model --port 9999
+ ```
+
+ #### DeepDiver v2
+ Download the [DeepDiver v2 package](https://ai.gitcode.com/ascend-tribe/openPangu-Embedded-7B-DeepDiver/tree/main/deepdiver_v2) from the official DeepDiver v2 repository (model weights are not needed) and install it following the official documentation. Copy `env.template` to `config/.env`, set `MODEL_REQUEST_URL` to the URL of the model service, and change `MODEL_NAME` to the model name served by the model service (default: `local-diffusion-llm`); an illustrative `.env` follows.
+
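+ A sketch of the resulting `config/.env` (the URL is illustrative; follow `env.template` for the authoritative set of keys):
+
+ ```bash
+ # config/.env -- illustrative values only
+ MODEL_REQUEST_URL=http://127.0.0.1:9999
+ MODEL_NAME=local-diffusion-llm
+ ```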
+ Start the MCP service:
+ ```bash
+ python src/tools/mcp_server_standard.py
+ ```
+
+ Send a query to DeepDiver v2:
+
+ ```bash
+ python cli/demo.py -q "今天北京的天气怎么样?"
+ ```
+
+ For other DeepDiver v2 usage, refer to its official repository.
+
+ At present, the Open-Source Pangu openPangu-7B-Diffusion-DeepDiver model has only been trained and tested within the DeepDiver v2 framework and has not been adapted to other agent frameworks or tasks. We cannot guarantee its performance on other frameworks or tasks.
+
+ ## 5. Model License
+
+ Please cite the technical report when using the model or its outputs: "DLLM Agent: See Farther, Run Faster" (arXiv:2602.07451v2).
+
+ Unless otherwise agreed in the files' open-source license terms, the Open-Source Pangu openPangu-7B-Diffusion-DeepDiver model is licensed under the OPENPANGU MODEL LICENSE AGREEMENT VERSION 1.0, which is intended to permit use and promote the further development of AI technologies. For details, see the [LICENSE](LICENSE) file in the root directory of the model repository.
+
+ ## 6. Disclaimer
+
+ Due to the technical limitations inherent in the technology on which Open-Source Pangu openPangu-7B-Diffusion-DeepDiver (the "Model") relies, and because AI-generated content is produced automatically by openPangu, Huawei cannot make any guarantees regarding the following:
+
+ - Although the Model's outputs are generated by AI algorithms, the possibility cannot be excluded that some information is flawed, unreasonable, or disturbing; the generated content does not represent Huawei's attitude or position.
+ - There is no guarantee that the Model is 100% accurate, reliable, fully functional, timely, secure, error-free, uninterrupted, continuously stable, or free of any faults.
+ - The Model's outputs do not constitute advice or decisions of any kind, and the authenticity, completeness, accuracy, timeliness, legality, functionality, or usefulness of the generated content is not guaranteed. Generated content cannot replace professionals in medicine, law, or other fields in answering your questions, and is for reference only; it does not represent any attitude, position, or view of Huawei. You must make independent judgments based on your actual situation, and Huawei assumes no responsibility.
+
+ ## 7. Feedback
+
+ For any comments or suggestions, please submit an issue or contact openPangu@huawei.com.
+
+ ## 8. Citation
+
+ ```bibtex
+ @article{zhen2026dllm,
+   title={DLLM Agent: See Farther, Run Faster},
+   author={Zhen, Huiling and Lin, Weizhe and Liu, Renxi and Han, Kai and Li, Yiming and Tian, Yuchuan and Chen, Hanting and Li, Xiaoguang and Li, Xiaosong and Chen, Chen and others},
+   journal={arXiv preprint arXiv:2602.07451},
+   year={2026}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "architectures": [
+     "PanguEmbeddedForCausalLM"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_pangu_embedded.PanguEmbeddedConfig",
+     "AutoModel": "modeling_pangu_embedded.PanguEmbeddedModel",
+     "AutoModelForCausalLM": "modeling_pangu_embedded.PanguEmbeddedForCausalLM"
+   },
+   "bias": true,
+   "attention_dropout": 0.0,
+   "bos_token_id": 1,
+   "pad_token_id": 0,
+   "eos_token_id": 45892,
+   "hidden_act": "silu",
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 12800,
+   "max_position_embeddings": 32768,
+   "model_type": "PanguEmbedded",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 34,
+   "num_key_value_heads": 8,
+   "rms_norm_eps": 1e-05,
+   "rope_theta": 16000000.0,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.53.2",
+   "use_cache": true,
+   "vocab_size": 153376
+ }
generation_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "_from_model_config": true,
+   "do_sample": true,
+   "bos_token_id": 1,
+   "pad_token_id": 0,
+   "eos_token_id": 45892,
+   "temperature": 1.0,
+   "top_k": 0,
+   "top_p": 0.8,
+   "transformers_version": "4.53.2"
+ }
images/Context_Causal_Block_Diffusion_LLM.png ADDED
images/image.png ADDED

Git LFS Details

  • SHA256: 7511e44ecdfe80191db8895e19aebe1b7d7294b0fb4608708a0071fdb0e56b2f
  • Pointer size: 131 Bytes
  • Size of remote file: 183 kB
inference/generate.py ADDED
@@ -0,0 +1,56 @@
+ # coding=utf-8
+ # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+ import types
+
+ import torch
+ try:
+     import torch_npu  # optional: enables the Ascend NPU backend
+ except ImportError:
+     pass
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ from generation_utils import diffusion_generate
+
+ model_local_path = "path_to_openPangu-7B-Diffusion-Base"
+
+ # Load the tokenizer and the model.
+ tokenizer = AutoTokenizer.from_pretrained(
+     model_local_path,
+     use_fast=False,
+     trust_remote_code=True,
+     local_files_only=True
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_local_path,
+     trust_remote_code=True,
+     torch_dtype="auto",
+     device_map="npu",
+     local_files_only=True
+ )
+
+ # Attach the block diffusion decoding loop as a bound method.
+ model.diffusion_generate = types.MethodType(diffusion_generate, model)
+
+ mask_token_id = 45830
+ eos_token_id = tokenizer.eos_token_id
+
+ prompts = ["introduce the china", "hello",
+            "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. "
+            "How many clips did Natalia sell altogether in April and May?"]
+ input_ids = tokenizer(prompts, return_tensors="pt", padding=True, padding_side="left").input_ids.to(model.device)
+ # Create the attention mask: non-padding tokens are True (attended), padding tokens are False (ignored).
+ attention_mask = input_ids.ne(tokenizer.pad_token_id)
+
+ output = model.diffusion_generate(
+     input_ids,
+     block_length=32,
+     attention_mask=attention_mask,
+     temperature=0.0,
+     max_new_tokens=128,
+     alg="entropy",
+     mask_token_id=mask_token_id,
+     eos_token_id=eos_token_id,
+     num_small_blocks=4
+ )
+ generation = tokenizer.batch_decode(output[:, input_ids.shape[1]:].tolist())
+ generation = [x.split(tokenizer.eos_token)[0].strip() for x in generation]
+ print(generation)
inference/generation_utils.py ADDED
@@ -0,0 +1,313 @@
+ # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+ # Copyright 2025 NVIDIA CORPORATION & AFFILIATES
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ #
+ # SPDX-License-Identifier: Apache-2.0
+ # Modified from the Dream repo: https://github.com/HKUNLP/Dream
+
+ from collections.abc import Iterable
+ from typing import Any, Optional
+
+ import torch
+ try:
+     import torch_npu  # optional: enables the Ascend NPU backend
+ except ImportError:
+     pass
+ import torch.distributions as dists
+ from torch.nn import functional as F
+
+ from transformers.cache_utils import DynamicCache
+
+
+ def top_p_logits(logits, top_p=None):
+     sorted_logits, sorted_indices = torch.sort(logits, descending=True)
+     cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+     sorted_indices_to_remove = cumulative_probs > top_p
+     # Shift the indices to the right to keep the first token above the threshold.
+     sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+     sorted_indices_to_remove[..., 0] = 0
+
+     mask = torch.zeros_like(logits, dtype=torch.bool, device=logits.device)
+     mask = mask.scatter_(-1, sorted_indices, sorted_indices_to_remove)
+     logits = logits.masked_fill(mask, torch.finfo(logits.dtype).min)
+     return logits
+
+
+ def top_k_logits(logits, top_k=None):
+     top_k = min(top_k, logits.size(-1))  # Safety check
+     # Remove all tokens with a probability less than the last token of the top-k.
+     indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
+     logits = logits.masked_fill(indices_to_remove, torch.finfo(logits.dtype).min)
+     return logits
+
+
+ def sample_tokens(logits, temperature=0.0, top_p=None, top_k=None, margin_confidence=False, neg_entropy=False):
+     if temperature > 0:
+         logits = logits / temperature
+     if top_p is not None and top_p < 1:
+         logits = top_p_logits(logits, top_p)
+     if top_k is not None:
+         logits = top_k_logits(logits, top_k)
+     probs = torch.softmax(logits, dim=-1)
+
+     if temperature > 0:
+         try:
+             x0 = dists.Categorical(probs=probs).sample()
+             confidence = torch.gather(probs, -1, x0.unsqueeze(-1)).squeeze(-1)
+         except Exception:
+             # Fall back to greedy decoding if sampling fails (e.g. degenerate probabilities).
+             confidence, x0 = probs.max(dim=-1)
+     else:
+         confidence, x0 = probs.max(dim=-1)
+
+     if margin_confidence:
+         sorted_probs, _ = torch.sort(probs, dim=-1, descending=True)
+         # Extract the top-1 and top-2 probabilities.
+         top1_probs = sorted_probs[:, 0]
+         top2_probs = sorted_probs[:, 1]
+         # Confidence is the top-1/top-2 margin.
+         confidence = top1_probs - top2_probs
+
+     if neg_entropy:
+         epsilon = 1e-10
+         log_probs = torch.log(probs + epsilon)
+         # Negative entropy: closer to zero means a more peaked distribution.
+         confidence = torch.sum(probs * log_probs, dim=-1)
+
+     return confidence, x0
+
+
+ class BlockDynamicCache(DynamicCache):
+     """
+     When `skip_cache_update` is True, this class does NOT update the cached key and value states.
+     Instead, it concatenates the current states with the original cached states along the sequence
+     dimension and returns the result.
+
+     Example:
+
+     ```python
+     >>> past_key_values = BlockDynamicCache()
+     >>> past_key_values.skip_cache_update = True
+     >>> outputs.past_key_values
+     ```
+     """
+     def __init__(self, _distributed_cache_data: Optional[Iterable] = None) -> None:
+         """
+         Initialize a BlockDynamicCache instance.
+
+         `skip_cache_update` is False by default.
+         """
+         super().__init__(_distributed_cache_data)
+         self.skip_cache_update = False
+
+     def update(
+         self,
+         key_states: torch.Tensor,
+         value_states: torch.Tensor,
+         layer_idx: int,
+         cache_kwargs: Optional[dict[str, Any]] = None,
+     ) -> tuple[torch.Tensor, torch.Tensor]:
+         """
+         Updates the cache with the new `key_states` and `value_states` for the layer `layer_idx`.
+
+         Behavior depends on the `skip_cache_update` flag:
+         - If `skip_cache_update` is True:
+           * Does NOT update the stored cache.
+           * Concatenates the current `key_states` and `value_states`
+             with the original cached states along the sequence dimension.
+           * Returns the concatenated result.
+         - If `skip_cache_update` is False:
+           * Uses the parent class update logic to update the cache.
+
+         Parameters:
+             key_states (`torch.Tensor`):
+                 The new key states to cache.
+             value_states (`torch.Tensor`):
+                 The new value states to cache.
+             layer_idx (`int`):
+                 The index of the layer to cache the states for.
+             cache_kwargs (`dict[str, Any]`, *optional*):
+                 Additional arguments for the cache subclass. No additional arguments are used in `DynamicCache`.
+
+         Returns:
+             Tuple[torch.Tensor, torch.Tensor]:
+                 The updated key and value states after concatenation or update.
+                 When `skip_cache_update=True`, returns the concatenated tensors without modifying the cache.
+                 When `skip_cache_update=False`, returns the result from the parent class.
+         """
+         if self.skip_cache_update:
+             key_cache = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
+             value_cache = torch.cat([self.value_cache[layer_idx], value_states], dim=-2)
+             return key_cache, value_cache
+         return super().update(key_states, value_states, layer_idx, cache_kwargs)
+
+
+ @torch.no_grad()
+ def diffusion_generate(
+     model,
+     inputs: Optional[torch.Tensor] = None,
+     top_p: Optional[float] = None,
+     top_k: Optional[int] = None,
+     threshold: Optional[float] = 0.9,
+     num_small_blocks: Optional[int] = 1,
+     **kwargs,
+ ):
+     block_length = kwargs.pop("block_length", 32)
+     attention_mask = kwargs.pop("attention_mask", None)
+     alg = kwargs.get("alg", 'origin')
+     temperature = kwargs.get("temperature", 0.0)
+     mask_token_id = kwargs.get("mask_token_id", None)
+     eos_token_id = kwargs.get("eos_token_id", None)
+
+     if mask_token_id is None:
+         raise ValueError("mask_token_id must be provided")
+     if eos_token_id is None:
+         raise ValueError("eos_token_id must be provided")
+     if inputs is None:
+         raise ValueError("inputs must be provided")
+     if attention_mask is None:
+         raise ValueError("attention_mask must be provided")
+
+     input_ids = inputs
+
+     if isinstance(kwargs.get('max_new_tokens', None), int):
+         max_length = kwargs.get('max_new_tokens') + input_ids.shape[-1]
+     elif kwargs.get('max_length', None) is None:
+         raise ValueError("Pass max_new_tokens or max_length")
+     else:
+         max_length = kwargs.get('max_length')
+
+     prompt_length = input_ids.shape[1]
+     if (max_length - prompt_length) % block_length != 0:
+         raise ValueError(
+             f"The token length ({max_length - prompt_length}) "
+             f"cannot be evenly divided by the block length ({block_length})."
+         )
+
+     num_blocks = (max_length - prompt_length) // block_length
+     device = model.device
+     # Pad input_ids to max_length with mask tokens; they are denoised block by block.
+     x = F.pad(input_ids, (0, max_length - prompt_length), value=mask_token_id)
+
+     # Initialize the cache for the prompt.
+     past_key_values = BlockDynamicCache()
+
+     causal_mask = torch.tril(torch.ones(max_length, max_length, device=device, dtype=torch.bool))[None, None, :, :]
+
+     padding_mask = F.pad(attention_mask, (0, max_length - attention_mask.shape[1]), value=1.0)
+     position_ids = padding_mask.long().cumsum(-1) - 1
+     position_ids.masked_fill_(padding_mask == 0, 1)
+     # [B, N] --> [B, 1, N, N]
+     padding_mask = torch.logical_and(
+         padding_mask.unsqueeze(1).unsqueeze(-2),
+         padding_mask.unsqueeze(1).unsqueeze(-1),
+     )
+     attention_mask = padding_mask & causal_mask
+
+     # Prefill stage: run the prompt through once to populate the KV cache.
+     if prompt_length > 0:
+         cur_x = x[:, :prompt_length]
+         cur_attn_mask = attention_mask[:, :, :prompt_length, :prompt_length]
+         cur_position_ids = position_ids[:, :prompt_length]
+         output = model(cur_x,
+                        attention_mask=cur_attn_mask,
+                        position_ids=cur_position_ids,
+                        past_key_values=past_key_values,
+                        use_cache=True
+                        )
+         past_key_values = output.past_key_values
+
+         # The last prompt logit predicts the first token of the first block.
+         logits = output.logits[:, -1:]
+         confidence, x0 = sample_tokens(logits, temperature=temperature, top_p=top_p, top_k=top_k)
+         x[:, prompt_length:prompt_length + 1] = x0
+
+     # Process each block.
+     for num_block in range(num_blocks):
+         block_start = prompt_length + num_block * block_length
+         block_end = prompt_length + (num_block + 1) * block_length
+         cur_x = x[:, block_start:block_end]  # a view into x: in-place writes update x as well
+         cur_attn_mask = attention_mask[:, :, block_start:block_end, :block_end]
+         cur_padding_mask = padding_mask[:, :, block_start:block_end, :block_end]
+         cur_position_ids = position_ids[:, block_start:block_end]
+         # Use the cache for generation.
+         small_block_length = block_length // num_small_blocks
+
+         if block_length % num_small_blocks != 0:
+             raise ValueError(
+                 f"block_length ({block_length}) must be divisible by num_small_blocks ({num_small_blocks})."
+             )
+
+         # Just concatenate the current key/value states; do not update the KV cache yet.
+         past_key_values.skip_cache_update = True
+         for small_block_idx in range(num_small_blocks):
+             small_block_start = small_block_idx * small_block_length
+             small_block_end = small_block_start + small_block_length
+
+             while True:
+                 sub_mask_index = (cur_x[:, small_block_start:small_block_end] == mask_token_id)
+                 if sub_mask_index.sum() == 0:
+                     break
+
+                 output = model(cur_x,
+                                attention_mask=cur_padding_mask,
+                                position_ids=cur_position_ids,
+                                past_key_values=past_key_values,
+                                use_cache=True)
+                 logits = output.logits
+                 # Shift right so that position i holds the prediction for token i
+                 # (the logit at i - 1 predicts token i). Position 0 is already fixed
+                 # by the previous pass, so its duplicated logit is never used.
+                 logits = torch.cat([logits[:, :1], logits[:, :-1]], dim=1)
+                 logits = logits[:, small_block_start:small_block_end]
+
+                 confidence, x0 = sample_tokens(logits, temperature=temperature, top_p=top_p, top_k=top_k,
+                                                neg_entropy=(alg == 'entropy'), margin_confidence=(alg == 'topk_margin'))
+                 confidence = torch.where(sub_mask_index, confidence, -torch.inf)
+                 # Always commit the single most confident masked position ...
+                 transfer_index = (F.one_hot(torch.max(confidence, dim=1)[1], num_classes=small_block_length) == 1)
+                 if alg == 'confidence_threshold':
+                     # ... plus every masked position whose confidence clears the threshold.
+                     transfer_index |= (confidence > threshold)
+                 cur_x[:, small_block_start:small_block_end][transfer_index] = x0[transfer_index]
+
+         if eos_token_id and (x[:, prompt_length:] == eos_token_id).any(dim=1).all():
+             return x
+
+         # Store the finished block in the KV cache.
+         past_key_values.skip_cache_update = False
+         output = model(cur_x,
+                        attention_mask=cur_attn_mask,
+                        position_ids=cur_position_ids,
+                        past_key_values=past_key_values,
+                        use_cache=True,
+                        )
+         past_key_values = output.past_key_values
+         if num_block < num_blocks - 1:
+             # The last logit of this block predicts the first token of the next block.
+             logits = output.logits[:, -1:]
+             confidence, x0 = sample_tokens(logits, temperature=temperature, top_p=top_p, top_k=top_k)
+             x[:, block_end:block_end + 1] = x0
+
+     return x
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6f2083110fe3375524c42a93ae9ee7391297587af1b7f3d3748e9eff20302259
+ size 4926838832
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3a50abc1ba4dd600cc61de92a22c1749be5b084116de4581dee5c3aa430249d3
+ size 4991682264
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:72ef5b6232d903cb8a4904fbcae1e3318c8fc0485dc3257f1d3447eabe089053
+ size 4886849448
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bb7cc286436460507161ce41c3bf65790f20cb7135c55ee76e6cb29ca2c9e9ed
+ size 1256456320
model.safetensors.index.json ADDED
@@ -0,0 +1,486 @@
+ {
+   "metadata": {
+     "total_size": 16061784576
+   },
+   "weight_map": {
+     "lm_head.weight": "model-00004-of-00004.safetensors",
+     "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.rotary_emb.inv_freq": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+     "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
+     "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+     "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.self_attn.rotary_emb.inv_freq": "model-00001-of-00004.safetensors",
+     "model.layers.1.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+     "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.10.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.10.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.10.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.self_attn.rotary_emb.inv_freq": "model-00002-of-00004.safetensors",
+     "model.layers.10.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.10.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.rotary_emb.inv_freq": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.11.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.rotary_emb.inv_freq": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.rotary_emb.inv_freq": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.rotary_emb.inv_freq": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.self_attn.rotary_emb.inv_freq": "model-00002-of-00004.safetensors",
+     "model.layers.15.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.16.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.16.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.16.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.16.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.16.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.16.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.16.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.16.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.16.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.16.self_attn.rotary_emb.inv_freq": "model-00002-of-00004.safetensors",
+     "model.layers.16.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.16.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.17.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.17.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.17.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.17.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.17.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.17.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.17.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.17.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.17.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.17.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.17.self_attn.rotary_emb.inv_freq": "model-00002-of-00004.safetensors",
+     "model.layers.17.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.17.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.18.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.18.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.18.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.18.self_attn.rotary_emb.inv_freq": "model-00002-of-00004.safetensors",
+     "model.layers.18.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.18.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.rotary_emb.inv_freq": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.rotary_emb.inv_freq": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+     "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.20.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.rotary_emb.inv_freq": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.21.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.21.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.21.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.21.self_attn.o_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.21.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.21.self_attn.rotary_emb.inv_freq": "model-00002-of-00004.safetensors",
+     "model.layers.21.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
+     "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "model.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "model.layers.22.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "model.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.22.self_attn.rotary_emb.inv_freq": "model-00003-of-00004.safetensors",
+     "model.layers.22.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
+     "model.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.23.input_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.23.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.23.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.23.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+     "model.layers.23.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
+     "model.layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.23.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
+     "model.layers.23.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+     "model.layers.23.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
+     "model.layers.23.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
243
+ "model.layers.23.self_attn.rotary_emb.inv_freq": "model-00003-of-00004.safetensors",
244
+ "model.layers.23.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
245
+ "model.layers.23.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
246
+ "model.layers.24.input_layernorm.weight": "model-00003-of-00004.safetensors",
247
+ "model.layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
248
+ "model.layers.24.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
249
+ "model.layers.24.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
250
+ "model.layers.24.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
251
+ "model.layers.24.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
252
+ "model.layers.24.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
253
+ "model.layers.24.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
254
+ "model.layers.24.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
255
+ "model.layers.24.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
256
+ "model.layers.24.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
257
+ "model.layers.24.self_attn.rotary_emb.inv_freq": "model-00003-of-00004.safetensors",
258
+ "model.layers.24.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
259
+ "model.layers.24.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
260
+ "model.layers.25.input_layernorm.weight": "model-00003-of-00004.safetensors",
261
+ "model.layers.25.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
262
+ "model.layers.25.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
263
+ "model.layers.25.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
264
+ "model.layers.25.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
265
+ "model.layers.25.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
266
+ "model.layers.25.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
267
+ "model.layers.25.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
268
+ "model.layers.25.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
269
+ "model.layers.25.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
270
+ "model.layers.25.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
271
+ "model.layers.25.self_attn.rotary_emb.inv_freq": "model-00003-of-00004.safetensors",
272
+ "model.layers.25.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
273
+ "model.layers.25.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
274
+ "model.layers.26.input_layernorm.weight": "model-00003-of-00004.safetensors",
275
+ "model.layers.26.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
276
+ "model.layers.26.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
277
+ "model.layers.26.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
278
+ "model.layers.26.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
279
+ "model.layers.26.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
280
+ "model.layers.26.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
281
+ "model.layers.26.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
282
+ "model.layers.26.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
283
+ "model.layers.26.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
284
+ "model.layers.26.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
285
+ "model.layers.26.self_attn.rotary_emb.inv_freq": "model-00003-of-00004.safetensors",
286
+ "model.layers.26.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
287
+ "model.layers.26.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
288
+ "model.layers.27.input_layernorm.weight": "model-00003-of-00004.safetensors",
289
+ "model.layers.27.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
290
+ "model.layers.27.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
291
+ "model.layers.27.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
292
+ "model.layers.27.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
293
+ "model.layers.27.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
294
+ "model.layers.27.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
295
+ "model.layers.27.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
296
+ "model.layers.27.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
297
+ "model.layers.27.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
298
+ "model.layers.27.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
299
+ "model.layers.27.self_attn.rotary_emb.inv_freq": "model-00003-of-00004.safetensors",
300
+ "model.layers.27.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
301
+ "model.layers.27.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
302
+ "model.layers.28.input_layernorm.weight": "model-00003-of-00004.safetensors",
303
+ "model.layers.28.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
304
+ "model.layers.28.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
305
+ "model.layers.28.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
306
+ "model.layers.28.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
307
+ "model.layers.28.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
308
+ "model.layers.28.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
309
+ "model.layers.28.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
310
+ "model.layers.28.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
311
+ "model.layers.28.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
312
+ "model.layers.28.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
313
+ "model.layers.28.self_attn.rotary_emb.inv_freq": "model-00003-of-00004.safetensors",
314
+ "model.layers.28.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
315
+ "model.layers.28.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
316
+ "model.layers.29.input_layernorm.weight": "model-00003-of-00004.safetensors",
317
+ "model.layers.29.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
318
+ "model.layers.29.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
319
+ "model.layers.29.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
320
+ "model.layers.29.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
321
+ "model.layers.29.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
322
+ "model.layers.29.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
323
+ "model.layers.29.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
324
+ "model.layers.29.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
325
+ "model.layers.29.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
326
+ "model.layers.29.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
327
+ "model.layers.29.self_attn.rotary_emb.inv_freq": "model-00003-of-00004.safetensors",
328
+ "model.layers.29.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
329
+ "model.layers.29.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
330
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
331
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
332
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
333
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
334
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
335
+ "model.layers.3.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
336
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
337
+ "model.layers.3.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
338
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
339
+ "model.layers.3.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
340
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
341
+ "model.layers.3.self_attn.rotary_emb.inv_freq": "model-00001-of-00004.safetensors",
342
+ "model.layers.3.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
343
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
344
+ "model.layers.30.input_layernorm.weight": "model-00003-of-00004.safetensors",
345
+ "model.layers.30.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
346
+ "model.layers.30.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
347
+ "model.layers.30.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
348
+ "model.layers.30.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
349
+ "model.layers.30.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
350
+ "model.layers.30.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
351
+ "model.layers.30.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
352
+ "model.layers.30.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
353
+ "model.layers.30.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
354
+ "model.layers.30.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
355
+ "model.layers.30.self_attn.rotary_emb.inv_freq": "model-00003-of-00004.safetensors",
356
+ "model.layers.30.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
357
+ "model.layers.30.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
358
+ "model.layers.31.input_layernorm.weight": "model-00003-of-00004.safetensors",
359
+ "model.layers.31.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
360
+ "model.layers.31.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
361
+ "model.layers.31.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
362
+ "model.layers.31.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
363
+ "model.layers.31.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
364
+ "model.layers.31.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
365
+ "model.layers.31.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
366
+ "model.layers.31.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
367
+ "model.layers.31.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
368
+ "model.layers.31.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
369
+ "model.layers.31.self_attn.rotary_emb.inv_freq": "model-00003-of-00004.safetensors",
370
+ "model.layers.31.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
371
+ "model.layers.31.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
372
+ "model.layers.32.input_layernorm.weight": "model-00003-of-00004.safetensors",
373
+ "model.layers.32.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
374
+ "model.layers.32.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
375
+ "model.layers.32.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
376
+ "model.layers.32.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
377
+ "model.layers.32.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
378
+ "model.layers.32.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
379
+ "model.layers.32.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
380
+ "model.layers.32.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
381
+ "model.layers.32.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
382
+ "model.layers.32.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
383
+ "model.layers.32.self_attn.rotary_emb.inv_freq": "model-00003-of-00004.safetensors",
384
+ "model.layers.32.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
385
+ "model.layers.32.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
386
+ "model.layers.33.input_layernorm.weight": "model-00003-of-00004.safetensors",
387
+ "model.layers.33.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
388
+ "model.layers.33.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
389
+ "model.layers.33.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
390
+ "model.layers.33.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
391
+ "model.layers.33.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
392
+ "model.layers.33.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
393
+ "model.layers.33.self_attn.o_proj.bias": "model-00003-of-00004.safetensors",
394
+ "model.layers.33.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
395
+ "model.layers.33.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
396
+ "model.layers.33.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
397
+ "model.layers.33.self_attn.rotary_emb.inv_freq": "model-00003-of-00004.safetensors",
398
+ "model.layers.33.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
399
+ "model.layers.33.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
400
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00004.safetensors",
401
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
402
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
403
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
404
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
405
+ "model.layers.4.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
406
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
407
+ "model.layers.4.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
408
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
409
+ "model.layers.4.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
410
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
411
+ "model.layers.4.self_attn.rotary_emb.inv_freq": "model-00001-of-00004.safetensors",
412
+ "model.layers.4.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
413
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
414
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00004.safetensors",
415
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
416
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
417
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
418
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
419
+ "model.layers.5.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
420
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
421
+ "model.layers.5.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
422
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
423
+ "model.layers.5.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
424
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
425
+ "model.layers.5.self_attn.rotary_emb.inv_freq": "model-00001-of-00004.safetensors",
426
+ "model.layers.5.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
427
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
428
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00004.safetensors",
429
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
430
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
431
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
432
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
433
+ "model.layers.6.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
434
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
435
+ "model.layers.6.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
436
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
437
+ "model.layers.6.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
438
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
439
+ "model.layers.6.self_attn.rotary_emb.inv_freq": "model-00001-of-00004.safetensors",
440
+ "model.layers.6.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
441
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
442
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00004.safetensors",
443
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
444
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
445
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
446
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
447
+ "model.layers.7.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
448
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
449
+ "model.layers.7.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
450
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
451
+ "model.layers.7.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
452
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
453
+ "model.layers.7.self_attn.rotary_emb.inv_freq": "model-00001-of-00004.safetensors",
454
+ "model.layers.7.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
455
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
456
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00004.safetensors",
457
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
458
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
459
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
460
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
461
+ "model.layers.8.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
462
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
463
+ "model.layers.8.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
464
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
465
+ "model.layers.8.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
466
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
467
+ "model.layers.8.self_attn.rotary_emb.inv_freq": "model-00001-of-00004.safetensors",
468
+ "model.layers.8.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
469
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
470
+ "model.layers.9.input_layernorm.weight": "model-00002-of-00004.safetensors",
471
+ "model.layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
472
+ "model.layers.9.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
473
+ "model.layers.9.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
474
+ "model.layers.9.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
475
+ "model.layers.9.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
476
+ "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
477
+ "model.layers.9.self_attn.o_proj.bias": "model-00001-of-00004.safetensors",
478
+ "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
479
+ "model.layers.9.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
480
+ "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
481
+ "model.layers.9.self_attn.rotary_emb.inv_freq": "model-00001-of-00004.safetensors",
482
+ "model.layers.9.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
483
+ "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
484
+ "model.norm.weight": "model-00003-of-00004.safetensors"
485
+ }
486
+ }
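Note: the `weight_map` above ties each parameter name to the shard file that stores it, which is how sharded checkpoints are resolved at load time. A minimal sketch of looking up one tensor by hand; the local directory name is a placeholder, and a full download of the shards is assumed:

import json
from safetensors import safe_open

model_dir = "./openPangu-Embedded"  # placeholder: local checkout of this repo
with open(f"{model_dir}/model.safetensors.index.json") as f:
    index = json.load(f)

name = "model.layers.21.mlp.up_proj.weight"
shard = index["weight_map"][name]   # per the map above: "model-00003-of-00004.safetensors"
with safe_open(f"{model_dir}/{shard}", framework="pt") as f:
    tensor = f.get_tensor(name)
print(name, tuple(tensor.shape), "from", shard)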
modeling_openpangu_dense.py ADDED
@@ -0,0 +1,576 @@
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
+ # This file was automatically generated from modular_openpangu_dense.py.
+ # Do NOT edit this file manually as any edits will be overwritten by the generation of
+ # the file from the modular. If any change should be done, please apply the change to the
+ # modular_openpangu_dense.py file directly. One of our CI enforces this.
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
+
+ # coding=utf-8
+ # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All Rights Reserved.
+ #
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+ # and OPT implementations in this library. It has been modified from its
+ # original forms to accommodate minor architectural differences compared
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ from typing import Callable, Optional, Union
+
+ import torch
+ from torch import nn
+ try:
+     import torch_npu
+     from torch_npu.contrib import transfer_to_npu
+     if "910" in torch.npu.get_device_name():
+         NPU_ATTN_INFR = True
+         print("[INFO] torch_npu detected. Using NPU fused infer attention.")
+     else:
+         NPU_ATTN_INFR = False
+ except ImportError:
+     NPU_ATTN_INFR = False
+
+
+ from transformers.activations import ACT2FN
+ from transformers.cache_utils import Cache, DynamicCache
+ from transformers.generation import GenerationMixin
+ from transformers.masking_utils import create_causal_mask
+ from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
+ from transformers.modeling_layers import GradientCheckpointingLayer
+ from transformers.modeling_outputs import (
+     BaseModelOutputWithPast,
+     CausalLMOutputWithPast,
+     SequenceClassifierOutputWithPast,
+ )
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
+ from transformers.processing_utils import Unpack
+ from transformers.utils import LossKwargs, auto_docstring, can_return_tuple, logging
+ from .configuration_openpangu_dense import PanguEmbeddedConfig
+
+ logger = logging.get_logger(__name__)
+
+
+ class PanguEmbeddedRMSNorm(nn.Module):
+     def __init__(self, hidden_size, eps=1e-6):
+         """
+         PanguEmbeddedRMSNorm is equivalent to T5LayerNorm
+         """
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones(hidden_size))
+         self.variance_epsilon = eps
+
+     def forward(self, hidden_states):
+         input_dtype = hidden_states.dtype
+         hidden_states = hidden_states.to(torch.float32)
+         variance = hidden_states.pow(2).mean(-1, keepdim=True)
+         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+         return self.weight * hidden_states.to(input_dtype)
+
+     def extra_repr(self):
+         return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
+
+
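# --- Annotation (not part of the generated file) ---
# A quick numeric check of the RMSNorm above, assuming the class is importable
# from this module; with the default unit weight it reduces to
# x / sqrt(mean(x**2, dim=-1) + eps).
import torch
x = torch.randn(2, 4, 8)
norm = PanguEmbeddedRMSNorm(8, eps=1e-6)
manual = x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)
assert torch.allclose(norm(x), manual, atol=1e-5)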
+ class PanguEmbeddedRotaryEmbedding(nn.Module):
+     def __init__(self, config: PanguEmbeddedConfig, device=None):
+         super().__init__()
+         # BC: "rope_type" was originally "type"
+         if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
+             self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
+         else:
+             self.rope_type = "default"
+         self.max_seq_len_cached = config.max_position_embeddings
+         self.original_max_seq_len = config.max_position_embeddings
+
+         self.config = config
+         self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+
+         inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+         self.original_inv_freq = self.inv_freq
+
+     @torch.no_grad()
+     @dynamic_rope_update  # power user: used with advanced RoPE types (e.g. dynamic rope)
+     def forward(self, x, position_ids):
+         inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
+         position_ids_expanded = position_ids[:, None, :].float()
+
+         device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
+         with torch.autocast(device_type=device_type, enabled=False):  # Force float32
+             freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+             emb = torch.cat((freqs, freqs), dim=-1)
+             cos = emb.cos() * self.attention_scaling
+             sin = emb.sin() * self.attention_scaling
+
+         return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+
+
+ def rotate_half(x):
+     """Rotates half the hidden dims of the input."""
+     x1 = x[..., : x.shape[-1] // 2]
+     x2 = x[..., x.shape[-1] // 2 :]
+     return torch.cat((-x2, x1), dim=-1)
+
+
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+     """Applies Rotary Position Embedding to the query and key tensors.
+
+     Args:
+         q (`torch.Tensor`): The query tensor.
+         k (`torch.Tensor`): The key tensor.
+         cos (`torch.Tensor`): The cosine part of the rotary embedding.
+         sin (`torch.Tensor`): The sine part of the rotary embedding.
+         position_ids (`torch.Tensor`, *optional*):
+             Deprecated and unused.
+         unsqueeze_dim (`int`, *optional*, defaults to 1):
+             The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+             sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+             that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+             k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+             cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+             the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+     Returns:
+         `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
+     """
+     cos = cos.unsqueeze(unsqueeze_dim)
+     sin = sin.unsqueeze(unsqueeze_dim)
+     q_embed = (q * cos) + (rotate_half(q) * sin)
+     k_embed = (k * cos) + (rotate_half(k) * sin)
+     return q_embed, k_embed
+
+
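# --- Annotation (not part of the generated file) ---
# Toy check that apply_rotary_pos_emb above is a pure rotation, so per-position
# norms are preserved. cos/sin are built the same way the rotary module does;
# the base 10000 is the conventional default, used here only for illustration.
import torch
B, H, S, D = 1, 2, 5, 8
q, k = torch.randn(B, H, S, D), torch.randn(B, H, S, D)
inv_freq = 1.0 / (10000 ** (torch.arange(0, D, 2).float() / D))
freqs = torch.outer(torch.arange(S).float(), inv_freq)   # [S, D/2]
emb = torch.cat((freqs, freqs), dim=-1)                  # [S, D]
cos, sin = emb.cos()[None], emb.sin()[None]              # [1, S, D]
q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)      # unsqueeze_dim=1 broadcasts over heads
assert torch.allclose(q_rot.norm(dim=-1), q.norm(dim=-1), atol=1e-4)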
+ class PanguEmbeddedMLP(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.hidden_size = config.hidden_size
+         self.intermediate_size = config.intermediate_size
+         self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+         self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+         self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+         self.act_fn = ACT2FN[config.hidden_act]
+
+     def forward(self, x):
+         down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+         return down_proj
+
+
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+     """
+     This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+     num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+     """
+     batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+     if n_rep == 1:
+         return hidden_states
+     hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+     return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+
+
+ def eager_attention_forward(
+     module: nn.Module,
+     query: torch.Tensor,
+     key: torch.Tensor,
+     value: torch.Tensor,
+     attention_mask: Optional[torch.Tensor],
+     scaling: float,
+     dropout: float = 0.0,
+     **kwargs,
+ ):
+     key_states = repeat_kv(key, module.num_key_value_groups)
+     value_states = repeat_kv(value, module.num_key_value_groups)
+
+     attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
+     if attention_mask is not None:
+         causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+         attn_weights = attn_weights + causal_mask
+
+     attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
+     attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
+     attn_output = torch.matmul(attn_weights, value_states)
+     attn_output = attn_output.transpose(1, 2).contiguous()
+
+     return attn_output, attn_weights
+
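# --- Annotation (not part of the generated file) ---
# repeat_kv above is the grouped-query-attention helper used by
# eager_attention_forward: here 2 KV heads serve 8 query heads, so each KV head
# is tiled 4x along the head dimension before the score matmul.
import torch
kv = torch.randn(1, 2, 5, 8)      # (batch, num_key_value_heads, seq_len, head_dim)
out = repeat_kv(kv, n_rep=4)      # -> (1, 8, 5, 8)
assert out.shape == (1, 8, 5, 8)
assert torch.equal(out[0, 0], out[0, 3])  # query heads 0-3 all read KV head 0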
+ class PanguEmbeddedAttention(nn.Module):
+     """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+     def __init__(self, config: PanguEmbeddedConfig, layer_idx: int):
+         super().__init__()
+         self.config = config
+         self.layer_idx = layer_idx
+         self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+         self.num_heads = config.num_attention_heads
+         self.num_key_value_heads = config.num_key_value_heads
+         self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
+         self.scaling = self.head_dim**-0.5
+         self.attention_dropout = config.attention_dropout
+         self.is_causal = True
+
+         self.q_proj = nn.Linear(config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.bias)
+         self.k_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.bias)
+         self.v_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.bias)
+         self.o_proj = nn.Linear(config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.bias)
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         position_embeddings: tuple[torch.Tensor, torch.Tensor],
+         attention_mask: Optional[torch.Tensor],
+         past_key_value: Optional[Cache] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         **kwargs: Unpack[FlashAttentionKwargs],
+     ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
+         input_shape = hidden_states.shape[:-1]
+         hidden_shape = (*input_shape, -1, self.head_dim)
+
+         query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+         key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+         value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+
+         cos, sin = position_embeddings
+         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+         if past_key_value is not None:
+             # sin and cos are specific to RoPE models; cache_position needed for the static cache
+             cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+             key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+         attention_interface: Callable = eager_attention_forward
+         if self.config._attn_implementation != "eager":
+             attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+
+         if not self.training and NPU_ATTN_INFR:
+             q_len = input_shape[1]
+             if attention_mask is not None:
+                 attention_mask = ~attention_mask.bool()
+             elif q_len > 1:
+                 # With no padding mask, prefill still needs an explicit causal mask for the fused NPU kernel
+                 attention_mask = torch.triu(torch.ones([q_len, q_len]), diagonal=1).bool().unsqueeze(0).unsqueeze(0).to(query_states.device)
+
+             attn_output, _ = torch_npu.npu_fused_infer_attention_score(
+                 query_states, key_states, value_states,
+                 num_heads=self.num_heads, num_key_value_heads=self.num_key_value_heads,
+                 input_layout="BNSD", atten_mask=attention_mask, scale=self.scaling)
+             attn_output = attn_output.transpose(1, 2)
+             attn_weights = None
+         else:
+             attn_output, attn_weights = attention_interface(
+                 self,
+                 query_states,
+                 key_states,
+                 value_states,
+                 attention_mask,
+                 dropout=0.0 if not self.training else self.attention_dropout,
+                 scaling=self.scaling,
+                 **kwargs,
+             )
+
+         attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+         attn_output = self.o_proj(attn_output)
+         return attn_output, attn_weights
+
+
+ class PanguEmbeddedDecoderLayer(GradientCheckpointingLayer):
+     def __init__(self, config: PanguEmbeddedConfig, layer_idx: int):
+         super().__init__()
+         self.hidden_size = config.hidden_size
+         self.self_attn = PanguEmbeddedAttention(config=config, layer_idx=layer_idx)
+         self.mlp = PanguEmbeddedMLP(config)
+         self.input_layernorm = PanguEmbeddedRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.post_attention_layernorm = PanguEmbeddedRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Cache] = None,
+         output_attentions: Optional[bool] = False,
+         use_cache: Optional[bool] = False,
+         cache_position: Optional[torch.LongTensor] = None,
+         position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,  # necessary, but kept here for BC
+         **kwargs: Unpack[FlashAttentionKwargs],
+     ) -> tuple[torch.FloatTensor, Optional[tuple[torch.FloatTensor, torch.FloatTensor]]]:
+         residual = hidden_states
+         hidden_states = self.input_layernorm(hidden_states)
+
+         # Self Attention
+         hidden_states, self_attn_weights = self.self_attn(
+             hidden_states=hidden_states,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_value=past_key_value,
+             output_attentions=output_attentions,
+             use_cache=use_cache,
+             cache_position=cache_position,
+             position_embeddings=position_embeddings,
+             **kwargs,
+         )
+         hidden_states = residual + hidden_states
+
+         # Fully Connected
+         residual = hidden_states
+         hidden_states = self.post_attention_layernorm(hidden_states)
+         hidden_states = self.mlp(hidden_states)
+         hidden_states = residual + hidden_states
+
+         outputs = (hidden_states,)
+         if output_attentions:
+             outputs += (self_attn_weights,)
+
+         return outputs
+
+
+ @auto_docstring
+ class PanguEmbeddedPreTrainedModel(PreTrainedModel):
+     config_class = PanguEmbeddedConfig
+     base_model_prefix = "model"
+     supports_gradient_checkpointing = True
+     _no_split_modules = ["PanguEmbeddedDecoderLayer"]
+     _skip_keys_device_placement = ["past_key_values"]
+     _supports_flash_attn_3 = True
+     _supports_flash_attn_2 = True
+     _supports_sdpa = True
+     _supports_flex_attn = True
+     _supports_cache_class = True
+     _supports_quantized_cache = True
+     _supports_static_cache = True
+     _supports_attention_backend = True
+
+     def _init_weights(self, module):
+         std = self.config.initializer_range
+         if isinstance(module, nn.Linear):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.padding_idx is not None:
+                 module.weight.data[module.padding_idx].zero_()
+         elif isinstance(module, PanguEmbeddedRMSNorm):
+             module.weight.data.fill_(1.0)
+
+
+ @auto_docstring
+ class PanguEmbeddedModel(PanguEmbeddedPreTrainedModel):
+     def __init__(self, config: PanguEmbeddedConfig):
+         super().__init__(config)
+         self.padding_idx = config.pad_token_id
+         self.vocab_size = config.vocab_size
+
+         self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+         self.layers = nn.ModuleList(
+             [PanguEmbeddedDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+         )
+         self.norm = PanguEmbeddedRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.rotary_emb = PanguEmbeddedRotaryEmbedding(config=config)
+         self.gradient_checkpointing = False
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def get_input_embeddings(self):
+         return self.embed_tokens
+
+     def set_input_embeddings(self, value):
+         self.embed_tokens = value
+
+     @can_return_tuple
+     @auto_docstring
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[Cache] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         **flash_attn_kwargs: Unpack[FlashAttentionKwargs],
+     ) -> BaseModelOutputWithPast:
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+         if (input_ids is None) ^ (inputs_embeds is not None):
+             raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+
+         if self.gradient_checkpointing and self.training and use_cache:
+             logger.warning_once(
+                 "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
+             )
+             use_cache = False
+
+         # TODO (joao): remove this exception in v4.56 -- it exists for users that try to pass a legacy cache
+         if not isinstance(past_key_values, (type(None), Cache)):
+             raise ValueError("The `past_key_values` should be either a `Cache` object or `None`.")
+
+         if inputs_embeds is None:
+             inputs_embeds = self.embed_tokens(input_ids)
+
+         if use_cache and past_key_values is None:
+             past_key_values = DynamicCache()
+
+         if cache_position is None:
+             past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+             cache_position = torch.arange(
+                 past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+             )
+
+         if position_ids is None:
+             position_ids = cache_position.unsqueeze(0)
+
+         hidden_states = inputs_embeds
+
+         # create position embeddings to be shared across the decoder layers
+         position_embeddings = self.rotary_emb(hidden_states, position_ids)
+
+         # decoder layers
+         all_hidden_states = () if output_hidden_states else None
+         all_self_attns = () if output_attentions else None
+
+         for decoder_layer in self.layers[: self.config.num_hidden_layers]:
+             if output_hidden_states:
+                 all_hidden_states += (hidden_states,)
+
+             layer_outputs = decoder_layer(
+                 hidden_states,
+                 attention_mask=attention_mask,
+                 position_ids=position_ids,
+                 past_key_value=past_key_values,
+                 output_attentions=output_attentions,
+                 use_cache=use_cache,
+                 cache_position=cache_position,
+                 position_embeddings=position_embeddings,
+                 **flash_attn_kwargs,
+             )
+
+             hidden_states = layer_outputs[0]
+
+             if output_attentions:
+                 all_self_attns += (layer_outputs[1],)
+
+         hidden_states = self.norm(hidden_states)
+
+         # add hidden states from the last decoder layer
+         if output_hidden_states:
+             all_hidden_states += (hidden_states,)
+
+         return BaseModelOutputWithPast(
+             last_hidden_state=hidden_states,
+             past_key_values=past_key_values if use_cache else None,
+             hidden_states=all_hidden_states,
+             attentions=all_self_attns,
+         )
+
+
+ class KwargsForCausalLM(FlashAttentionKwargs, LossKwargs): ...
+
+
+ @auto_docstring
+ class PanguEmbeddedForCausalLM(PanguEmbeddedPreTrainedModel, GenerationMixin):
+     _tied_weights_keys = ["lm_head.weight"]
+     _tp_plan = {"lm_head": "colwise_rep"}
+     _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
+
+     def __init__(self, config):
+         super().__init__(config)
+         self.model = PanguEmbeddedModel(config)
+         self.vocab_size = config.vocab_size
+         self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def get_input_embeddings(self):
+         return self.model.embed_tokens
+
+     def set_input_embeddings(self, value):
+         self.model.embed_tokens = value
+
+     def get_output_embeddings(self):
+         return self.lm_head
+
+     def set_output_embeddings(self, new_embeddings):
+         self.lm_head = new_embeddings
+
+     def set_decoder(self, decoder):
+         self.model = decoder
+
+     def get_decoder(self):
+         return self.model
+
+     @can_return_tuple
+     @auto_docstring
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[Cache] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         labels: Optional[torch.LongTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         logits_to_keep: Union[int, torch.Tensor] = 0,
+         **kwargs: Unpack[KwargsForCausalLM],
+     ) -> CausalLMOutputWithPast:
+
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+
+         # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
+         outputs: BaseModelOutputWithPast = self.model(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_values=past_key_values,
+             inputs_embeds=inputs_embeds,
+             use_cache=use_cache,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             cache_position=cache_position,
+             **kwargs,
+         )
+
+         hidden_states = outputs.last_hidden_state
+         # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+         slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
+         logits = self.lm_head(hidden_states[:, slice_indices, :])
+
+         loss = None
+         if labels is not None:
+             loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
+
+         return CausalLMOutputWithPast(
+             loss=loss,
+             logits=logits,
+             past_key_values=outputs.past_key_values,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+
+
+ __all__ = [
+     "PanguEmbeddedForCausalLM",
+     "PanguEmbeddedModel",
+     "PanguEmbeddedPreTrainedModel",
+ ]
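For reference, a minimal end-to-end loading sketch for the files added in this commit. The repository path is a placeholder, and `trust_remote_code=True` is what routes `AutoModelForCausalLM`/`AutoTokenizer` to `modeling_openpangu_dense.py` and `tokenization_openpangu.py`:

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "path/to/openPangu-Embedded"  # placeholder: local dir or hub id of this repo
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, torch_dtype="auto")

inputs = tok("Hello", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))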
modular_openpangu_dense.py ADDED
@@ -0,0 +1,149 @@
+ # coding=utf-8
+ # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All Rights Reserved.
+ #
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+ # and OPT implementations in this library. It has been modified from its
+ # original forms to accommodate minor architectural differences compared
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ from typing import Callable, Optional
+
+ import torch
+ from torch import nn
+
+ # torch_npu is optional at import time; guard it the same way as the generated
+ # modeling_openpangu_dense.py so this file also imports on non-NPU hosts.
+ try:
+     import torch_npu
+     from torch_npu.contrib import transfer_to_npu
+     if "910" in torch.npu.get_device_name():
+         NPU_ATTN_INFR = True
+         print("[INFO] torch_npu detected. Using NPU fused infer attention.")
+     else:
+         NPU_ATTN_INFR = False
+ except ImportError:
+     NPU_ATTN_INFR = False
+
+ from transformers.cache_utils import Cache
+ from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
+ from transformers.processing_utils import Unpack
+ from transformers.utils import logging
+ from transformers.models.llama.modeling_llama import (
+     LlamaAttention,
+     LlamaDecoderLayer,
+     LlamaForCausalLM,
+     LlamaForSequenceClassification,
+     LlamaMLP,
+     LlamaModel,
+     apply_rotary_pos_emb,
+     eager_attention_forward,
+ )
+ from .configuration_openpangu_dense import PanguEmbeddedConfig
+
+
+ logger = logging.get_logger(__name__)
+
+
+ class PanguEmbeddedMLP(LlamaMLP):
+     def __init__(self, config):
+         super().__init__(config)
+         self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+         self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+         self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+
+
+ class PanguEmbeddedAttention(LlamaAttention):
+     def __init__(self, config: PanguEmbeddedConfig, layer_idx: int):
+         super().__init__(config, layer_idx)  # LlamaAttention.__init__ requires (config, layer_idx)
+         self.config = config
+         self.layer_idx = layer_idx
+         self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+         self.num_heads = config.num_attention_heads
+         self.num_key_value_heads = config.num_key_value_heads
+         self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
+         self.scaling = self.head_dim**-0.5
+         self.attention_dropout = config.attention_dropout
+         self.is_causal = True
+
+         self.q_proj = nn.Linear(config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.bias)
+         self.k_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.bias)
+         self.v_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.bias)
+         self.o_proj = nn.Linear(config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.bias)
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         position_embeddings: tuple[torch.Tensor, torch.Tensor],
+         attention_mask: Optional[torch.Tensor],
+         past_key_value: Optional[Cache] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         **kwargs: Unpack[FlashAttentionKwargs],
+     ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
+         input_shape = hidden_states.shape[:-1]
+         hidden_shape = (*input_shape, -1, self.head_dim)
+
+         query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+         key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+         value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+
+         cos, sin = position_embeddings
+         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+         if past_key_value is not None:
+             # sin and cos are specific to RoPE models; cache_position needed for the static cache
+             cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+             key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+         attention_interface: Callable = eager_attention_forward
+         if self.config._attn_implementation != "eager":
+             attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+
+         if not self.training and NPU_ATTN_INFR:
+             q_len = input_shape[1]
+             if attention_mask is not None:
+                 attention_mask = ~attention_mask.bool()
+             elif q_len > 1:
+                 attention_mask = torch.triu(torch.ones([q_len, q_len]), diagonal=1).bool().unsqueeze(0).unsqueeze(0).to(query_states.device)
+
+             attn_output, _ = torch_npu.npu_fused_infer_attention_score(
+                 query_states, key_states, value_states,
+                 num_heads=self.num_heads, num_key_value_heads=self.num_key_value_heads,
+                 input_layout="BNSD", atten_mask=attention_mask, scale=self.scaling)
+             attn_output = attn_output.transpose(1, 2)
+             attn_weights = None
+         else:
+             attn_output, attn_weights = attention_interface(
+                 self,
+                 query_states,
+                 key_states,
+                 value_states,
+                 attention_mask,
+                 dropout=0.0 if not self.training else self.attention_dropout,
+                 scaling=self.scaling,
+                 **kwargs,
+             )
+
+         attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+         attn_output = self.o_proj(attn_output)
+         return attn_output, attn_weights
+
+
+ class PanguEmbeddedDecoderLayer(LlamaDecoderLayer):
+     pass
+
+
+ class PanguEmbeddedModel(LlamaModel):
+     pass
+
+
+ class PanguEmbeddedForCausalLM(LlamaForCausalLM):
+     pass
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
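A small sketch of how the map above surfaces on a loaded tokenizer; the path is a placeholder, and this assumes the repo's tokenizer_config wires up the remote `PanguTokenizer` below:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/openPangu-Embedded", trust_remote_code=True)
print(tok.bos_token, tok.eos_token, tok.pad_token, tok.unk_token)  # per the map: <s> </s> <unk> <unk>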
tokenization_openpangu.py ADDED
@@ -0,0 +1,273 @@
+ # coding=utf-8
+ # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All Rights Reserved.
+ #
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+ # and OPT implementations in this library. It has been modified from its
+ # original forms to accommodate minor architectural differences compared
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ import os
+ from shutil import copyfile
+ from typing import Any, Dict, List, Optional, Tuple
+
+ import sentencepiece as spm
+
+ from transformers.tokenization_utils import PreTrainedTokenizer
+ from transformers.utils import logging
+
+
+ logger = logging.get_logger(__name__)
+
+ VOCAB_FILES_NAMES = {"vocab_file": "./tokenizer.model"}
+
+ PRETRAINED_VOCAB_FILES_MAP = {}
+
+
+ def convert_bool(string):
+     if isinstance(string, str):
+         if string.lower() == "true":
+             return True
+         elif string.lower() == "false":
+             return False
+         else:
+             return string
+     else:
+         return string
+
+
+ class PanguTokenizer(PreTrainedTokenizer):
+     """
+     Construct a tokenizer. Based on byte-level Byte-Pair-Encoding.
+
+     Args:
+         vocab_file (`str`):
+             Path to the vocabulary file.
+     """
+
+     vocab_files_names = VOCAB_FILES_NAMES
+     pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+     model_input_names = ["input_ids", "attention_mask"]
+     _auto_class = "AutoTokenizer"
+
+     def __init__(
+         self,
+         vocab_file,
+         unk_token="<unk>",
+         bos_token="<s>",
+         eos_token="</s>",
+         pad_token="</s>",
+         sp_model_kwargs: Optional[Dict[str, Any]] = None,
+         add_bos_token=True,
+         add_eos_token=False,
+         decode_with_prefix_space=False,
+         clean_up_tokenization_spaces=False,
+         **kwargs,
+     ):
+         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+         # Load the SentencePiece model before super().__init__, which may query vocab_size
+         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+         self.sp_model.Load(vocab_file)
+         super().__init__(
+             bos_token=bos_token,
+             eos_token=eos_token,
+             unk_token=unk_token,
+             pad_token=pad_token,
+             clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+             **kwargs,
+         )
+         self.vocab_file = vocab_file
+         self.add_bos_token = convert_bool(add_bos_token)
+         self.add_eos_token = add_eos_token
+         self.decode_with_prefix_space = decode_with_prefix_space
+         self._no_prefix_space_tokens = None
+
+     @property
+     def no_prefix_space_tokens(self):
+         if self._no_prefix_space_tokens is None:
+             vocab = self.convert_ids_to_tokens(list(range(self.vocab_size)))
+             self._no_prefix_space_tokens = {i for i, tok in enumerate(vocab) if not tok.startswith("▁")}
+         return self._no_prefix_space_tokens
+
+     @property
+     def vocab_size(self):
+         """Returns vocab size"""
+         return self.sp_model.get_piece_size()
+
+     @property
+     def bos_token_id(self) -> Optional[int]:
+         return self.sp_model.bos_id()
+
+     @property
+     def eos_token_id(self) -> Optional[int]:
+         return super().eos_token_id
+
+     def get_vocab(self):
+         """Returns vocab as a dict"""
+         vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
+         vocab.update(self.added_tokens_encoder)
+         return vocab
+
+     def _tokenize(self, text):
+         """Returns a tokenized string."""
+         return self.sp_model.encode(text, out_type=str)
+
+     def _convert_token_to_id(self, token):
+         """Converts a token (str) to an id using the vocab."""
+         return self.sp_model.piece_to_id(token)
+
+     def _convert_id_to_token(self, index):
+         """Converts an index (integer) to a token (str) using the vocab."""
+         token = self.sp_model.IdToPiece(index)
+         return token
+
+     def _maybe_add_prefix_space(self, tokens, decoded):
+         if tokens and tokens[0] not in self.no_prefix_space_tokens:
+             return " " + decoded
+         else:
+             return decoded
+
+     def convert_tokens_to_string(self, tokens):
+         """Converts a sequence of tokens (string) to a single string."""
+         current_sub_tokens = []
+         out_string = ""
+         prev_is_special = False
+         for token in tokens:
+             # make sure that special tokens are not decoded using sentencepiece model
+             if token in self.all_special_tokens:
+                 # Decode the current sub-tokens first
+                 if current_sub_tokens:
+                     out_string += self.sp_model.decode(current_sub_tokens)
+                     current_sub_tokens = []
+                 # Append the special token without adding extra spaces
+                 out_string += token
+                 prev_is_special = True
+             else:
+                 current_sub_tokens.append(token)
+                 prev_is_special = False
+         # Decode any remaining sub-tokens
+         if current_sub_tokens:
+             out_string += self.sp_model.decode(current_sub_tokens)
+         # Clean up leading and trailing spaces
+         if self.clean_up_tokenization_spaces:
+             out_string = self.clean_up_tokenization(out_string)
+         out_string = self._maybe_add_prefix_space(tokens=tokens, decoded=out_string)
+         return out_string[1:]
+
+     # Override decode so that spaces_between_special_tokens defaults to False
+     def decode(self,
+                token_ids,
+                spaces_between_special_tokens: bool = False,
+                **kwargs):
+         return super().decode(
+             token_ids=token_ids,
+             spaces_between_special_tokens=spaces_between_special_tokens,
+             **kwargs,
+         )
182
+
183
+ def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]:
184
+ """
185
+ Save the vocabulary and special tokens file to a directory.
186
+
187
+ Args:
188
+ save_directory (`str`):
189
+ The directory in which to save the vocabulary.
190
+
191
+ Returns:
192
+ `Tuple(str)`: Paths to the files saved.
193
+ """
194
+ if not os.path.isdir(save_directory):
195
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
196
+ return ("",)
197
+ out_vocab_file = os.path.join(
198
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
199
+ )
200
+
201
+ if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
202
+ copyfile(self.vocab_file, out_vocab_file)
203
+ elif not os.path.isfile(self.vocab_file):
204
+ with open(out_vocab_file, "wb") as fi:
205
+ content_spiece_model = self.sp_model.serialized_model_proto()
206
+ fi.write(content_spiece_model)
207
+
208
+ return (out_vocab_file,)
209
+
210
+ def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
211
+ if self.add_bos_token:
212
+ bos_token_ids = [self.bos_token_id]
213
+ else:
214
+ bos_token_ids = []
215
+
216
+ output = bos_token_ids + token_ids_0
217
+
218
+ if token_ids_1 is not None:
219
+ output = output + token_ids_1
220
+
221
+ if self.add_eos_token:
222
+ output = output + [self.eos_token_id]
223
+
224
+ return output
225
+
226
+ def get_special_tokens_mask(
227
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
228
+ ) -> List[int]:
229
+ """
230
+ Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
231
+ special tokens using the tokenizer `prepare_for_model` method.
232
+
233
+ Args:
234
+ token_ids_0 (`List[int]`):
235
+ List of IDs.
236
+ token_ids_1 (`List[int]`, *optional*):
237
+ Optional second list of IDs for sequence pairs.
238
+ already_has_special_tokens (`bool`, *optional*, defaults to `False`):
239
+ Whether or not the token list is already formatted with special tokens for the model.
240
+
241
+ Returns:
242
+ `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
243
+ """
244
+ if already_has_special_tokens:
245
+ return super().get_special_tokens_mask(
246
+ token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
247
+ )
248
+
249
+ if token_ids_1 is None:
250
+ return [1] + ([0] * len(token_ids_0)) + [1]
251
+ return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]
252
+
253
+ def create_token_type_ids_from_sequences(
254
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
255
+ ) -> List[int]:
256
+ """
257
+ Create a mask from the two sequences passed to be used in a sequence-pair classification task. T5 does not make
258
+ use of token type ids, therefore a list of zeros is returned.
259
+
260
+ Args:
261
+ token_ids_0 (`List[int]`):
262
+ List of IDs.
263
+ token_ids_1 (`List[int]`, *optional*):
264
+ Optional second list of IDs for sequence pairs.
265
+
266
+ Returns:
267
+ `List[int]`: List of zeros.
268
+ """
269
+ eos = [self.eos_token_id]
270
+
271
+ if token_ids_1 is None:
272
+ return len(token_ids_0 + eos) * [0]
273
+ return len(token_ids_0 + eos + token_ids_1 + eos) * [0]
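An illustrative usage sketch for the tokenizer above ("./openPangu" is a placeholder path; trust_remote_code=True is needed so AutoTokenizer can import tokenization_openpangu.PanguTokenizer via the auto_map in tokenizer_config.json):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./openPangu", trust_remote_code=True)
ids = tok("Hello, world!")["input_ids"]
print(ids[0] == tok.bos_token_id)  # True: add_bos_token defaults to true
print(tok.decode(ids, skip_special_tokens=True))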
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6b16f1558c0cd4ae6ef1a2c605713be0a514f50e1ce2d2c878979ce988c148ec
+ size 2477809
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"add_bos_token": true, "add_eos_token": false, "add_prefix_space": true, "added_tokens_decoder": {"0": {"content": "<unk>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "1": {"content": "<s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "2": {"content": "</s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45806": {"content": "<|User|>:", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45813": {"content": "<|Bot|>:", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45830": {"content": "[unused0]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45840": {"content": "[unused1]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45846": {"content": "[unused2]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45849": {"content": "[unused3]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45861": {"content": "[unused4]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45866": {"content": "[unused5]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45874": {"content": "[unused6]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45883": {"content": "[unused7]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45884": {"content": "[unused8]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45887": {"content": "[unused9]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45892": {"content": "[unused10]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45920": {"content": "[unused11]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45932": {"content": "[unused12]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45938": {"content": "[unused13]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45953": {"content": "[unused14]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45968": {"content": "[unused15]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45974": {"content": "[unused16]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45982": {"content": "[unused17]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "45986": {"content": "[unused18]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46005": {"content": "[unused19]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46007": {"content": "[unused20]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46014": {"content": "[unused21]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46017": {"content": "[unused22]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46028": {"content": "[unused23]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46032": {"content": "[unused24]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46081": {"content": "[unused25]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46086": {"content": "[unused26]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46101": {"content": "[unused27]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46183": {"content": "[unused28]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46230": {"content": "[unused29]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46245": {"content": "[unused30]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "46257": {"content": "[unused31]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "144208": {"content": "[unused32]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}, "144209": {"content": "[unused33]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}}, "auto_map": {"AutoTokenizer": ["tokenization_openpangu.PanguTokenizer", null]}, "bos_token": "<s>", "clean_up_tokenization_spaces": false, "eos_token": "</s>", "legacy": true, "model_max_length": 1000000000000000019884624838656, "pad_token": "<unk>", "sp_model_kwargs": {}, "spaces_between_special_tokens": false, "tokenizer_class": "PanguTokenizer", "unk_token": "<unk>", "use_default_system_prompt": false, "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '[unused9]系统:[unused10]' }}{% endif %}{% if message['role'] == 'system' %}{{ '[unused9]系统:' + message['content'] + '[unused10]' }}{% endif %}{% if message['role'] == 'assistant' %}{{'[unused9]助手:' + message['content'] + '[unused10]'}}{% endif %}{% if message['role'] == 'tool' %}{{'[unused9]工具:' + message['content'] + '[unused10]'}}{% endif %}{% if message['role'] == 'function' %}{{'[unused9]方法:' + message['content'] + '[unused10]'}}{% endif %}{% if message['role'] == 'user' %}{{'[unused9]用户:' + message['content'] + '[unused10]'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '[unused9]助手:' }}{% endif %}"}