zancato committed · Commit b5e2df3 · 0 Parent(s)

Super-squash branch 'main' using huggingface_hub
.gitattributes ADDED
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,259 @@
---
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
- hybrid
- ssm
- state-space-model
- linear-attention
- gated-kalmanet
- priming
- long-context
- reasoning
base_model: Qwen/Qwen3-32B
paper:
- https://arxiv.org/abs/2511.21016
---

# GKA-primed-HQwen3-32B-Reasoner

GKA-primed-HQwen3-32B-Reasoner is a Hybrid language model consisting of 50% Attention layers and 50% [Gated KalmaNet (GKA)](https://arxiv.org/abs/2511.21016) layers, primed from [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) using the [Hybrid Model Factory](https://github.com/awslabs/hybrid-model-factory) Priming pipeline. The model is trained for long-context reasoning and supports context lengths of up to 128K tokens.

GKA (pronounced "gee-ka") is a State-Space Model layer inspired by the Kalman Filter: it solves an online ridge regression problem at test time, with constant memory and compute cost linear in the sequence length.

By combining Attention with GKA, our Hybrid model achieves up to **2× faster inference** at long contexts while **closely matching the base Transformer's quality**.

## Links

- 📄 [Gated KalmaNet paper (CVPR 2026)](https://arxiv.org/abs/2511.21016)
- 💻 [GitHub: Hybrid Model Factory](https://github.com/awslabs/hybrid-model-factory)

## Why Hybrid?

Each Primed Hybrid model is initialized from a base Transformer by converting a portion of its Attention layers into State-Space Model (SSM) layers that maintain a fixed-size recurrent state instead of a growing KV cache. At a 50% Hybrid ratio, roughly half the KV cache (which grows linearly with sequence length) is replaced with fixed-size SSM state. The practical benefits:

- **Higher throughput at long contexts** — less memory spent on the KV cache means more memory available for batching
- **More concurrent sequences** — roughly 2× as many concurrent sequences before hitting memory limits
- **Growing advantage with context length** — at long contexts, Attention dominates the forward pass while SSM layers remain negligible in cost. Since the Hybrid model makes roughly half as many Attention calls as the base Transformer, the throughput advantage grows with context length

Increasing the hybridization ratio (replacing more Attention layers with SSM layers) further reduces memory use and increases throughput, typically at the expense of quality.
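
The KV-cache saving above can be sketched with back-of-envelope arithmetic using the numbers from this card (64 layers, 8 KV heads, head dimension 128, bf16). This is an illustration only; it ignores the fixed, comparatively small SSM state and all weight/activation memory.

```python
# Back-of-envelope KV-cache memory for the base Transformer vs. the 50% Hybrid.
# bf16 = 2 bytes; K and V each store seq_len * kv_heads * head_dim values per
# Attention layer. Illustrative only (ignores the small fixed SSM state).
def kv_cache_bytes(num_attn_layers, seq_len, kv_heads=8, head_dim=128, dtype_bytes=2):
    return num_attn_layers * 2 * seq_len * kv_heads * head_dim * dtype_bytes

seq_len = 128 * 1024
full = kv_cache_bytes(64, seq_len)    # all 64 layers are Attention
hybrid = kv_cache_bytes(32, seq_len)  # only 32 Attention layers remain

print(f"Transformer KV cache @128K: {full / 2**30:.1f} GiB per sequence")
print(f"Hybrid KV cache      @128K: {hybrid / 2**30:.1f} GiB per sequence")
```

Halving the per-sequence KV cache is what frees memory for roughly twice as many concurrent sequences at the same budget.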

## Model Overview

- **Type**: Causal Language Model (Hybrid Attention + SSM)
- **Base Model**: Qwen3-32B
- **Hybrid Layer Type**: Gated KalmaNet (GKA)
- **Hybrid Ratio**: 50% (32 Attention + 32 GKA layers)
- **Parameters**: ~32B
- **Context Length**: 128K natively
- **Precision**: bfloat16
- **License**: Apache 2.0
## Benchmark Results

We consider the following Transformer baseline:

1. **Qwen3-32B (thinking, from HF)**: the original Qwen model evaluated in thinking mode, the intended mode for reasoning tasks. This is the base Transformer from which we start the Priming procedure.

### Reasoning Benchmarks

Evaluations cover math reasoning (AIME24/25), science (GPQA), coding (LiveCodeBench-v5, SciCode), tool-calling (BFCLv3/v4), and instruction following (IFBench), and are run with the [Nemo Evaluator SDK](https://docs.nvidia.com/nemo/evaluator/latest/). We provide the evaluation configuration [examples/evaluation/nemo_reasoning_evals.yaml](https://github.com/awslabs/hybrid-model-factory/blob/main/examples/evaluation/nemo_reasoning_evals.yaml) for reproducibility. All evaluations use a 64K generation length.

| Model | AIME24 | AIME25 | GPQA | LiveCodeBench-v5 | BFCLv4 (minus web-search) | BFCLv3 | IFBench | SciCode | Average |
|-------|--------|--------|------|------------------|---------------------------|--------|---------|---------|---------|
| Qwen3-32B (thinking, from HF) | 86.33 | 70.00 | 65.40 | 64.44 | 69.30 | 69.57 | 32.61 | 15.94 | 59.20 |
| GKA-primed-HQwen3-32B-Reasoner | 87.67 | 81.67 | 67.30 | 70.24 | 70.14 | 66.34 | 48.22 | 12.34 | 62.99 |

*For BFCLv4, we remove the web-search subtask and weight each task by its number of entries (test examples):* \\(\text{Overall Accuracy} = \sum_i \left(\text{accuracy}_i \times \text{num\_entries}_i\right) / \sum_i \text{num\_entries}_i\\)
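
The entry-weighted aggregation above amounts to a few lines of code. The task accuracies and counts in this sketch are made-up placeholders, not the real BFCLv4 data:

```python
# Entry-weighted overall accuracy, matching the BFCLv4 aggregation formula:
# sum_i(accuracy_i * num_entries_i) / sum_i(num_entries_i).
def overall_accuracy(tasks):
    """tasks: list of (accuracy, num_entries) pairs."""
    total_entries = sum(n for _, n in tasks)
    return sum(acc * n for acc, n in tasks) / total_entries

tasks = [(0.80, 200), (0.60, 100)]  # hypothetical subtasks
print(overall_accuracy(tasks))  # weighted toward the larger subtask
```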

**How close is the Hybrid model to the Transformer baseline on complex reasoning tasks?**
Our Primed GKA Hybrid **outperforms the Qwen3-32B (thinking, from HF) baseline** by ~3.8 points on average, despite using [<0.5% of the base Transformer's pre-training token budget](#training-data). This is enabled by our multi-stage reasoning training pipeline coupled with the higher throughput of Hybrid architectures.

## About Gated KalmaNet (GKA)

Gated KalmaNet is a recently proposed SSM layer that is more expressive than both Mamba2 and Gated DeltaNet (GDN). GKA achieves this by employing the Kalman Filter to compute the optimal state at each time step based on the entire past. In contrast, SSMs like Mamba2 and GDN rely on instantaneous objectives (which depend solely on the current input and a lossy estimate of the past) to compute their state.
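
The test-time ridge-regression view can be illustrated with a deliberately simplified scalar sketch. This is *not* the real GKA layer (which uses multi-head vector states, gating, and an iterative Chebyshev solver); it only shows how a state that is optimal with respect to the entire past can still be maintained with constant memory:

```python
# Deliberately simplified scalar analogue of an online ridge-regression state:
#   state_t = argmin_w  sum_{i<=t} (w * k_i - v_i)^2 + lam * w^2
# Two running scalars are sufficient statistics, so memory stays constant
# regardless of sequence length.
def online_ridge_states(keys, values, lam=0.02):
    kk = 0.0  # running sum of k_i^2
    kv = 0.0  # running sum of k_i * v_i
    states = []
    for k, v in zip(keys, values):
        kk += k * k
        kv += k * v
        states.append(kv / (kk + lam))  # closed-form ridge solution so far
    return states

# If v_i = 2 * k_i exactly, the state approaches the true slope 2
# (shrunk slightly toward zero by the ridge penalty lam).
keys = [1.0, 2.0, 3.0, 4.0]
print(online_ridge_states(keys, [2.0 * k for k in keys]))
```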

Unlike other SSM-based hybrid layers, GKA gives you a runtime knob for trading accuracy against speed, with no retraining or architecture changes. The `num_iter` parameter controls how many iterations the GKA solver runs during inference. No other hybrid layer type offers this: GDN and Mamba2 have fixed compute per layer, so their speed is fixed a priori. GKA lets you slide along the compute–latency curve per deployment, making it uniquely suited to scenarios where different endpoints or traffic tiers have different latency budgets.

For details on controlling GKA's compute–speed tradeoff at serving time via `num_iter`, see [GKA Compute Control](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md#gka-compute-control-num_iter); for more details on the modelling choices, see the [GKA paper](https://arxiv.org/abs/2511.21016).

This release includes optimized Triton kernels for GKA's Chebyshev solver, enabling the throughput numbers reported in [Inference Efficiency](#inference-efficiency). The training kernels live in [`training/.../gated_kalmanet/ops/chebyshev/`](https://github.com/awslabs/hybrid-model-factory/tree/main/training/src/hmf/model/hybrid_zoo/layers/gated_kalmanet/ops/chebyshev) and the inference kernels in [`vllm-inference/.../gka/ops/`](https://github.com/awslabs/hybrid-model-factory/tree/main/vllm-inference/src/primed_vllm/gka/ops).

### Architecture Details

| Component | Details |
|-----------|---------|
| Number of Layers | 64 (32 Attention + 32 GKA) |
| Hidden Dimension | 5120 |
| Attention Heads | 64 (Q) / 8 (KV) |
| Head Dimension | 128 |
| Intermediate Dimension (FFN) | 25600 |
| Vocabulary Size | 151,936 |
| Position Encoding | RoPE (θ = 5,000,000) |
| Layer Layout | GKA layer indices were selected with our [*selective hybridization*](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/LayerSelection.md) procedure |

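The per-layer layout is recorded in this repo's `config.json` as `hybrid_override_pattern`, where `*` marks an Attention layer and `GKA` a Gated KalmaNet layer. A quick check confirms the 50/50 split:

```python
# hybrid_override_pattern from this repo's config.json
# ("*" = Attention layer, "GKA" = Gated KalmaNet layer).
pattern = ("*-*-GKA-GKA-*-*-*-GKA-GKA-GKA-GKA-GKA-GKA-GKA-GKA-GKA-GKA-*-*-*-"
           "GKA-*-GKA-GKA-GKA-*-GKA-GKA-GKA-GKA-*-*-GKA-GKA-*-GKA-GKA-*-*-"
           "GKA-GKA-GKA-*-GKA-GKA-*-*-*-*-*-*-*-*-*-*-*-*-GKA-*-*-GKA-*-*-GKA")
layers = pattern.split("-")
print(len(layers), layers.count("GKA"), layers.count("*"))  # 64 32 32
```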
### Trade-off inference FLOPs for accuracy

As discussed above, GKA offers the unique ability to adjust inference FLOPs by tuning the `num_iter` parameter. Here we summarize reasoning performance across different `num_iter` settings.

| Model | Avg. Reasoning Performance |
|---------------------------------------------------------|----------------------------|
| GKA-primed-HQwen3-32B-Reasoner (`num_iter=30`, default) | 62.99 |
| GKA-primed-HQwen3-32B-Reasoner (`num_iter=10`) | 62.63 |
| GKA-primed-HQwen3-32B-Reasoner (`num_iter=5`) | 61.77 |
| GKA-primed-HQwen3-32B-Reasoner (`num_iter=1`) | 60.04 |
| Qwen3-32B (thinking, from HF) | 59.20 |

For most practical scenarios, we recommend `num_iter=10` as the best trade-off. See the next section for the inference gains from reducing the number of iterations.

> [!NOTE]
> Interestingly, setting `num_iter=0` effectively converts the GKA model into a Gated Linear Attention (GLA) model. One can therefore think of increasing `num_iter` as iteratively improving upon the GLA model's initial solution.
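
The role of an iteration-count knob can be pictured with a generic iterative linear solver. The sketch below uses a plain Richardson iteration on a tiny ridge system, *not* the actual Chebyshev kernel: iteration count trades residual accuracy against compute, and the zero-iteration initial guess plays the role of the cheap GLA-style starting point.

```python
# Generic illustration of an iteration-count knob (NOT the actual Chebyshev
# solver): plain Richardson iteration for the ridge system (A + lam*I) x = b.
# More iterations shrink the residual, at proportionally more compute.
def solve_ridge(A, b, lam, num_iter, step=0.1):
    n = len(b)
    x = [0.0] * n  # cheap initial guess (think of it as the num_iter=0 answer)
    for _ in range(num_iter):
        # residual r = b - (A + lam*I) x
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) - lam * x[i]
             for i in range(n)]
        x = [xi + step * ri for xi, ri in zip(x, r)]
    return x

A = [[2.0, 0.5], [0.5, 1.0]]
b = [1.0, 1.0]
residuals = {}
for it in (1, 5, 30):
    x = solve_ridge(A, b, lam=0.02, num_iter=it)
    r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) - 0.02 * x[i] for i in range(2)]
    residuals[it] = max(abs(v) for v in r)
    print(it, residuals[it])  # residual shrinks as num_iter grows
```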

### Inference Efficiency

Sustained decode throughput (tokens/s) on 8× H200 GPUs (TP=8), measured during pure decode with a saturated KV cache. Benchmarked with random data (no prefix-caching benefits). See the full [Inference guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md#performance-benchmarks) for methodology and additional models.

| Model | 16K | 32K | 64K | 128K |
|-----------------------------------------------------------|---------------|---------------|---------------|---------------|
| GKA-primed-HQwen3-32B-Reasoner (`num_iter=30`, default) | 6,810 (1.29×) | 4,152 (1.45×) | 2,385 (1.82×) | 1,168 (1.99×) |
| GKA-primed-HQwen3-32B-Reasoner (`num_iter=10`) | 7,778 (1.47×) | 4,534 (1.58×) | 2,537 (1.94×) | 1,200 (2.05×) |
| GKA-primed-HQwen3-32B-Reasoner (`num_iter=5`) | 8,039 (1.52×) | 4,621 (1.61×) | 2,569 (1.96×) | 1,206 (2.06×) |
| GKA-primed-HQwen3-32B-Reasoner (`num_iter=1`) | 8,177 (1.54×) | 4,678 (1.63×) | 2,593 (1.98×) | 1,210 (2.06×) |
| GDN-primed-HQwen3-32B | 8,133 (1.53×) | 4,876 (1.70×) | 2,688 (2.06×) | 1,238 (2.11×) |
| Qwen3-32B (thinking, from HF) | 5,299 | 2,865 | 1,308 | 586 |

Mean TTFT at the Transformer's saturated batch size (the Hybrid model has memory to spare):

| Model | 16K | 32K | 64K | 128K |
|-----------------------------------------------------------|-------------------|-------------------|-------------------|-------------------|
| GKA-primed-HQwen3-32B-Reasoner (`num_iter=30`, default) | 52,053 ms (1.32×) | 58,613 ms (1.21×) | 68,241 ms (1.05×) | 84,935 ms (0.90×) |
| GKA-primed-HQwen3-32B-Reasoner (`num_iter=10`) | 48,560 ms (1.23×) | 55,039 ms (1.13×) | 64,766 ms (0.99×) | 81,410 ms (0.86×) |
| GKA-primed-HQwen3-32B-Reasoner (`num_iter=5`) | 47,958 ms (1.22×) | 54,320 ms (1.12×) | 63,826 ms (0.98×) | 80,369 ms (0.85×) |
| GKA-primed-HQwen3-32B-Reasoner (`num_iter=1`) | 46,726 ms (1.19×) | 53,061 ms (1.09×) | 62,645 ms (0.96×) | 79,321 ms (0.84×) |
| GDN-primed-HQwen3-32B | 42,492 ms (1.08×) | 48,417 ms (1.00×) | 57,525 ms (0.88×) | 73,145 ms (0.77×) |
| Qwen3-32B (thinking, from HF) | 39,421 ms | 48,527 ms | 65,104 ms | 94,479 ms |

The decode throughput advantage grows with context length — from 1.29× at 16K to 1.99× at 128K (2.06× with `num_iter=1`) — thanks to GKA layers maintaining a fixed-size recurrent state instead of a growing KV cache. TTFT crosses over at long contexts: GKA prefills 10–16% faster than the Transformer at 128K, depending on `num_iter`. Reducing `num_iter` progressively improves both decode throughput and TTFT, with the effect more pronounced at 32B than at 8B. See [Trade-off inference FLOPs for accuracy](#trade-off-inference-flops-for-accuracy) for details.
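
The quoted speedup factors follow directly from the decode-throughput table; recomputing them from the default `num_iter=30` row:

```python
# Decode-throughput speedups for the default (num_iter=30) Hybrid over the
# Qwen3-32B baseline, recomputed from the throughput table above (tokens/s).
hybrid = {"16K": 6810, "32K": 4152, "64K": 2385, "128K": 1168}
baseline = {"16K": 5299, "32K": 2865, "64K": 1308, "128K": 586}

speedups = {ctx: round(hybrid[ctx] / baseline[ctx], 2) for ctx in hybrid}
print(speedups)  # {'16K': 1.29, '32K': 1.45, '64K': 1.82, '128K': 1.99}
```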


## Usage

### With vLLM (recommended)

Install the [Hybrid Model Factory vLLM plugin](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md#docker-recommended) in your local environment, then serve:

```bash
vllm serve amazon/GKA-primed-HQwen3-32B-Reasoner \
  --enable-prefix-caching \
  --mamba-cache-mode align \
  --mamba-cache-dtype float32 \
  --mamba-ssm-cache-dtype float32 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --reasoning-parser qwen3
```

Query the server:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "amazon/GKA-primed-HQwen3-32B-Reasoner",
    "messages": [
      {"role": "user", "content": "What is Linear Attention in the context of LLMs?"}
    ],
    "temperature": 1.0,
    "top_p": 1.0
  }'
```

> [!TIP]
> The `--mamba-cache-dtype float32` and `--mamba-ssm-cache-dtype float32` flags are important for accurate long-context generation. See the [Inference guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md#recommended-flags-for-hybrid-models) for details on all recommended flags.

> [!TIP]
> As with [NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16), we recommend `temperature=1.0` and `top_p=1.0` for generic reasoning tasks (e.g. math, science), and `temperature=0.6`, `top_p=0.95` for tool-calling.

#### Thinking Versus Non-thinking Setting

Our reasoning model supports thinking on/off modes. When thinking mode is on, the model reasons in a segment delimited by `<think>` and `</think>` (extracted by the reasoning parser) before producing a response. This helps on difficult queries and increases response quality at the expense of higher latency. Thinking mode is enabled by default, but it can be turned off via the chat template.

To query the model with thinking mode *off*, pass `enable_thinking: false` through `chat_template_kwargs`:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "amazon/GKA-primed-HQwen3-32B-Reasoner",
    "messages": [
      {"role": "user", "content": "What is Linear Attention in the context of LLMs?"}
    ],
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```

### With HuggingFace Transformers

> [!WARNING]
> Due to the long generations produced by reasoning models, the lower latency of vLLM makes it preferable to HuggingFace Transformers for evaluations and production settings. We recommend HuggingFace generation primarily for quick debugging and testing.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import hmf.model.hybrid_zoo.models.model_register  # Register Hybrid models

model = AutoModelForCausalLM.from_pretrained(
    "amazon/GKA-primed-HQwen3-32B-Reasoner", trust_remote_code=True
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("amazon/GKA-primed-HQwen3-32B-Reasoner")

messages = [{"role": "user", "content": "What is linear attention in the context of LLMs?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=65536, temperature=1.0, top_p=1.0)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

To turn thinking mode off, specify `enable_thinking=False` when applying the chat template:
```python
messages = [{"role": "user", "content": "What is linear attention in the context of LLMs?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```

## Training data

These models were produced with the multi-stage Priming pipeline from [Hybrid Model Factory](https://github.com/awslabs/hybrid-model-factory). Training data spans web documents, mathematics, long-context documents, and instruction-following and reasoning examples, each targeting a different capability axis. This diversity is critical: it allows the Priming procedure to convert a base Transformer into a more memory- and compute-efficient Hybrid architecture at nearly the same level of performance, using <0.5% of the base Transformer's pre-training token budget.


## Responsible AI Considerations

At Amazon, we are committed to developing AI responsibly and take a people-centric approach that prioritizes education, science, and our customers, integrating responsible AI across the end-to-end AI lifecycle. We believe the use of AI must respect the rule of law and human rights, and we encourage the safe and responsible development of AI. When this model is downloaded or used in accordance with the [AWS Responsible AI Policy](https://aws.amazon.com/ai/responsible-ai/policy/), developers should work with their internal model team to ensure it meets the requirements of the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities, or Amazon AI concerns [here](https://pages.awscloud.com/global-ln-gc-400-ai-service-cards-contact-us-registration.html).

## Citation

```bibtex
@software{hybrid_model_factory,
  title = {Hybrid Model Factory},
  year  = {2026},
  url   = {https://github.com/awslabs/hybrid-model-factory}
}

@inproceedings{gka2026,
  title     = {Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression},
  year      = {2026},
  booktitle = {CVPR},
  url       = {https://arxiv.org/abs/2511.21016}
}
```

## License

This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
chat_template.jinja ADDED
@@ -0,0 +1,102 @@
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n' }}
{%- endif %}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- 'detailed thinking off\n\n' }}
{%- else %}
{{- 'detailed thinking on\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>" }}
{{- '<|im_end|>\n' }}
{%- else %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n' }}
{%- endif %}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- 'detailed thinking off' }}
{%- else %}
{{- 'detailed thinking on' }}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- if message.content is string %}
{%- set content = message.content %}
{%- else %}
{%- set content = '' %}
{%- endif %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set reasoning_content = '' %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index %}
{%- if loop.last or (not loop.last and reasoning_content) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- endif %}
{%- endif %}
config.json ADDED
@@ -0,0 +1,121 @@
{
  "architectures": [
    "HybridQwen3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "dtype": "bfloat16",
  "eos_token_id": 151645,
  "gka_config": {
    "bp_lambda": true,
    "chunk_size": 64,
    "conv_size": 4,
    "gla_rescale": true,
    "head_dim": 128,
    "hidden_size": 5120,
    "norm_eps": 1e-06,
    "num_iter": 30,
    "num_k_heads": 8,
    "num_q_heads": 64,
    "num_v_heads": 8,
    "ridge_strength": 0.02,
    "solver_type": "chebyshev",
    "use_alpha_connection": true,
    "use_beta_gate": true,
    "use_forgetting_gate": true,
    "use_forgetting_gate_kk": true,
    "use_gate": true,
    "use_v_conv": true
  },
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "hybrid_override_pattern": "*-*-GKA-GKA-*-*-*-GKA-GKA-GKA-GKA-GKA-GKA-GKA-GKA-GKA-GKA-*-*-*-GKA-*-GKA-GKA-GKA-*-GKA-GKA-GKA-GKA-*-*-GKA-GKA-*-GKA-GKA-*-*-GKA-GKA-GKA-*-GKA-GKA-*-*-*-*-*-*-*-*-*-*-*-*-GKA-*-*-GKA-*-*-GKA",
  "initializer_range": 0.02,
  "intermediate_size": 25600,
  "layer_types": [
    "*",
    "*",
    "GKA",
    "GKA",
    "*",
    "*",
    "*",
    "GKA",
    "GKA",
    "GKA",
    "GKA",
    "GKA",
    "GKA",
    "GKA",
    "GKA",
    "GKA",
    "GKA",
    "*",
    "*",
    "*",
    "GKA",
    "*",
    "GKA",
    "GKA",
    "GKA",
    "*",
    "GKA",
    "GKA",
    "GKA",
    "GKA",
    "*",
    "*",
    "GKA",
    "GKA",
    "*",
    "GKA",
    "GKA",
    "*",
    "*",
    "GKA",
    "GKA",
    "GKA",
    "*",
    "GKA",
    "GKA",
    "*",
    "*",
    "*",
    "*",
    "*",
    "*",
    "*",
    "*",
    "*",
    "*",
    "*",
    "*",
    "GKA",
    "*",
    "*",
    "GKA",
    "*",
    "*",
    "GKA"
  ],
  "max_position_embeddings": 131072,
  "max_window_layers": 64,
  "model_type": "hybrid_qwen3",
  "num_attention_heads": 64,
  "num_hidden_layers": 64,
  "num_key_value_heads": 8,
  "pad_token_id": null,
  "rms_norm_eps": 1e-06,
  "rope_parameters": {
    "rope_theta": 5000000,
    "rope_type": "default"
  },
  "sliding_window": null,
  "tie_word_embeddings": false,
  "transformers_version": "5.3.0",
  "use_cache": false,
  "use_sliding_window": false,
  "vocab_size": 151936
}
generation_config.json ADDED
@@ -0,0 +1,7 @@
{
  "_from_model_config": true,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "transformers_version": "4.51.3",
  "use_cache": false
}
model-00001-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:478cec5089c8f65c97ebdfcbba77f82bda3c26a3892ee0a49995010b4542db14
size 4829518760
model-00002-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b475b8aea354cc8ac54ee23ca0fe44f7dfcbe6c81b97ad1364ea1bd3d92ffa9a
size 4785716232
model-00003-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2c35391cbea643bec0be9c48c7c95b2a2f777f574e8ecdbf0f58ff82cd88accd
size 4768821432
model-00004-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6cb6ff1a978dfae5947c5f0b3f4587eb40335e08547470b13c3da1d484f9bd2f
size 4970102496
model-00005-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:684b9dffd9d8cc64df58a62b374ab3a0a61dd11b547ad504592d16a60f4ac376
size 4773200928
model-00006-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9ad9e54dce88233680e44c7ff8fc99f12a21326e4f239fe150f041bd1d99aa97
size 4945030664
model-00007-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:71d5076ceee0f0c0e30fdb093bcfba3a0b0c05cfbc49dff433c73d323dc097c4
size 4884166672
model-00008-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:40452ed9c358aa4f265a21076b6e6b1a0f8104d12b24d8d60004bd76d0a51a2e
size 4945071672
model-00009-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dc3d567871f9920acd67c20823b4a3a13d1500e889165f45b4d2aed9432ea5c7
size 4871651688
model-00010-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:99b013d4b035929154f1cacd3ad36cb7d7b5a43289dfccb99abe66568d1930e7
size 4871610264
model-00011-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:40d49a3c53df9a3972f57f329c40eff612b840b928417f35780b72e6f11025fc
size 4875989720
model-00012-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f5e12c0ff19cc950cc76a7416a9927916285f4ac4b304361a4ed5b84c58b3d5a
size 4875989720
model-00013-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6eb490d766b6b254b5478020df5096dd96874fb0f63fad5a4d1fc0576cfba00f
size 4773200928
model-00014-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9db6929d470db5d6abccde0d233cc02794de60585faa7983278bb11ae072905b
size 3548363864
model-00015-of-00015.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:32a177f7e42cdf5d80fa75d653b87965c786f8f0458164d4c754b62e8171d670
size 1555824736
model.safetensors.index.json ADDED
@@ -0,0 +1,1034 @@
+ {
+ "metadata": {
+ "total_size": 68274145280
+ },
+ "weight_map": {
+ "lm_head.weight": "model-00015-of-00015.safetensors",
+ "model.embed_tokens.weight": "model-00001-of-00015.safetensors",
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00015.safetensors",
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00015.safetensors",
+ "model.layers.0.self_attn.k_norm.weight": "model-00001-of-00015.safetensors",
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.0.self_attn.q_norm.weight": "model-00001-of-00015.safetensors",
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00015.safetensors",
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00015.safetensors",
+ "model.layers.1.self_attn.k_norm.weight": "model-00001-of-00015.safetensors",
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.1.self_attn.q_norm.weight": "model-00001-of-00015.safetensors",
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.10.input_layernorm.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.mlp.down_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.mlp.gate_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.mlp.up_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.A_log": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.a_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.alpha_proj.bias": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.alpha_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.b_proj.bias": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.b_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.dt_bias": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.g_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.k_conv1d.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.k_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.o_norm.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.o_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.q_conv1d.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.q_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.v_conv1d.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.gka.v_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.10.post_attention_layernorm.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.input_layernorm.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.mlp.down_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.mlp.gate_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.mlp.up_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.A_log": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.a_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.alpha_proj.bias": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.alpha_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.b_proj.bias": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.b_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.dt_bias": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.g_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.k_conv1d.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.k_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.o_norm.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.o_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.q_conv1d.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.q_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.v_conv1d.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.gka.v_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.11.post_attention_layernorm.weight": "model-00003-of-00015.safetensors",
+ "model.layers.12.input_layernorm.weight": "model-00003-of-00015.safetensors",
+ "model.layers.12.mlp.down_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.12.mlp.gate_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.12.mlp.up_proj.weight": "model-00003-of-00015.safetensors",
+ "model.layers.12.gka.A_log": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.a_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.alpha_proj.bias": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.alpha_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.b_proj.bias": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.b_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.dt_bias": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.g_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.k_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.k_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.o_norm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.o_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.q_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.q_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.v_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.12.gka.v_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.12.post_attention_layernorm.weight": "model-00003-of-00015.safetensors",
+ "model.layers.13.input_layernorm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.mlp.down_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.mlp.gate_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.mlp.up_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.A_log": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.a_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.alpha_proj.bias": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.alpha_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.b_proj.bias": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.b_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.dt_bias": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.g_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.k_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.k_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.o_norm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.o_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.q_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.q_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.v_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.gka.v_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.13.post_attention_layernorm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.input_layernorm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.mlp.down_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.mlp.gate_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.mlp.up_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.A_log": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.a_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.alpha_proj.bias": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.alpha_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.b_proj.bias": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.b_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.dt_bias": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.g_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.k_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.k_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.o_norm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.o_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.q_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.q_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.v_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.gka.v_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.14.post_attention_layernorm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.input_layernorm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.mlp.down_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.mlp.gate_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.mlp.up_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.A_log": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.a_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.alpha_proj.bias": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.alpha_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.b_proj.bias": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.b_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.dt_bias": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.g_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.k_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.k_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.o_norm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.o_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.q_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.q_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.v_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.gka.v_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.15.post_attention_layernorm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.input_layernorm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.mlp.down_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.mlp.gate_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.mlp.up_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.A_log": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.a_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.alpha_proj.bias": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.alpha_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.b_proj.bias": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.b_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.dt_bias": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.g_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.k_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.k_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.o_norm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.o_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.q_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.q_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.v_conv1d.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.gka.v_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.16.post_attention_layernorm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.17.input_layernorm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.17.mlp.down_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.17.mlp.gate_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.17.mlp.up_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.17.post_attention_layernorm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.17.self_attn.k_norm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.17.self_attn.k_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.17.self_attn.o_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.17.self_attn.q_norm.weight": "model-00004-of-00015.safetensors",
+ "model.layers.17.self_attn.q_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.17.self_attn.v_proj.weight": "model-00004-of-00015.safetensors",
+ "model.layers.18.input_layernorm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.18.mlp.down_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.18.mlp.gate_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.18.mlp.up_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.18.post_attention_layernorm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.18.self_attn.k_norm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.18.self_attn.k_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.18.self_attn.o_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.18.self_attn.q_norm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.18.self_attn.q_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.18.self_attn.v_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.19.input_layernorm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.19.mlp.down_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.19.mlp.gate_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.19.mlp.up_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.19.post_attention_layernorm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.19.self_attn.k_norm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.19.self_attn.k_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.19.self_attn.o_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.19.self_attn.q_norm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.19.self_attn.q_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.19.self_attn.v_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.A_log": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.a_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.alpha_proj.bias": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.alpha_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.b_proj.bias": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.b_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.dt_bias": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.g_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.k_conv1d.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.k_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.o_norm.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.o_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.q_conv1d.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.q_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.v_conv1d.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.gka.v_proj.weight": "model-00001-of-00015.safetensors",
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00015.safetensors",
+ "model.layers.20.input_layernorm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.mlp.down_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.mlp.gate_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.mlp.up_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.A_log": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.a_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.alpha_proj.bias": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.alpha_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.b_proj.bias": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.b_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.dt_bias": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.g_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.k_conv1d.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.k_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.o_norm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.o_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.q_conv1d.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.q_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.v_conv1d.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.gka.v_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.20.post_attention_layernorm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.21.input_layernorm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.21.mlp.down_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.21.mlp.gate_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.21.mlp.up_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.21.post_attention_layernorm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.21.self_attn.k_norm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.21.self_attn.k_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.21.self_attn.o_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.21.self_attn.q_norm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.21.self_attn.q_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.21.self_attn.v_proj.weight": "model-00005-of-00015.safetensors",
+ "model.layers.22.input_layernorm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.22.mlp.down_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.mlp.gate_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.mlp.up_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.A_log": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.a_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.alpha_proj.bias": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.alpha_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.b_proj.bias": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.b_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.dt_bias": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.g_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.k_conv1d.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.k_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.o_norm.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.o_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.q_conv1d.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.q_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.v_conv1d.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.gka.v_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.22.post_attention_layernorm.weight": "model-00005-of-00015.safetensors",
+ "model.layers.23.input_layernorm.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.mlp.down_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.mlp.gate_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.mlp.up_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.A_log": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.a_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.alpha_proj.bias": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.alpha_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.b_proj.bias": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.b_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.dt_bias": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.g_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.k_conv1d.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.k_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.o_norm.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.o_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.q_conv1d.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.q_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.v_conv1d.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.gka.v_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.23.post_attention_layernorm.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.input_layernorm.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.mlp.down_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.mlp.gate_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.mlp.up_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.A_log": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.a_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.alpha_proj.bias": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.alpha_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.b_proj.bias": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.b_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.dt_bias": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.g_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.k_conv1d.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.k_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.o_norm.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.o_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.q_conv1d.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.q_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.v_conv1d.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.gka.v_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.24.post_attention_layernorm.weight": "model-00006-of-00015.safetensors",
+ "model.layers.25.input_layernorm.weight": "model-00006-of-00015.safetensors",
+ "model.layers.25.mlp.down_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.25.mlp.gate_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.25.mlp.up_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.25.post_attention_layernorm.weight": "model-00006-of-00015.safetensors",
+ "model.layers.25.self_attn.k_norm.weight": "model-00006-of-00015.safetensors",
+ "model.layers.25.self_attn.k_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.25.self_attn.o_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.25.self_attn.q_norm.weight": "model-00006-of-00015.safetensors",
+ "model.layers.25.self_attn.q_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.25.self_attn.v_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.26.input_layernorm.weight": "model-00006-of-00015.safetensors",
+ "model.layers.26.mlp.down_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.26.mlp.gate_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.26.mlp.up_proj.weight": "model-00006-of-00015.safetensors",
+ "model.layers.26.gka.A_log": "model-00006-of-00015.safetensors",
+ "model.layers.26.gka.a_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.26.gka.alpha_proj.bias": "model-00007-of-00015.safetensors",
+ "model.layers.26.gka.alpha_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.26.gka.b_proj.bias": "model-00007-of-00015.safetensors",
+ "model.layers.26.gka.b_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.26.gka.dt_bias": "model-00006-of-00015.safetensors",
+ "model.layers.26.gka.g_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.26.gka.k_conv1d.weight": "model-00007-of-00015.safetensors",
+ "model.layers.26.gka.k_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.26.gka.o_norm.weight": "model-00007-of-00015.safetensors",
+ "model.layers.26.gka.o_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.26.gka.q_conv1d.weight": "model-00007-of-00015.safetensors",
+ "model.layers.26.gka.q_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.26.gka.v_conv1d.weight": "model-00007-of-00015.safetensors",
+ "model.layers.26.gka.v_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.26.post_attention_layernorm.weight": "model-00006-of-00015.safetensors",
+ "model.layers.27.input_layernorm.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.mlp.down_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.mlp.gate_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.mlp.up_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.A_log": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.a_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.alpha_proj.bias": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.alpha_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.b_proj.bias": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.b_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.dt_bias": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.g_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.k_conv1d.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.k_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.o_norm.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.o_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.q_conv1d.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.q_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.v_conv1d.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.gka.v_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.27.post_attention_layernorm.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.input_layernorm.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.mlp.down_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.mlp.gate_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.mlp.up_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.A_log": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.a_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.alpha_proj.bias": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.alpha_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.b_proj.bias": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.b_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.dt_bias": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.g_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.k_conv1d.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.k_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.o_norm.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.o_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.q_conv1d.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.q_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.v_conv1d.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.gka.v_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.28.post_attention_layernorm.weight": "model-00007-of-00015.safetensors",
+ "model.layers.29.input_layernorm.weight": "model-00007-of-00015.safetensors",
+ "model.layers.29.mlp.down_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.29.mlp.gate_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.29.mlp.up_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.29.gka.A_log": "model-00007-of-00015.safetensors",
+ "model.layers.29.gka.a_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.29.gka.alpha_proj.bias": "model-00007-of-00015.safetensors",
+ "model.layers.29.gka.alpha_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.29.gka.b_proj.bias": "model-00007-of-00015.safetensors",
+ "model.layers.29.gka.b_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.29.gka.dt_bias": "model-00007-of-00015.safetensors",
+ "model.layers.29.gka.g_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.29.gka.k_conv1d.weight": "model-00007-of-00015.safetensors",
+ "model.layers.29.gka.k_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.29.gka.o_norm.weight": "model-00007-of-00015.safetensors",
+ "model.layers.29.gka.o_proj.weight": "model-00007-of-00015.safetensors",
+ "model.layers.29.gka.q_conv1d.weight": "model-00007-of-00015.safetensors",
+ "model.layers.29.gka.q_proj.weight": "model-00007-of-00015.safetensors",
418
+ "model.layers.29.gka.v_conv1d.weight": "model-00007-of-00015.safetensors",
419
+ "model.layers.29.gka.v_proj.weight": "model-00007-of-00015.safetensors",
420
+ "model.layers.29.post_attention_layernorm.weight": "model-00007-of-00015.safetensors",
421
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00015.safetensors",
422
+ "model.layers.3.mlp.down_proj.weight": "model-00002-of-00015.safetensors",
423
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00015.safetensors",
424
+ "model.layers.3.mlp.up_proj.weight": "model-00002-of-00015.safetensors",
425
+ "model.layers.3.gka.A_log": "model-00002-of-00015.safetensors",
426
+ "model.layers.3.gka.a_proj.weight": "model-00002-of-00015.safetensors",
427
+ "model.layers.3.gka.alpha_proj.bias": "model-00002-of-00015.safetensors",
428
+ "model.layers.3.gka.alpha_proj.weight": "model-00002-of-00015.safetensors",
429
+ "model.layers.3.gka.b_proj.bias": "model-00002-of-00015.safetensors",
430
+ "model.layers.3.gka.b_proj.weight": "model-00002-of-00015.safetensors",
431
+ "model.layers.3.gka.dt_bias": "model-00002-of-00015.safetensors",
432
+ "model.layers.3.gka.g_proj.weight": "model-00002-of-00015.safetensors",
433
+ "model.layers.3.gka.k_conv1d.weight": "model-00002-of-00015.safetensors",
434
+ "model.layers.3.gka.k_proj.weight": "model-00002-of-00015.safetensors",
435
+ "model.layers.3.gka.o_norm.weight": "model-00002-of-00015.safetensors",
436
+ "model.layers.3.gka.o_proj.weight": "model-00002-of-00015.safetensors",
437
+ "model.layers.3.gka.q_conv1d.weight": "model-00002-of-00015.safetensors",
438
+ "model.layers.3.gka.q_proj.weight": "model-00002-of-00015.safetensors",
439
+ "model.layers.3.gka.v_conv1d.weight": "model-00002-of-00015.safetensors",
440
+ "model.layers.3.gka.v_proj.weight": "model-00002-of-00015.safetensors",
441
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00015.safetensors",
442
+ "model.layers.30.input_layernorm.weight": "model-00007-of-00015.safetensors",
443
+ "model.layers.30.mlp.down_proj.weight": "model-00007-of-00015.safetensors",
444
+ "model.layers.30.mlp.gate_proj.weight": "model-00007-of-00015.safetensors",
445
+ "model.layers.30.mlp.up_proj.weight": "model-00007-of-00015.safetensors",
446
+ "model.layers.30.post_attention_layernorm.weight": "model-00007-of-00015.safetensors",
447
+ "model.layers.30.self_attn.k_norm.weight": "model-00007-of-00015.safetensors",
448
+ "model.layers.30.self_attn.k_proj.weight": "model-00007-of-00015.safetensors",
449
+ "model.layers.30.self_attn.o_proj.weight": "model-00007-of-00015.safetensors",
450
+ "model.layers.30.self_attn.q_norm.weight": "model-00007-of-00015.safetensors",
451
+ "model.layers.30.self_attn.q_proj.weight": "model-00007-of-00015.safetensors",
452
+ "model.layers.30.self_attn.v_proj.weight": "model-00007-of-00015.safetensors",
453
+ "model.layers.31.input_layernorm.weight": "model-00008-of-00015.safetensors",
454
+ "model.layers.31.mlp.down_proj.weight": "model-00008-of-00015.safetensors",
455
+ "model.layers.31.mlp.gate_proj.weight": "model-00007-of-00015.safetensors",
456
+ "model.layers.31.mlp.up_proj.weight": "model-00008-of-00015.safetensors",
457
+ "model.layers.31.post_attention_layernorm.weight": "model-00008-of-00015.safetensors",
458
+ "model.layers.31.self_attn.k_norm.weight": "model-00007-of-00015.safetensors",
459
+ "model.layers.31.self_attn.k_proj.weight": "model-00007-of-00015.safetensors",
460
+ "model.layers.31.self_attn.o_proj.weight": "model-00007-of-00015.safetensors",
461
+ "model.layers.31.self_attn.q_norm.weight": "model-00007-of-00015.safetensors",
462
+ "model.layers.31.self_attn.q_proj.weight": "model-00007-of-00015.safetensors",
463
+ "model.layers.31.self_attn.v_proj.weight": "model-00007-of-00015.safetensors",
464
+ "model.layers.32.input_layernorm.weight": "model-00008-of-00015.safetensors",
465
+ "model.layers.32.mlp.down_proj.weight": "model-00008-of-00015.safetensors",
466
+ "model.layers.32.mlp.gate_proj.weight": "model-00008-of-00015.safetensors",
467
+ "model.layers.32.mlp.up_proj.weight": "model-00008-of-00015.safetensors",
468
+ "model.layers.32.gka.A_log": "model-00008-of-00015.safetensors",
469
+ "model.layers.32.gka.a_proj.weight": "model-00008-of-00015.safetensors",
470
+ "model.layers.32.gka.alpha_proj.bias": "model-00008-of-00015.safetensors",
471
+ "model.layers.32.gka.alpha_proj.weight": "model-00008-of-00015.safetensors",
472
+ "model.layers.32.gka.b_proj.bias": "model-00008-of-00015.safetensors",
473
+ "model.layers.32.gka.b_proj.weight": "model-00008-of-00015.safetensors",
474
+ "model.layers.32.gka.dt_bias": "model-00008-of-00015.safetensors",
475
+ "model.layers.32.gka.g_proj.weight": "model-00008-of-00015.safetensors",
476
+ "model.layers.32.gka.k_conv1d.weight": "model-00008-of-00015.safetensors",
477
+ "model.layers.32.gka.k_proj.weight": "model-00008-of-00015.safetensors",
478
+ "model.layers.32.gka.o_norm.weight": "model-00008-of-00015.safetensors",
479
+ "model.layers.32.gka.o_proj.weight": "model-00008-of-00015.safetensors",
480
+ "model.layers.32.gka.q_conv1d.weight": "model-00008-of-00015.safetensors",
481
+ "model.layers.32.gka.q_proj.weight": "model-00008-of-00015.safetensors",
482
+ "model.layers.32.gka.v_conv1d.weight": "model-00008-of-00015.safetensors",
483
+ "model.layers.32.gka.v_proj.weight": "model-00008-of-00015.safetensors",
484
+ "model.layers.32.post_attention_layernorm.weight": "model-00008-of-00015.safetensors",
485
+ "model.layers.33.input_layernorm.weight": "model-00008-of-00015.safetensors",
486
+ "model.layers.33.mlp.down_proj.weight": "model-00008-of-00015.safetensors",
487
+ "model.layers.33.mlp.gate_proj.weight": "model-00008-of-00015.safetensors",
488
+ "model.layers.33.mlp.up_proj.weight": "model-00008-of-00015.safetensors",
489
+ "model.layers.33.gka.A_log": "model-00008-of-00015.safetensors",
490
+ "model.layers.33.gka.a_proj.weight": "model-00008-of-00015.safetensors",
491
+ "model.layers.33.gka.alpha_proj.bias": "model-00008-of-00015.safetensors",
492
+ "model.layers.33.gka.alpha_proj.weight": "model-00008-of-00015.safetensors",
493
+ "model.layers.33.gka.b_proj.bias": "model-00008-of-00015.safetensors",
494
+ "model.layers.33.gka.b_proj.weight": "model-00008-of-00015.safetensors",
495
+ "model.layers.33.gka.dt_bias": "model-00008-of-00015.safetensors",
496
+ "model.layers.33.gka.g_proj.weight": "model-00008-of-00015.safetensors",
497
+ "model.layers.33.gka.k_conv1d.weight": "model-00008-of-00015.safetensors",
498
+ "model.layers.33.gka.k_proj.weight": "model-00008-of-00015.safetensors",
499
+ "model.layers.33.gka.o_norm.weight": "model-00008-of-00015.safetensors",
500
+ "model.layers.33.gka.o_proj.weight": "model-00008-of-00015.safetensors",
501
+ "model.layers.33.gka.q_conv1d.weight": "model-00008-of-00015.safetensors",
502
+ "model.layers.33.gka.q_proj.weight": "model-00008-of-00015.safetensors",
503
+ "model.layers.33.gka.v_conv1d.weight": "model-00008-of-00015.safetensors",
504
+ "model.layers.33.gka.v_proj.weight": "model-00008-of-00015.safetensors",
505
+ "model.layers.33.post_attention_layernorm.weight": "model-00008-of-00015.safetensors",
506
+ "model.layers.34.input_layernorm.weight": "model-00008-of-00015.safetensors",
507
+ "model.layers.34.mlp.down_proj.weight": "model-00008-of-00015.safetensors",
508
+ "model.layers.34.mlp.gate_proj.weight": "model-00008-of-00015.safetensors",
509
+ "model.layers.34.mlp.up_proj.weight": "model-00008-of-00015.safetensors",
510
+ "model.layers.34.post_attention_layernorm.weight": "model-00008-of-00015.safetensors",
511
+ "model.layers.34.self_attn.k_norm.weight": "model-00008-of-00015.safetensors",
512
+ "model.layers.34.self_attn.k_proj.weight": "model-00008-of-00015.safetensors",
513
+ "model.layers.34.self_attn.o_proj.weight": "model-00008-of-00015.safetensors",
514
+ "model.layers.34.self_attn.q_norm.weight": "model-00008-of-00015.safetensors",
515
+ "model.layers.34.self_attn.q_proj.weight": "model-00008-of-00015.safetensors",
516
+ "model.layers.34.self_attn.v_proj.weight": "model-00008-of-00015.safetensors",
517
+ "model.layers.35.input_layernorm.weight": "model-00008-of-00015.safetensors",
518
+ "model.layers.35.mlp.down_proj.weight": "model-00008-of-00015.safetensors",
519
+ "model.layers.35.mlp.gate_proj.weight": "model-00008-of-00015.safetensors",
520
+ "model.layers.35.mlp.up_proj.weight": "model-00008-of-00015.safetensors",
521
+ "model.layers.35.gka.A_log": "model-00008-of-00015.safetensors",
522
+ "model.layers.35.gka.a_proj.weight": "model-00008-of-00015.safetensors",
523
+ "model.layers.35.gka.alpha_proj.bias": "model-00008-of-00015.safetensors",
524
+ "model.layers.35.gka.alpha_proj.weight": "model-00008-of-00015.safetensors",
525
+ "model.layers.35.gka.b_proj.bias": "model-00008-of-00015.safetensors",
526
+ "model.layers.35.gka.b_proj.weight": "model-00008-of-00015.safetensors",
527
+ "model.layers.35.gka.dt_bias": "model-00008-of-00015.safetensors",
528
+ "model.layers.35.gka.g_proj.weight": "model-00008-of-00015.safetensors",
529
+ "model.layers.35.gka.k_conv1d.weight": "model-00008-of-00015.safetensors",
530
+ "model.layers.35.gka.k_proj.weight": "model-00008-of-00015.safetensors",
531
+ "model.layers.35.gka.o_norm.weight": "model-00008-of-00015.safetensors",
532
+ "model.layers.35.gka.o_proj.weight": "model-00008-of-00015.safetensors",
533
+ "model.layers.35.gka.q_conv1d.weight": "model-00008-of-00015.safetensors",
534
+ "model.layers.35.gka.q_proj.weight": "model-00008-of-00015.safetensors",
535
+ "model.layers.35.gka.v_conv1d.weight": "model-00008-of-00015.safetensors",
536
+ "model.layers.35.gka.v_proj.weight": "model-00008-of-00015.safetensors",
537
+ "model.layers.35.post_attention_layernorm.weight": "model-00008-of-00015.safetensors",
538
+ "model.layers.36.input_layernorm.weight": "model-00008-of-00015.safetensors",
539
+ "model.layers.36.mlp.down_proj.weight": "model-00009-of-00015.safetensors",
540
+ "model.layers.36.mlp.gate_proj.weight": "model-00008-of-00015.safetensors",
541
+ "model.layers.36.mlp.up_proj.weight": "model-00009-of-00015.safetensors",
542
+ "model.layers.36.gka.A_log": "model-00009-of-00015.safetensors",
543
+ "model.layers.36.gka.a_proj.weight": "model-00009-of-00015.safetensors",
544
+ "model.layers.36.gka.alpha_proj.bias": "model-00009-of-00015.safetensors",
545
+ "model.layers.36.gka.alpha_proj.weight": "model-00009-of-00015.safetensors",
546
+ "model.layers.36.gka.b_proj.bias": "model-00009-of-00015.safetensors",
547
+ "model.layers.36.gka.b_proj.weight": "model-00009-of-00015.safetensors",
548
+ "model.layers.36.gka.dt_bias": "model-00009-of-00015.safetensors",
549
+ "model.layers.36.gka.g_proj.weight": "model-00009-of-00015.safetensors",
550
+ "model.layers.36.gka.k_conv1d.weight": "model-00009-of-00015.safetensors",
551
+ "model.layers.36.gka.k_proj.weight": "model-00009-of-00015.safetensors",
552
+ "model.layers.36.gka.o_norm.weight": "model-00009-of-00015.safetensors",
553
+ "model.layers.36.gka.o_proj.weight": "model-00009-of-00015.safetensors",
554
+ "model.layers.36.gka.q_conv1d.weight": "model-00009-of-00015.safetensors",
555
+ "model.layers.36.gka.q_proj.weight": "model-00009-of-00015.safetensors",
556
+ "model.layers.36.gka.v_conv1d.weight": "model-00009-of-00015.safetensors",
557
+ "model.layers.36.gka.v_proj.weight": "model-00009-of-00015.safetensors",
558
+ "model.layers.36.post_attention_layernorm.weight": "model-00008-of-00015.safetensors",
559
+ "model.layers.37.input_layernorm.weight": "model-00009-of-00015.safetensors",
560
+ "model.layers.37.mlp.down_proj.weight": "model-00009-of-00015.safetensors",
561
+ "model.layers.37.mlp.gate_proj.weight": "model-00009-of-00015.safetensors",
562
+ "model.layers.37.mlp.up_proj.weight": "model-00009-of-00015.safetensors",
563
+ "model.layers.37.post_attention_layernorm.weight": "model-00009-of-00015.safetensors",
564
+ "model.layers.37.self_attn.k_norm.weight": "model-00009-of-00015.safetensors",
565
+ "model.layers.37.self_attn.k_proj.weight": "model-00009-of-00015.safetensors",
566
+ "model.layers.37.self_attn.o_proj.weight": "model-00009-of-00015.safetensors",
567
+ "model.layers.37.self_attn.q_norm.weight": "model-00009-of-00015.safetensors",
568
+ "model.layers.37.self_attn.q_proj.weight": "model-00009-of-00015.safetensors",
569
+ "model.layers.37.self_attn.v_proj.weight": "model-00009-of-00015.safetensors",
570
+ "model.layers.38.input_layernorm.weight": "model-00009-of-00015.safetensors",
571
+ "model.layers.38.mlp.down_proj.weight": "model-00009-of-00015.safetensors",
572
+ "model.layers.38.mlp.gate_proj.weight": "model-00009-of-00015.safetensors",
573
+ "model.layers.38.mlp.up_proj.weight": "model-00009-of-00015.safetensors",
574
+ "model.layers.38.post_attention_layernorm.weight": "model-00009-of-00015.safetensors",
575
+ "model.layers.38.self_attn.k_norm.weight": "model-00009-of-00015.safetensors",
576
+ "model.layers.38.self_attn.k_proj.weight": "model-00009-of-00015.safetensors",
577
+ "model.layers.38.self_attn.o_proj.weight": "model-00009-of-00015.safetensors",
578
+ "model.layers.38.self_attn.q_norm.weight": "model-00009-of-00015.safetensors",
579
+ "model.layers.38.self_attn.q_proj.weight": "model-00009-of-00015.safetensors",
580
+ "model.layers.38.self_attn.v_proj.weight": "model-00009-of-00015.safetensors",
581
+ "model.layers.39.input_layernorm.weight": "model-00009-of-00015.safetensors",
582
+ "model.layers.39.mlp.down_proj.weight": "model-00009-of-00015.safetensors",
583
+ "model.layers.39.mlp.gate_proj.weight": "model-00009-of-00015.safetensors",
584
+ "model.layers.39.mlp.up_proj.weight": "model-00009-of-00015.safetensors",
585
+ "model.layers.39.gka.A_log": "model-00009-of-00015.safetensors",
586
+ "model.layers.39.gka.a_proj.weight": "model-00009-of-00015.safetensors",
587
+ "model.layers.39.gka.alpha_proj.bias": "model-00009-of-00015.safetensors",
588
+ "model.layers.39.gka.alpha_proj.weight": "model-00009-of-00015.safetensors",
589
+ "model.layers.39.gka.b_proj.bias": "model-00009-of-00015.safetensors",
590
+ "model.layers.39.gka.b_proj.weight": "model-00009-of-00015.safetensors",
591
+ "model.layers.39.gka.dt_bias": "model-00009-of-00015.safetensors",
592
+ "model.layers.39.gka.g_proj.weight": "model-00009-of-00015.safetensors",
593
+ "model.layers.39.gka.k_conv1d.weight": "model-00009-of-00015.safetensors",
594
+ "model.layers.39.gka.k_proj.weight": "model-00009-of-00015.safetensors",
595
+ "model.layers.39.gka.o_norm.weight": "model-00009-of-00015.safetensors",
596
+ "model.layers.39.gka.o_proj.weight": "model-00009-of-00015.safetensors",
597
+ "model.layers.39.gka.q_conv1d.weight": "model-00009-of-00015.safetensors",
598
+ "model.layers.39.gka.q_proj.weight": "model-00009-of-00015.safetensors",
599
+ "model.layers.39.gka.v_conv1d.weight": "model-00009-of-00015.safetensors",
600
+ "model.layers.39.gka.v_proj.weight": "model-00009-of-00015.safetensors",
601
+ "model.layers.39.post_attention_layernorm.weight": "model-00009-of-00015.safetensors",
602
+ "model.layers.4.input_layernorm.weight": "model-00002-of-00015.safetensors",
603
+ "model.layers.4.mlp.down_proj.weight": "model-00002-of-00015.safetensors",
604
+ "model.layers.4.mlp.gate_proj.weight": "model-00002-of-00015.safetensors",
605
+ "model.layers.4.mlp.up_proj.weight": "model-00002-of-00015.safetensors",
606
+ "model.layers.4.post_attention_layernorm.weight": "model-00002-of-00015.safetensors",
607
+ "model.layers.4.self_attn.k_norm.weight": "model-00002-of-00015.safetensors",
608
+ "model.layers.4.self_attn.k_proj.weight": "model-00002-of-00015.safetensors",
609
+ "model.layers.4.self_attn.o_proj.weight": "model-00002-of-00015.safetensors",
610
+ "model.layers.4.self_attn.q_norm.weight": "model-00002-of-00015.safetensors",
611
+ "model.layers.4.self_attn.q_proj.weight": "model-00002-of-00015.safetensors",
612
+ "model.layers.4.self_attn.v_proj.weight": "model-00002-of-00015.safetensors",
613
+ "model.layers.40.input_layernorm.weight": "model-00009-of-00015.safetensors",
614
+ "model.layers.40.mlp.down_proj.weight": "model-00009-of-00015.safetensors",
615
+ "model.layers.40.mlp.gate_proj.weight": "model-00009-of-00015.safetensors",
616
+ "model.layers.40.mlp.up_proj.weight": "model-00009-of-00015.safetensors",
617
+ "model.layers.40.gka.A_log": "model-00009-of-00015.safetensors",
618
+ "model.layers.40.gka.a_proj.weight": "model-00009-of-00015.safetensors",
619
+ "model.layers.40.gka.alpha_proj.bias": "model-00009-of-00015.safetensors",
620
+ "model.layers.40.gka.alpha_proj.weight": "model-00009-of-00015.safetensors",
621
+ "model.layers.40.gka.b_proj.bias": "model-00009-of-00015.safetensors",
622
+ "model.layers.40.gka.b_proj.weight": "model-00009-of-00015.safetensors",
623
+ "model.layers.40.gka.dt_bias": "model-00009-of-00015.safetensors",
624
+ "model.layers.40.gka.g_proj.weight": "model-00009-of-00015.safetensors",
625
+ "model.layers.40.gka.k_conv1d.weight": "model-00009-of-00015.safetensors",
626
+ "model.layers.40.gka.k_proj.weight": "model-00009-of-00015.safetensors",
627
+ "model.layers.40.gka.o_norm.weight": "model-00009-of-00015.safetensors",
628
+ "model.layers.40.gka.o_proj.weight": "model-00009-of-00015.safetensors",
629
+ "model.layers.40.gka.q_conv1d.weight": "model-00009-of-00015.safetensors",
630
+ "model.layers.40.gka.q_proj.weight": "model-00009-of-00015.safetensors",
631
+ "model.layers.40.gka.v_conv1d.weight": "model-00009-of-00015.safetensors",
632
+ "model.layers.40.gka.v_proj.weight": "model-00009-of-00015.safetensors",
633
+ "model.layers.40.post_attention_layernorm.weight": "model-00009-of-00015.safetensors",
634
+ "model.layers.41.input_layernorm.weight": "model-00009-of-00015.safetensors",
635
+ "model.layers.41.mlp.down_proj.weight": "model-00010-of-00015.safetensors",
636
+ "model.layers.41.mlp.gate_proj.weight": "model-00010-of-00015.safetensors",
637
+ "model.layers.41.mlp.up_proj.weight": "model-00010-of-00015.safetensors",
638
+ "model.layers.41.gka.A_log": "model-00010-of-00015.safetensors",
639
+ "model.layers.41.gka.a_proj.weight": "model-00010-of-00015.safetensors",
640
+ "model.layers.41.gka.alpha_proj.bias": "model-00010-of-00015.safetensors",
641
+ "model.layers.41.gka.alpha_proj.weight": "model-00010-of-00015.safetensors",
642
+ "model.layers.41.gka.b_proj.bias": "model-00010-of-00015.safetensors",
643
+ "model.layers.41.gka.b_proj.weight": "model-00010-of-00015.safetensors",
644
+ "model.layers.41.gka.dt_bias": "model-00010-of-00015.safetensors",
645
+ "model.layers.41.gka.g_proj.weight": "model-00010-of-00015.safetensors",
646
+ "model.layers.41.gka.k_conv1d.weight": "model-00010-of-00015.safetensors",
647
+ "model.layers.41.gka.k_proj.weight": "model-00010-of-00015.safetensors",
648
+ "model.layers.41.gka.o_norm.weight": "model-00010-of-00015.safetensors",
649
+ "model.layers.41.gka.o_proj.weight": "model-00010-of-00015.safetensors",
650
+ "model.layers.41.gka.q_conv1d.weight": "model-00010-of-00015.safetensors",
651
+ "model.layers.41.gka.q_proj.weight": "model-00010-of-00015.safetensors",
652
+ "model.layers.41.gka.v_conv1d.weight": "model-00010-of-00015.safetensors",
653
+ "model.layers.41.gka.v_proj.weight": "model-00010-of-00015.safetensors",
654
+ "model.layers.41.post_attention_layernorm.weight": "model-00009-of-00015.safetensors",
655
+ "model.layers.42.input_layernorm.weight": "model-00010-of-00015.safetensors",
656
+ "model.layers.42.mlp.down_proj.weight": "model-00010-of-00015.safetensors",
657
+ "model.layers.42.mlp.gate_proj.weight": "model-00010-of-00015.safetensors",
658
+ "model.layers.42.mlp.up_proj.weight": "model-00010-of-00015.safetensors",
659
+ "model.layers.42.post_attention_layernorm.weight": "model-00010-of-00015.safetensors",
660
+ "model.layers.42.self_attn.k_norm.weight": "model-00010-of-00015.safetensors",
661
+ "model.layers.42.self_attn.k_proj.weight": "model-00010-of-00015.safetensors",
662
+ "model.layers.42.self_attn.o_proj.weight": "model-00010-of-00015.safetensors",
663
+ "model.layers.42.self_attn.q_norm.weight": "model-00010-of-00015.safetensors",
664
+ "model.layers.42.self_attn.q_proj.weight": "model-00010-of-00015.safetensors",
665
+ "model.layers.42.self_attn.v_proj.weight": "model-00010-of-00015.safetensors",
666
+ "model.layers.43.input_layernorm.weight": "model-00010-of-00015.safetensors",
667
+ "model.layers.43.mlp.down_proj.weight": "model-00010-of-00015.safetensors",
668
+ "model.layers.43.mlp.gate_proj.weight": "model-00010-of-00015.safetensors",
669
+ "model.layers.43.mlp.up_proj.weight": "model-00010-of-00015.safetensors",
670
+ "model.layers.43.gka.A_log": "model-00010-of-00015.safetensors",
671
+ "model.layers.43.gka.a_proj.weight": "model-00010-of-00015.safetensors",
672
+ "model.layers.43.gka.alpha_proj.bias": "model-00010-of-00015.safetensors",
673
+ "model.layers.43.gka.alpha_proj.weight": "model-00010-of-00015.safetensors",
674
+ "model.layers.43.gka.b_proj.bias": "model-00010-of-00015.safetensors",
675
+ "model.layers.43.gka.b_proj.weight": "model-00010-of-00015.safetensors",
676
+ "model.layers.43.gka.dt_bias": "model-00010-of-00015.safetensors",
677
+ "model.layers.43.gka.g_proj.weight": "model-00010-of-00015.safetensors",
678
+ "model.layers.43.gka.k_conv1d.weight": "model-00010-of-00015.safetensors",
679
+ "model.layers.43.gka.k_proj.weight": "model-00010-of-00015.safetensors",
680
+ "model.layers.43.gka.o_norm.weight": "model-00010-of-00015.safetensors",
681
+ "model.layers.43.gka.o_proj.weight": "model-00010-of-00015.safetensors",
682
+ "model.layers.43.gka.q_conv1d.weight": "model-00010-of-00015.safetensors",
683
+ "model.layers.43.gka.q_proj.weight": "model-00010-of-00015.safetensors",
684
+ "model.layers.43.gka.v_conv1d.weight": "model-00010-of-00015.safetensors",
685
+ "model.layers.43.gka.v_proj.weight": "model-00010-of-00015.safetensors",
686
+ "model.layers.43.post_attention_layernorm.weight": "model-00010-of-00015.safetensors",
687
+ "model.layers.44.input_layernorm.weight": "model-00010-of-00015.safetensors",
688
+ "model.layers.44.mlp.down_proj.weight": "model-00010-of-00015.safetensors",
689
+ "model.layers.44.mlp.gate_proj.weight": "model-00010-of-00015.safetensors",
690
+ "model.layers.44.mlp.up_proj.weight": "model-00010-of-00015.safetensors",
691
+ "model.layers.44.gka.A_log": "model-00010-of-00015.safetensors",
692
+ "model.layers.44.gka.a_proj.weight": "model-00010-of-00015.safetensors",
693
+ "model.layers.44.gka.alpha_proj.bias": "model-00010-of-00015.safetensors",
694
+ "model.layers.44.gka.alpha_proj.weight": "model-00010-of-00015.safetensors",
695
+ "model.layers.44.gka.b_proj.bias": "model-00010-of-00015.safetensors",
696
+ "model.layers.44.gka.b_proj.weight": "model-00010-of-00015.safetensors",
697
+ "model.layers.44.gka.dt_bias": "model-00010-of-00015.safetensors",
698
+ "model.layers.44.gka.g_proj.weight": "model-00010-of-00015.safetensors",
699
+ "model.layers.44.gka.k_conv1d.weight": "model-00010-of-00015.safetensors",
700
+ "model.layers.44.gka.k_proj.weight": "model-00010-of-00015.safetensors",
701
+ "model.layers.44.gka.o_norm.weight": "model-00010-of-00015.safetensors",
702
+ "model.layers.44.gka.o_proj.weight": "model-00010-of-00015.safetensors",
703
+ "model.layers.44.gka.q_conv1d.weight": "model-00010-of-00015.safetensors",
704
+ "model.layers.44.gka.q_proj.weight": "model-00010-of-00015.safetensors",
705
+ "model.layers.44.gka.v_conv1d.weight": "model-00010-of-00015.safetensors",
706
+ "model.layers.44.gka.v_proj.weight": "model-00010-of-00015.safetensors",
707
+ "model.layers.44.post_attention_layernorm.weight": "model-00010-of-00015.safetensors",
708
+ "model.layers.45.input_layernorm.weight": "model-00011-of-00015.safetensors",
709
+ "model.layers.45.mlp.down_proj.weight": "model-00011-of-00015.safetensors",
710
+ "model.layers.45.mlp.gate_proj.weight": "model-00010-of-00015.safetensors",
711
+ "model.layers.45.mlp.up_proj.weight": "model-00010-of-00015.safetensors",
712
+ "model.layers.45.post_attention_layernorm.weight": "model-00011-of-00015.safetensors",
713
+ "model.layers.45.self_attn.k_norm.weight": "model-00010-of-00015.safetensors",
714
+ "model.layers.45.self_attn.k_proj.weight": "model-00010-of-00015.safetensors",
715
+ "model.layers.45.self_attn.o_proj.weight": "model-00010-of-00015.safetensors",
716
+ "model.layers.45.self_attn.q_norm.weight": "model-00010-of-00015.safetensors",
717
+ "model.layers.45.self_attn.q_proj.weight": "model-00010-of-00015.safetensors",
718
+ "model.layers.45.self_attn.v_proj.weight": "model-00010-of-00015.safetensors",
719
+ "model.layers.46.input_layernorm.weight": "model-00011-of-00015.safetensors",
720
+ "model.layers.46.mlp.down_proj.weight": "model-00011-of-00015.safetensors",
721
+ "model.layers.46.mlp.gate_proj.weight": "model-00011-of-00015.safetensors",
722
+ "model.layers.46.mlp.up_proj.weight": "model-00011-of-00015.safetensors",
723
+ "model.layers.46.post_attention_layernorm.weight": "model-00011-of-00015.safetensors",
724
+ "model.layers.46.self_attn.k_norm.weight": "model-00011-of-00015.safetensors",
725
+ "model.layers.46.self_attn.k_proj.weight": "model-00011-of-00015.safetensors",
726
+ "model.layers.46.self_attn.o_proj.weight": "model-00011-of-00015.safetensors",
727
+ "model.layers.46.self_attn.q_norm.weight": "model-00011-of-00015.safetensors",
728
+ "model.layers.46.self_attn.q_proj.weight": "model-00011-of-00015.safetensors",
729
+ "model.layers.46.self_attn.v_proj.weight": "model-00011-of-00015.safetensors",
730
+ "model.layers.47.input_layernorm.weight": "model-00011-of-00015.safetensors",
731
+ "model.layers.47.mlp.down_proj.weight": "model-00011-of-00015.safetensors",
732
+ "model.layers.47.mlp.gate_proj.weight": "model-00011-of-00015.safetensors",
733
+ "model.layers.47.mlp.up_proj.weight": "model-00011-of-00015.safetensors",
734
+ "model.layers.47.post_attention_layernorm.weight": "model-00011-of-00015.safetensors",
735
+ "model.layers.47.self_attn.k_norm.weight": "model-00011-of-00015.safetensors",
736
+ "model.layers.47.self_attn.k_proj.weight": "model-00011-of-00015.safetensors",
737
+ "model.layers.47.self_attn.o_proj.weight": "model-00011-of-00015.safetensors",
738
+ "model.layers.47.self_attn.q_norm.weight": "model-00011-of-00015.safetensors",
739
+ "model.layers.47.self_attn.q_proj.weight": "model-00011-of-00015.safetensors",
740
+ "model.layers.47.self_attn.v_proj.weight": "model-00011-of-00015.safetensors",
741
+ "model.layers.48.input_layernorm.weight": "model-00011-of-00015.safetensors",
742
+ "model.layers.48.mlp.down_proj.weight": "model-00011-of-00015.safetensors",
743
+ "model.layers.48.mlp.gate_proj.weight": "model-00011-of-00015.safetensors",
744
+ "model.layers.48.mlp.up_proj.weight": "model-00011-of-00015.safetensors",
745
+ "model.layers.48.post_attention_layernorm.weight": "model-00011-of-00015.safetensors",
746
+ "model.layers.48.self_attn.k_norm.weight": "model-00011-of-00015.safetensors",
747
+ "model.layers.48.self_attn.k_proj.weight": "model-00011-of-00015.safetensors",
748
+ "model.layers.48.self_attn.o_proj.weight": "model-00011-of-00015.safetensors",
749
+ "model.layers.48.self_attn.q_norm.weight": "model-00011-of-00015.safetensors",
750
+ "model.layers.48.self_attn.q_proj.weight": "model-00011-of-00015.safetensors",
751
+ "model.layers.48.self_attn.v_proj.weight": "model-00011-of-00015.safetensors",
752
+ "model.layers.49.input_layernorm.weight": "model-00011-of-00015.safetensors",
753
+ "model.layers.49.mlp.down_proj.weight": "model-00011-of-00015.safetensors",
754
+ "model.layers.49.mlp.gate_proj.weight": "model-00011-of-00015.safetensors",
755
+ "model.layers.49.mlp.up_proj.weight": "model-00011-of-00015.safetensors",
756
+ "model.layers.49.post_attention_layernorm.weight": "model-00011-of-00015.safetensors",
757
+ "model.layers.49.self_attn.k_norm.weight": "model-00011-of-00015.safetensors",
758
+ "model.layers.49.self_attn.k_proj.weight": "model-00011-of-00015.safetensors",
759
+ "model.layers.49.self_attn.o_proj.weight": "model-00011-of-00015.safetensors",
760
+ "model.layers.49.self_attn.q_norm.weight": "model-00011-of-00015.safetensors",
761
+ "model.layers.49.self_attn.q_proj.weight": "model-00011-of-00015.safetensors",
762
+ "model.layers.49.self_attn.v_proj.weight": "model-00011-of-00015.safetensors",
763
+ "model.layers.5.input_layernorm.weight": "model-00002-of-00015.safetensors",
764
+ "model.layers.5.mlp.down_proj.weight": "model-00002-of-00015.safetensors",
765
+ "model.layers.5.mlp.gate_proj.weight": "model-00002-of-00015.safetensors",
766
+ "model.layers.5.mlp.up_proj.weight": "model-00002-of-00015.safetensors",
767
+ "model.layers.5.post_attention_layernorm.weight": "model-00002-of-00015.safetensors",
768
+ "model.layers.5.self_attn.k_norm.weight": "model-00002-of-00015.safetensors",
769
+ "model.layers.5.self_attn.k_proj.weight": "model-00002-of-00015.safetensors",
770
+ "model.layers.5.self_attn.o_proj.weight": "model-00002-of-00015.safetensors",
771
+ "model.layers.5.self_attn.q_norm.weight": "model-00002-of-00015.safetensors",
772
+ "model.layers.5.self_attn.q_proj.weight": "model-00002-of-00015.safetensors",
773
+ "model.layers.5.self_attn.v_proj.weight": "model-00002-of-00015.safetensors",
774
+ "model.layers.50.input_layernorm.weight": "model-00012-of-00015.safetensors",
775
+ "model.layers.50.mlp.down_proj.weight": "model-00012-of-00015.safetensors",
776
+ "model.layers.50.mlp.gate_proj.weight": "model-00011-of-00015.safetensors",
777
+ "model.layers.50.mlp.up_proj.weight": "model-00011-of-00015.safetensors",
778
+ "model.layers.50.post_attention_layernorm.weight": "model-00012-of-00015.safetensors",
779
+ "model.layers.50.self_attn.k_norm.weight": "model-00011-of-00015.safetensors",
780
+ "model.layers.50.self_attn.k_proj.weight": "model-00011-of-00015.safetensors",
781
+ "model.layers.50.self_attn.o_proj.weight": "model-00011-of-00015.safetensors",
782
+ "model.layers.50.self_attn.q_norm.weight": "model-00011-of-00015.safetensors",
783
+ "model.layers.50.self_attn.q_proj.weight": "model-00011-of-00015.safetensors",
784
+ "model.layers.50.self_attn.v_proj.weight": "model-00011-of-00015.safetensors",
785
+ "model.layers.51.input_layernorm.weight": "model-00012-of-00015.safetensors",
786
+ "model.layers.51.mlp.down_proj.weight": "model-00012-of-00015.safetensors",
787
+ "model.layers.51.mlp.gate_proj.weight": "model-00012-of-00015.safetensors",
788
+ "model.layers.51.mlp.up_proj.weight": "model-00012-of-00015.safetensors",
789
+ "model.layers.51.post_attention_layernorm.weight": "model-00012-of-00015.safetensors",
790
+ "model.layers.51.self_attn.k_norm.weight": "model-00012-of-00015.safetensors",
791
+ "model.layers.51.self_attn.k_proj.weight": "model-00012-of-00015.safetensors",
792
+ "model.layers.51.self_attn.o_proj.weight": "model-00012-of-00015.safetensors",
793
+ "model.layers.51.self_attn.q_norm.weight": "model-00012-of-00015.safetensors",
794
+ "model.layers.51.self_attn.q_proj.weight": "model-00012-of-00015.safetensors",
795
+ "model.layers.51.self_attn.v_proj.weight": "model-00012-of-00015.safetensors",
796
+ "model.layers.52.input_layernorm.weight": "model-00012-of-00015.safetensors",
797
+ "model.layers.52.mlp.down_proj.weight": "model-00012-of-00015.safetensors",
798
+ "model.layers.52.mlp.gate_proj.weight": "model-00012-of-00015.safetensors",
799
+ "model.layers.52.mlp.up_proj.weight": "model-00012-of-00015.safetensors",
800
+ "model.layers.52.post_attention_layernorm.weight": "model-00012-of-00015.safetensors",
801
+ "model.layers.52.self_attn.k_norm.weight": "model-00012-of-00015.safetensors",
802
+ "model.layers.52.self_attn.k_proj.weight": "model-00012-of-00015.safetensors",
803
+ "model.layers.52.self_attn.o_proj.weight": "model-00012-of-00015.safetensors",
804
+ "model.layers.52.self_attn.q_norm.weight": "model-00012-of-00015.safetensors",
805
+ "model.layers.52.self_attn.q_proj.weight": "model-00012-of-00015.safetensors",
806
+ "model.layers.52.self_attn.v_proj.weight": "model-00012-of-00015.safetensors",
807
+ "model.layers.53.input_layernorm.weight": "model-00012-of-00015.safetensors",
808
+ "model.layers.53.mlp.down_proj.weight": "model-00012-of-00015.safetensors",
809
+ "model.layers.53.mlp.gate_proj.weight": "model-00012-of-00015.safetensors",
810
+ "model.layers.53.mlp.up_proj.weight": "model-00012-of-00015.safetensors",
811
+ "model.layers.53.post_attention_layernorm.weight": "model-00012-of-00015.safetensors",
812
+ "model.layers.53.self_attn.k_norm.weight": "model-00012-of-00015.safetensors",
813
+ "model.layers.53.self_attn.k_proj.weight": "model-00012-of-00015.safetensors",
814
+ "model.layers.53.self_attn.o_proj.weight": "model-00012-of-00015.safetensors",
815
+ "model.layers.53.self_attn.q_norm.weight": "model-00012-of-00015.safetensors",
816
+ "model.layers.53.self_attn.q_proj.weight": "model-00012-of-00015.safetensors",
817
+ "model.layers.53.self_attn.v_proj.weight": "model-00012-of-00015.safetensors",
818
+ "model.layers.54.input_layernorm.weight": "model-00012-of-00015.safetensors",
819
+ "model.layers.54.mlp.down_proj.weight": "model-00012-of-00015.safetensors",
820
+ "model.layers.54.mlp.gate_proj.weight": "model-00012-of-00015.safetensors",
821
+ "model.layers.54.mlp.up_proj.weight": "model-00012-of-00015.safetensors",
822
+ "model.layers.54.post_attention_layernorm.weight": "model-00012-of-00015.safetensors",
823
+ "model.layers.54.self_attn.k_norm.weight": "model-00012-of-00015.safetensors",
824
+ "model.layers.54.self_attn.k_proj.weight": "model-00012-of-00015.safetensors",
825
+ "model.layers.54.self_attn.o_proj.weight": "model-00012-of-00015.safetensors",
826
+ "model.layers.54.self_attn.q_norm.weight": "model-00012-of-00015.safetensors",
827
+ "model.layers.54.self_attn.q_proj.weight": "model-00012-of-00015.safetensors",
828
+ "model.layers.54.self_attn.v_proj.weight": "model-00012-of-00015.safetensors",
829
+ "model.layers.55.input_layernorm.weight": "model-00013-of-00015.safetensors",
830
+ "model.layers.55.mlp.down_proj.weight": "model-00013-of-00015.safetensors",
831
+ "model.layers.55.mlp.gate_proj.weight": "model-00012-of-00015.safetensors",
832
+ "model.layers.55.mlp.up_proj.weight": "model-00012-of-00015.safetensors",
833
+ "model.layers.55.post_attention_layernorm.weight": "model-00013-of-00015.safetensors",
834
+ "model.layers.55.self_attn.k_norm.weight": "model-00012-of-00015.safetensors",
835
+ "model.layers.55.self_attn.k_proj.weight": "model-00012-of-00015.safetensors",
836
+ "model.layers.55.self_attn.o_proj.weight": "model-00012-of-00015.safetensors",
837
+ "model.layers.55.self_attn.q_norm.weight": "model-00012-of-00015.safetensors",
838
+ "model.layers.55.self_attn.q_proj.weight": "model-00012-of-00015.safetensors",
839
+ "model.layers.55.self_attn.v_proj.weight": "model-00012-of-00015.safetensors",
840
+ "model.layers.56.input_layernorm.weight": "model-00013-of-00015.safetensors",
841
+ "model.layers.56.mlp.down_proj.weight": "model-00013-of-00015.safetensors",
842
+ "model.layers.56.mlp.gate_proj.weight": "model-00013-of-00015.safetensors",
843
+ "model.layers.56.mlp.up_proj.weight": "model-00013-of-00015.safetensors",
844
+ "model.layers.56.post_attention_layernorm.weight": "model-00013-of-00015.safetensors",
845
+ "model.layers.56.self_attn.k_norm.weight": "model-00013-of-00015.safetensors",
846
+ "model.layers.56.self_attn.k_proj.weight": "model-00013-of-00015.safetensors",
847
+ "model.layers.56.self_attn.o_proj.weight": "model-00013-of-00015.safetensors",
848
+ "model.layers.56.self_attn.q_norm.weight": "model-00013-of-00015.safetensors",
849
+ "model.layers.56.self_attn.q_proj.weight": "model-00013-of-00015.safetensors",
850
+ "model.layers.56.self_attn.v_proj.weight": "model-00013-of-00015.safetensors",
851
+ "model.layers.57.input_layernorm.weight": "model-00013-of-00015.safetensors",
852
+ "model.layers.57.mlp.down_proj.weight": "model-00013-of-00015.safetensors",
853
+ "model.layers.57.mlp.gate_proj.weight": "model-00013-of-00015.safetensors",
854
+ "model.layers.57.mlp.up_proj.weight": "model-00013-of-00015.safetensors",
855
+ "model.layers.57.gka.A_log": "model-00013-of-00015.safetensors",
856
+ "model.layers.57.gka.a_proj.weight": "model-00013-of-00015.safetensors",
857
+ "model.layers.57.gka.alpha_proj.bias": "model-00013-of-00015.safetensors",
858
+ "model.layers.57.gka.alpha_proj.weight": "model-00013-of-00015.safetensors",
859
+ "model.layers.57.gka.b_proj.bias": "model-00013-of-00015.safetensors",
860
+ "model.layers.57.gka.b_proj.weight": "model-00013-of-00015.safetensors",
861
+ "model.layers.57.gka.dt_bias": "model-00013-of-00015.safetensors",
862
+ "model.layers.57.gka.g_proj.weight": "model-00013-of-00015.safetensors",
863
+ "model.layers.57.gka.k_conv1d.weight": "model-00013-of-00015.safetensors",
864
+ "model.layers.57.gka.k_proj.weight": "model-00013-of-00015.safetensors",
865
+ "model.layers.57.gka.o_norm.weight": "model-00013-of-00015.safetensors",
866
+ "model.layers.57.gka.o_proj.weight": "model-00013-of-00015.safetensors",
867
+ "model.layers.57.gka.q_conv1d.weight": "model-00013-of-00015.safetensors",
868
+ "model.layers.57.gka.q_proj.weight": "model-00013-of-00015.safetensors",
869
+ "model.layers.57.gka.v_conv1d.weight": "model-00013-of-00015.safetensors",
870
+ "model.layers.57.gka.v_proj.weight": "model-00013-of-00015.safetensors",
871
+ "model.layers.57.post_attention_layernorm.weight": "model-00013-of-00015.safetensors",
872
+ "model.layers.58.input_layernorm.weight": "model-00013-of-00015.safetensors",
873
+ "model.layers.58.mlp.down_proj.weight": "model-00013-of-00015.safetensors",
874
+ "model.layers.58.mlp.gate_proj.weight": "model-00013-of-00015.safetensors",
875
+ "model.layers.58.mlp.up_proj.weight": "model-00013-of-00015.safetensors",
876
+ "model.layers.58.post_attention_layernorm.weight": "model-00013-of-00015.safetensors",
877
+ "model.layers.58.self_attn.k_norm.weight": "model-00013-of-00015.safetensors",
878
+ "model.layers.58.self_attn.k_proj.weight": "model-00013-of-00015.safetensors",
879
+ "model.layers.58.self_attn.o_proj.weight": "model-00013-of-00015.safetensors",
880
+ "model.layers.58.self_attn.q_norm.weight": "model-00013-of-00015.safetensors",
881
+ "model.layers.58.self_attn.q_proj.weight": "model-00013-of-00015.safetensors",
882
+ "model.layers.58.self_attn.v_proj.weight": "model-00013-of-00015.safetensors",
883
+ "model.layers.59.input_layernorm.weight": "model-00013-of-00015.safetensors",
884
+ "model.layers.59.mlp.down_proj.weight": "model-00013-of-00015.safetensors",
885
+ "model.layers.59.mlp.gate_proj.weight": "model-00013-of-00015.safetensors",
886
+ "model.layers.59.mlp.up_proj.weight": "model-00013-of-00015.safetensors",
887
+ "model.layers.59.post_attention_layernorm.weight": "model-00013-of-00015.safetensors",
888
+ "model.layers.59.self_attn.k_norm.weight": "model-00013-of-00015.safetensors",
889
+ "model.layers.59.self_attn.k_proj.weight": "model-00013-of-00015.safetensors",
890
+ "model.layers.59.self_attn.o_proj.weight": "model-00013-of-00015.safetensors",
891
+ "model.layers.59.self_attn.q_norm.weight": "model-00013-of-00015.safetensors",
892
+ "model.layers.59.self_attn.q_proj.weight": "model-00013-of-00015.safetensors",
893
+ "model.layers.59.self_attn.v_proj.weight": "model-00013-of-00015.safetensors",
894
+ "model.layers.6.input_layernorm.weight": "model-00002-of-00015.safetensors",
895
+ "model.layers.6.mlp.down_proj.weight": "model-00002-of-00015.safetensors",
896
+ "model.layers.6.mlp.gate_proj.weight": "model-00002-of-00015.safetensors",
897
+ "model.layers.6.mlp.up_proj.weight": "model-00002-of-00015.safetensors",
898
+ "model.layers.6.post_attention_layernorm.weight": "model-00002-of-00015.safetensors",
899
+ "model.layers.6.self_attn.k_norm.weight": "model-00002-of-00015.safetensors",
900
+ "model.layers.6.self_attn.k_proj.weight": "model-00002-of-00015.safetensors",
901
+ "model.layers.6.self_attn.o_proj.weight": "model-00002-of-00015.safetensors",
902
+ "model.layers.6.self_attn.q_norm.weight": "model-00002-of-00015.safetensors",
903
+ "model.layers.6.self_attn.q_proj.weight": "model-00002-of-00015.safetensors",
904
+ "model.layers.6.self_attn.v_proj.weight": "model-00002-of-00015.safetensors",
905
+ "model.layers.60.input_layernorm.weight": "model-00013-of-00015.safetensors",
906
+ "model.layers.60.mlp.down_proj.weight": "model-00014-of-00015.safetensors",
907
+ "model.layers.60.mlp.gate_proj.weight": "model-00013-of-00015.safetensors",
908
+ "model.layers.60.mlp.up_proj.weight": "model-00013-of-00015.safetensors",
909
+ "model.layers.60.gka.A_log": "model-00014-of-00015.safetensors",
910
+ "model.layers.60.gka.a_proj.weight": "model-00014-of-00015.safetensors",
911
+ "model.layers.60.gka.alpha_proj.bias": "model-00014-of-00015.safetensors",
912
+ "model.layers.60.gka.alpha_proj.weight": "model-00014-of-00015.safetensors",
913
+ "model.layers.60.gka.b_proj.bias": "model-00014-of-00015.safetensors",
914
+ "model.layers.60.gka.b_proj.weight": "model-00014-of-00015.safetensors",
915
+ "model.layers.60.gka.dt_bias": "model-00014-of-00015.safetensors",
916
+ "model.layers.60.gka.g_proj.weight": "model-00014-of-00015.safetensors",
917
+ "model.layers.60.gka.k_conv1d.weight": "model-00014-of-00015.safetensors",
918
+ "model.layers.60.gka.k_proj.weight": "model-00014-of-00015.safetensors",
919
+ "model.layers.60.gka.o_norm.weight": "model-00014-of-00015.safetensors",
920
+ "model.layers.60.gka.o_proj.weight": "model-00014-of-00015.safetensors",
921
+ "model.layers.60.gka.q_conv1d.weight": "model-00014-of-00015.safetensors",
922
+ "model.layers.60.gka.q_proj.weight": "model-00014-of-00015.safetensors",
923
+ "model.layers.60.gka.v_conv1d.weight": "model-00014-of-00015.safetensors",
924
+ "model.layers.60.gka.v_proj.weight": "model-00014-of-00015.safetensors",
925
+ "model.layers.60.post_attention_layernorm.weight": "model-00013-of-00015.safetensors",
926
+ "model.layers.61.input_layernorm.weight": "model-00014-of-00015.safetensors",
927
+ "model.layers.61.mlp.down_proj.weight": "model-00014-of-00015.safetensors",
928
+ "model.layers.61.mlp.gate_proj.weight": "model-00014-of-00015.safetensors",
929
+ "model.layers.61.mlp.up_proj.weight": "model-00014-of-00015.safetensors",
930
+ "model.layers.61.post_attention_layernorm.weight": "model-00014-of-00015.safetensors",
931
+ "model.layers.61.self_attn.k_norm.weight": "model-00014-of-00015.safetensors",
932
+ "model.layers.61.self_attn.k_proj.weight": "model-00014-of-00015.safetensors",
933
+ "model.layers.61.self_attn.o_proj.weight": "model-00014-of-00015.safetensors",
934
+ "model.layers.61.self_attn.q_norm.weight": "model-00014-of-00015.safetensors",
935
+ "model.layers.61.self_attn.q_proj.weight": "model-00014-of-00015.safetensors",
936
+ "model.layers.61.self_attn.v_proj.weight": "model-00014-of-00015.safetensors",
937
+ "model.layers.62.input_layernorm.weight": "model-00014-of-00015.safetensors",
938
+ "model.layers.62.mlp.down_proj.weight": "model-00014-of-00015.safetensors",
939
+ "model.layers.62.mlp.gate_proj.weight": "model-00014-of-00015.safetensors",
940
+ "model.layers.62.mlp.up_proj.weight": "model-00014-of-00015.safetensors",
941
+ "model.layers.62.post_attention_layernorm.weight": "model-00014-of-00015.safetensors",
942
+ "model.layers.62.self_attn.k_norm.weight": "model-00014-of-00015.safetensors",
943
+ "model.layers.62.self_attn.k_proj.weight": "model-00014-of-00015.safetensors",
944
+ "model.layers.62.self_attn.o_proj.weight": "model-00014-of-00015.safetensors",
945
+ "model.layers.62.self_attn.q_norm.weight": "model-00014-of-00015.safetensors",
946
+ "model.layers.62.self_attn.q_proj.weight": "model-00014-of-00015.safetensors",
947
+ "model.layers.62.self_attn.v_proj.weight": "model-00014-of-00015.safetensors",
948
+ "model.layers.63.input_layernorm.weight": "model-00014-of-00015.safetensors",
949
+ "model.layers.63.mlp.down_proj.weight": "model-00014-of-00015.safetensors",
950
+ "model.layers.63.mlp.gate_proj.weight": "model-00014-of-00015.safetensors",
951
+ "model.layers.63.mlp.up_proj.weight": "model-00014-of-00015.safetensors",
952
+ "model.layers.63.gka.A_log": "model-00014-of-00015.safetensors",
953
+ "model.layers.63.gka.a_proj.weight": "model-00014-of-00015.safetensors",
954
+ "model.layers.63.gka.alpha_proj.bias": "model-00014-of-00015.safetensors",
955
+ "model.layers.63.gka.alpha_proj.weight": "model-00014-of-00015.safetensors",
956
+ "model.layers.63.gka.b_proj.bias": "model-00014-of-00015.safetensors",
957
+ "model.layers.63.gka.b_proj.weight": "model-00014-of-00015.safetensors",
958
+ "model.layers.63.gka.dt_bias": "model-00014-of-00015.safetensors",
959
+ "model.layers.63.gka.g_proj.weight": "model-00014-of-00015.safetensors",
960
+ "model.layers.63.gka.k_conv1d.weight": "model-00014-of-00015.safetensors",
961
+ "model.layers.63.gka.k_proj.weight": "model-00014-of-00015.safetensors",
962
+ "model.layers.63.gka.o_norm.weight": "model-00014-of-00015.safetensors",
963
+ "model.layers.63.gka.o_proj.weight": "model-00014-of-00015.safetensors",
964
+ "model.layers.63.gka.q_conv1d.weight": "model-00014-of-00015.safetensors",
965
+ "model.layers.63.gka.q_proj.weight": "model-00014-of-00015.safetensors",
966
+ "model.layers.63.gka.v_conv1d.weight": "model-00014-of-00015.safetensors",
967
+ "model.layers.63.gka.v_proj.weight": "model-00014-of-00015.safetensors",
968
+ "model.layers.63.post_attention_layernorm.weight": "model-00014-of-00015.safetensors",
969
+ "model.layers.7.input_layernorm.weight": "model-00002-of-00015.safetensors",
970
+ "model.layers.7.mlp.down_proj.weight": "model-00002-of-00015.safetensors",
971
+ "model.layers.7.mlp.gate_proj.weight": "model-00002-of-00015.safetensors",
972
+ "model.layers.7.mlp.up_proj.weight": "model-00002-of-00015.safetensors",
973
+ "model.layers.7.gka.A_log": "model-00002-of-00015.safetensors",
974
+ "model.layers.7.gka.a_proj.weight": "model-00002-of-00015.safetensors",
975
+ "model.layers.7.gka.alpha_proj.bias": "model-00002-of-00015.safetensors",
976
+ "model.layers.7.gka.alpha_proj.weight": "model-00002-of-00015.safetensors",
977
+ "model.layers.7.gka.b_proj.bias": "model-00002-of-00015.safetensors",
978
+ "model.layers.7.gka.b_proj.weight": "model-00002-of-00015.safetensors",
979
+ "model.layers.7.gka.dt_bias": "model-00002-of-00015.safetensors",
980
+ "model.layers.7.gka.g_proj.weight": "model-00002-of-00015.safetensors",
981
+ "model.layers.7.gka.k_conv1d.weight": "model-00002-of-00015.safetensors",
982
+ "model.layers.7.gka.k_proj.weight": "model-00002-of-00015.safetensors",
983
+ "model.layers.7.gka.o_norm.weight": "model-00002-of-00015.safetensors",
984
+ "model.layers.7.gka.o_proj.weight": "model-00002-of-00015.safetensors",
985
+ "model.layers.7.gka.q_conv1d.weight": "model-00002-of-00015.safetensors",
986
+ "model.layers.7.gka.q_proj.weight": "model-00002-of-00015.safetensors",
987
+ "model.layers.7.gka.v_conv1d.weight": "model-00002-of-00015.safetensors",
988
+ "model.layers.7.gka.v_proj.weight": "model-00002-of-00015.safetensors",
989
+ "model.layers.7.post_attention_layernorm.weight": "model-00002-of-00015.safetensors",
990
+ "model.layers.8.input_layernorm.weight": "model-00002-of-00015.safetensors",
991
+ "model.layers.8.mlp.down_proj.weight": "model-00003-of-00015.safetensors",
992
+ "model.layers.8.mlp.gate_proj.weight": "model-00003-of-00015.safetensors",
993
+ "model.layers.8.mlp.up_proj.weight": "model-00003-of-00015.safetensors",
994
+ "model.layers.8.gka.A_log": "model-00003-of-00015.safetensors",
995
+ "model.layers.8.gka.a_proj.weight": "model-00003-of-00015.safetensors",
996
+ "model.layers.8.gka.alpha_proj.bias": "model-00003-of-00015.safetensors",
997
+ "model.layers.8.gka.alpha_proj.weight": "model-00003-of-00015.safetensors",
998
+ "model.layers.8.gka.b_proj.bias": "model-00003-of-00015.safetensors",
999
+ "model.layers.8.gka.b_proj.weight": "model-00003-of-00015.safetensors",
1000
+ "model.layers.8.gka.dt_bias": "model-00003-of-00015.safetensors",
1001
+ "model.layers.8.gka.g_proj.weight": "model-00003-of-00015.safetensors",
1002
+ "model.layers.8.gka.k_conv1d.weight": "model-00003-of-00015.safetensors",
1003
+ "model.layers.8.gka.k_proj.weight": "model-00003-of-00015.safetensors",
1004
+ "model.layers.8.gka.o_norm.weight": "model-00003-of-00015.safetensors",
1005
+ "model.layers.8.gka.o_proj.weight": "model-00003-of-00015.safetensors",
1006
+ "model.layers.8.gka.q_conv1d.weight": "model-00003-of-00015.safetensors",
1007
+ "model.layers.8.gka.q_proj.weight": "model-00003-of-00015.safetensors",
1008
+ "model.layers.8.gka.v_conv1d.weight": "model-00003-of-00015.safetensors",
1009
+ "model.layers.8.gka.v_proj.weight": "model-00003-of-00015.safetensors",
1010
+ "model.layers.8.post_attention_layernorm.weight": "model-00002-of-00015.safetensors",
1011
+ "model.layers.9.input_layernorm.weight": "model-00003-of-00015.safetensors",
1012
+ "model.layers.9.mlp.down_proj.weight": "model-00003-of-00015.safetensors",
1013
+ "model.layers.9.mlp.gate_proj.weight": "model-00003-of-00015.safetensors",
1014
+ "model.layers.9.mlp.up_proj.weight": "model-00003-of-00015.safetensors",
1015
+ "model.layers.9.gka.A_log": "model-00003-of-00015.safetensors",
1016
+ "model.layers.9.gka.a_proj.weight": "model-00003-of-00015.safetensors",
1017
+ "model.layers.9.gka.alpha_proj.bias": "model-00003-of-00015.safetensors",
1018
+ "model.layers.9.gka.alpha_proj.weight": "model-00003-of-00015.safetensors",
1019
+ "model.layers.9.gka.b_proj.bias": "model-00003-of-00015.safetensors",
1020
+ "model.layers.9.gka.b_proj.weight": "model-00003-of-00015.safetensors",
1021
+ "model.layers.9.gka.dt_bias": "model-00003-of-00015.safetensors",
1022
+ "model.layers.9.gka.g_proj.weight": "model-00003-of-00015.safetensors",
1023
+ "model.layers.9.gka.k_conv1d.weight": "model-00003-of-00015.safetensors",
1024
+ "model.layers.9.gka.k_proj.weight": "model-00003-of-00015.safetensors",
1025
+ "model.layers.9.gka.o_norm.weight": "model-00003-of-00015.safetensors",
1026
+ "model.layers.9.gka.o_proj.weight": "model-00003-of-00015.safetensors",
1027
+ "model.layers.9.gka.q_conv1d.weight": "model-00003-of-00015.safetensors",
1028
+ "model.layers.9.gka.q_proj.weight": "model-00003-of-00015.safetensors",
1029
+ "model.layers.9.gka.v_conv1d.weight": "model-00003-of-00015.safetensors",
1030
+ "model.layers.9.gka.v_proj.weight": "model-00003-of-00015.safetensors",
1031
+ "model.layers.9.post_attention_layernorm.weight": "model-00003-of-00015.safetensors",
1032
+ "model.norm.weight": "model-00014-of-00015.safetensors"
1033
+ }
1034
+ }
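The `weight_map` in the index above is plain JSON mapping each tensor name to the shard file that stores it. As a minimal illustrative sketch (the dict literal below contains just a handful of entries copied from the map, standing in for parsing the full `model.safetensors.index.json`), the map can be inverted to list which tensors each shard contains:

```python
import json
from collections import defaultdict

# A few entries copied from the weight_map above; the real index file
# maps every tensor name in the checkpoint to one of the 15 shards.
index = {
    "weight_map": {
        "model.layers.51.self_attn.q_proj.weight": "model-00012-of-00015.safetensors",
        "model.layers.51.self_attn.v_proj.weight": "model-00012-of-00015.safetensors",
        "model.layers.60.gka.A_log": "model-00014-of-00015.safetensors",
        "model.norm.weight": "model-00014-of-00015.safetensors",
    }
}

def tensors_per_shard(index: dict) -> dict:
    """Invert weight_map: shard filename -> sorted list of tensor names."""
    shards = defaultdict(list)
    for tensor, shard in index["weight_map"].items():
        shards[shard].append(tensor)
    return {shard: sorted(names) for shard, names in shards.items()}

shards = tensors_per_shard(index)
print(json.dumps(shards, indent=2))
```

A loader can use this inversion to open each shard exactly once and read only the tensors it owns.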
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:be75606093db2094d7cd20f3c2f385c212750648bd6ea4fb2bf507a6a4c55506
+ size 11422650
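`tokenizer.json` is stored as a Git LFS pointer: the repository keeps only this small text stub (version, SHA-256 object id, and byte size), while the ~11 MB file itself lives in LFS storage. A minimal sketch of parsing such a pointer; the `parse_lfs_pointer` helper is illustrative, not part of any library:

```python
# Contents of the pointer file added in this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:be75606093db2094d7cd20f3c2f385c212750648bd6ea4fb2bf507a6a4c55506
size 11422650
"""

def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

info = parse_lfs_pointer(pointer)
algo, digest = info["oid"].split(":")
print(algo, info["size"])
```

LFS clients use the `oid` to fetch the real blob and the `size` to verify the download.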
tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
+ {
+ "add_prefix_space": false,
+ "backend": "tokenizers",
+ "bos_token": null,
+ "clean_up_tokenization_spaces": false,
+ "eos_token": "<|im_end|>",
+ "errors": "replace",
+ "is_local": true,
+ "model_max_length": 131072,
+ "pad_token": "<|endoftext|>",
+ "padding_side": "right",
+ "split_special_tokens": false,
+ "tokenizer_class": "Qwen2Tokenizer",
+ "unk_token": null
+ }
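`tokenizer_config.json` can be read with a plain JSON parse. The sketch below (using a subset of the fields shown above) checks that `model_max_length` of 131072 corresponds to a 128k-token context window:

```python
import json

# A subset of the tokenizer_config.json fields added in this commit.
config_text = """{
  "add_prefix_space": false,
  "eos_token": "<|im_end|>",
  "model_max_length": 131072,
  "pad_token": "<|endoftext|>",
  "padding_side": "right",
  "tokenizer_class": "Qwen2Tokenizer"
}"""

cfg = json.loads(config_text)
# 131072 = 128 * 1024, i.e. a 128k-token maximum sequence length.
assert cfg["model_max_length"] == 128 * 1024
print(cfg["tokenizer_class"], cfg["eos_token"])
```

These are the fields a tokenizer loader inspects to pick the tokenizer class, special tokens, and truncation limit.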