zancato committed
Commit 08af3f6 · 0 Parent(s)

Super-squash branch 'main' using huggingface_hub
.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,250 @@
+ ---
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: text-generation
+ tags:
+ - hybrid
+ - ssm
+ - state-space-model
+ - linear-attention
+ - mamba-2
+ - priming
+ - long-context
+ - instruction-tuned
+ base_model: Qwen/Qwen3-8B
+ paper:
+ - https://arxiv.org/abs/2405.21060
+ ---
+
+ # Mamba2-primed-HQwen3-8B-Instruct
+
+ Mamba2-primed-HQwen3-8B-Instruct is a Hybrid language model consisting of 50% Attention layers and 50% [Mamba2](https://arxiv.org/abs/2405.21060) layers, primed from [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) using the [Hybrid Model Factory](https://github.com/awslabs/hybrid-model-factory) Priming pipeline. The model is instruction-tuned and supports context lengths up to 128K tokens.
+
+ Mamba-2 is a State-Space Model layer with constant memory and linear compute cost in the sequence length.
+
+ By combining Attention with Mamba-2, our Hybrid model achieves up to **2× faster inference** at long contexts while **closely matching the base Transformer's quality**.
+
+ ## Links
+
+ - 📄 [Mamba-2 paper](https://arxiv.org/abs/2405.21060)
+ - 💻 [GitHub: Hybrid Model Factory](https://github.com/awslabs/hybrid-model-factory)
+
+ ## Why Hybrid?
+
+ Each Primed Hybrid model is initialized from a base Transformer by converting a portion of its Attention layers into State-Space Model (SSM) layers that maintain a fixed-size recurrent state instead of a growing KV cache. At a 50% Hybrid ratio, roughly half the KV cache (which grows linearly with sequence length) is replaced with fixed-size SSM state. The practical benefits:
+
+ - **Higher throughput at long contexts** — less memory on KV cache means more memory for batching
+ - **More concurrent sequences** — ~2× as many concurrent sequences before hitting memory limits
+ - **Growing advantage with context length** — at long contexts, Attention dominates the forward pass while SSM layers remain negligible in cost. Since the Hybrid model makes roughly half as many Attention calls as the base Transformer, the throughput advantage grows with context length
+
+ Increasing the hybridization ratio (replacing more Attention layers with SSM layers) further reduces memory use and increases throughput, typically at the expense of quality.
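To make the KV-cache saving concrete, here is a rough, illustrative back-of-envelope estimate using this model's published dimensions (36 layers, 8 KV heads, head dimension 128, bf16). It ignores runtime details such as paged-allocation granularity, so treat the numbers as a sketch rather than a measured footprint:

```python
def kv_cache_bytes(n_attn_layers, seq_len, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V tensors per Attention layer, growing linearly with sequence length
    return n_attn_layers * 2 * seq_len * kv_heads * head_dim * dtype_bytes

full = kv_cache_bytes(36, 128_000)    # all-Attention baseline
hybrid = kv_cache_bytes(18, 128_000)  # 50% Hybrid: only half the layers keep a KV cache
print(round(full / 2**30, 1), round(hybrid / 2**30, 1))  # GiB per sequence: 17.6 8.8
```

The ~9 GiB freed per 128K-token sequence is what lets the Hybrid model batch roughly twice as many concurrent sequences.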
43
+
44
+
45
+ ## Model Overview
46
+
47
+ - **Type**: Causal Language Model (Hybrid Attention + SSM)
48
+ - **Base Model**: Qwen3-8B
49
+ - **Hybrid Layer Type**: Mamba-2
50
+ - **Hybrid Ratio**: 50% (18 Attention + 18 Mamba-2 layers)
51
+ - **Parameters**: ~8B
52
+ - **Context Length**: 128K natively
53
+ - **Precision**: bfloat16
54
+ - **License**: Apache 2.0
55
+
56
+
57
+ Note, this is an Instruct-tuned model and is not a thinking model, that is, it does not natively produce chain-of-thought thinking tokens in its generation trace.
58
+
59
+ ## Benchmark Results
60
+
61
+ Below we report benchmark performance for all our instruct-tuned Primed models. All Hybrid models use a 50% Hybrid ratio and are Primed from Qwen3-8B.
62
+
63
+ We consider two baselines:
64
+
65
+ 1. **Qwen3-8B (non-thinking, from HF)**: The original Qwen model evaluated in non-thinking mode, which is the intended mode for an Instruct model. This serves as the base Transformer from which we start the Priming procedure.
66
+ 2. **Qwen3-8B (Long)**: The Qwen model fine-tuned on our priming data, extending its native context length from 32K to 128K. All Primed Hybrid models use the same training hyperparameters and data as this baseline, making it a fair comparison for differing architectures.
67
+
68
+ On both long- and short-context benchmarks, our Primed Hybrid models closely match the performance of the Transformer model while having [considerably lower deployment costs](#inference-efficiency), showcasing the efficacy of the Priming process.
69
+
70
+ ### Long-Context Benchmarks
71
+
72
+ Evaluated on [HELMET](https://github.com/princeton-nlp/HELMET), [MRCR](https://huggingface.co/datasets/openai/mrcr), and [BABILong](https://github.com/booydar/babilong) across context lengths from 8K to 128K, using a weighted average with geometrically increasing weights for longer contexts.
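The exact weights used in the evaluation are not published here; the sketch below only illustrates the shape of a geometrically increasing weighted average over context lengths, with a hypothetical ratio of 2 between consecutive lengths:

```python
def weighted_avg(scores_by_ctx, ratio=2.0):
    # scores_by_ctx: list of (context_length, score), shortest context first;
    # longer contexts receive geometrically larger weights
    weights = [ratio**i for i in range(len(scores_by_ctx))]
    total = sum(w * s for w, (_, s) in zip(weights, scores_by_ctx))
    return total / sum(weights)

scores = [(8_000, 80.0), (16_000, 78.0), (32_000, 75.0), (64_000, 70.0), (128_000, 65.0)]
print(round(weighted_avg(scores), 2))  # weighted toward the 64K/128K scores
```

With these made-up scores, the weighted average lands at 68.9 versus a plain mean of 73.6, reflecting the emphasis on long-context performance.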
+
+ The plot below shows performance averaged over context lengths from 8K to 128K.
+
+ **Note.** For the Qwen3-8B (non-thinking, from HF) model, we used YaRN to evaluate on long-context tasks, as directed in the [model card](https://huggingface.co/Qwen/Qwen3-8B).
+
+ <img src="https://github.com/awslabs/hybrid-model-factory/blob/main/assets/figures/long_context_results_8B_models_final.png" title="Long Context Results 8B Models" />
+
+ **How close are the Hybrid models to the Transformer baseline on long-context tasks?**
+ Primed GKA and GDN Hybrids are within ~1.5 points of Qwen3-8B (Long) on average, while being [1.5–2× faster at inference](#inference-efficiency). Primed BMOJO-F matches GKA/GDN in quality but is slower due to unfused SSM+SWA kernels ([details](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md)). Primed Mamba2 lags further behind (~3-point gap), consistent with GKA's and GDN's higher expressivity.
+
+ **Why SSM layers over Sliding Window Attention (SWA)?**
+ All Hybrid SSM models outperform the Hybrid SWA model (50% Attention + 50% SWA, window size 512). Even though SWA uses ~2× the effective state size of GKA at BF16, SSM layers retain information from the remote past, while SWA forgets everything beyond its window.
+
+ ### Short-Context NLP Benchmarks
+
+ Evaluations on Tulu3-dev from [OLMES](https://github.com/allenai/olmes). All tasks use short contexts (≤ 8K).
+ Each category in the table below averages the following Tulu3-dev subtasks:
+ 1. Math: GSM8K, MATH
+ 2. Knowledge: MMLU, PopQA, TruthfulQA
+ 3. Coding: HumanEval, HumanEval+
+ 4. Reasoning: BigBenchHard
+ 5. Instruction Following: IFEval
+
+ | Model | Math | Knowledge | Coding | Reasoning | Instruction Following | Average |
+ |---|---|---|---|---|---|---|
+ | Qwen3-8B [non-thinking, from HF] | 81.36 | 49.33 | 91.77 | 74.31 | 85.59 | 76.47 |
+ | Qwen3-8B [Long] | 64.56 | 49.75 | 91.00 | 76.27 | 74.49 | 71.21 |
+ | GKA-primed-HQwen3-8B-Instruct | 64.15 | 47.90 | 90.46 | 72.60 | 70.98 | 69.22 |
+ | GDN-primed-HQwen3-8B-Instruct | 59.54 | 48.41 | 91.18 | 72.97 | 73.57 | 69.13 |
+ | Mamba2-primed-HQwen3-8B-Instruct | 57.77 | 46.91 | 89.56 | 70.99 | 74.86 | 68.02 |
+ | BMOJOF-primed-HQwen3-8B-Instruct | 65.69 | 48.63 | 90.02 | 76.42 | 75.60 | 71.27 |
+
+ **How close are the Hybrid models to the Transformer baseline on short-context tasks?**
+ All Primed Hybrid models are within ~3 points of Qwen3-8B (Long), using [< 0.5% of the base Transformer's pre-training token budget](#training-data). Note that BMOJO-F [w/ GKA] fully matches the Transformer baseline but is slower to deploy (see above).
+
+ **Which SSM layer type performs best?**
+ Among the non-BMOJO-F Hybrids, GKA ranks first (~2-point gap with Qwen3-8B [Long]), followed by GDN, then Mamba2. This ranking correlates with the expressiveness order of their respective SSM layers.
+
+ ## About Mamba-2
+
+ Mamba-2 is a State-Space Model layer with diagonal transition dynamics and input-dependent gating. It processes sequences in linear time with constant memory, making it efficient for long-context inference. Mamba-2 is less expressive than Gated DeltaNet and Gated KalmaNet due to its diagonal structure, but benefits from well-optimized kernels.
+
+ For more details, see the [Mamba-2 paper](https://arxiv.org/abs/2405.21060).
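The constant-memory, linear-time behavior comes from the layer being a scan over a fixed-size state. The toy recurrence below is a deliberately simplified, hypothetical illustration (real Mamba-2 adds input-dependent gating, multiple heads, a depthwise convolution, and fused kernels); its only point is that the state `h` never grows with sequence length:

```python
import numpy as np

def diagonal_ssm(x, a, B, C):
    """x: (T,) inputs, a: (N,) per-channel decay, B/C: (T, N) input projections."""
    h = np.zeros_like(a, dtype=float)
    ys = []
    for t in range(len(x)):       # single left-to-right pass: O(T*N) compute, O(N) memory
        h = a * h + B[t] * x[t]   # h_t = a * h_{t-1} + B_t x_t (elementwise, diagonal A)
        ys.append(C[t] @ h)       # y_t = <C_t, h_t>
    return np.array(ys)

# Sanity check: with a = 1 and B = C = 1, each output is N times the running sum of x.
y = diagonal_ssm(np.ones(3), np.ones(2), np.ones((3, 2)), np.ones((3, 2)))
print(y)  # [2. 4. 6.]
```

Contrast this with Attention, where producing token `t` requires reading all `t` cached keys and values.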
+
+ ### Architecture Details
+
+ | Component | Details |
+ |-----------|---------|
+ | Number of Layers | 36 (18 Attention + 18 Mamba-2) |
+ | Hidden Dimension | 4096 |
+ | Attention Heads | 32 (Q) / 8 (KV) |
+ | Head Dimension | 128 |
+ | Intermediate Dimension (FFN) | 12288 |
+ | Vocabulary Size | 151,936 |
+ | Position Encoding | RoPE (θ = 5,000,000) |
+ | Layer Layout | Mamba-2 layer indices were selected with our [*selective hybridization*](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/LayerSelection.md) procedure |
+
+ ### Inference Efficiency
+
+ Sustained decode throughput (tokens/s) on 8× H200 GPUs (TP=8), measured during pure decode with a saturated KV cache. Benchmarked with random data (no prefix-caching benefits). See the full [Inference guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md#performance-benchmarks) for methodology and additional models.
+
+ | Model | 16K | 32K | 64K | 128K |
+ |---|---|---|---|---|
+ | Mamba2-primed-HQwen3-8B-Instruct | 16,844 (1.88×) | 9,966 (1.93×) | 5,460 (1.99×) | 2,825 (2.30×) |
+ | GKA-primed-HQwen3-8B | 15,892 (1.78×) | 9,159 (1.77×) | 5,173 (1.89×) | 2,736 (2.23×) |
+ | GDN-primed-HQwen3-8B | 17,479 (1.95×) | 10,080 (1.95×) | 5,521 (2.01×) | 2,863 (2.33×) |
+ | BMOJOF-primed-HQwen3-8B | 7,854 (0.88×) | 5,597 (1.08×) | 3,573 (1.30×) | 2,153 (1.75×) |
+ | Qwen3-8B (Long) | 8,951 | 5,174 | 2,740 | 1,227 |
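The multipliers in parentheses are simply each row's tokens/s divided by the Qwen3-8B (Long) figure at the same context length. For the Mamba2 row:

```python
# Reproduce the Mamba2 row's speedup multipliers from the raw tokens/s figures.
baseline = {"16K": 8_951, "32K": 5_174, "64K": 2_740, "128K": 1_227}  # Qwen3-8B (Long)
mamba2 = {"16K": 16_844, "32K": 9_966, "64K": 5_460, "128K": 2_825}
speedups = {ctx: round(mamba2[ctx] / baseline[ctx], 2) for ctx in baseline}
print(speedups)  # {'16K': 1.88, '32K': 1.93, '64K': 1.99, '128K': 2.3}
```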
+
+ Mean TTFT at the Transformer's saturated batch size (the Hybrid model has memory to spare):
+
+ | Model | 16K | 32K | 64K | 128K |
+ |---|---|---|---|---|
+ | Mamba2-primed-HQwen3-8B-Instruct | 28,668 ms (1.03×) | 31,405 ms (0.96×) | 36,666 ms (0.86×) | 46,618 ms (0.74×) |
+ | GKA-primed-HQwen3-8B | 35,013 ms (1.26×) | 38,502 ms (1.18×) | 44,893 ms (1.06×) | 53,606 ms (0.85×) |
+ | GDN-primed-HQwen3-8B | 27,805 ms (1.00×) | 30,975 ms (0.95×) | 36,151 ms (0.85×) | 46,389 ms (0.74×) |
+ | BMOJOF-primed-HQwen3-8B | 44,763 ms (1.61×) | 47,600 ms (1.46×) | 52,272 ms (1.23×) | 61,702 ms (0.98×) |
+ | Qwen3-8B (Long) | 27,736 ms | 32,661 ms | 42,462 ms | 62,922 ms |
+
+ The decode throughput advantage grows with context length, from 1.88× at 16K to 2.30× at 128K, thanks to the Mamba2 layers maintaining a fixed-size recurrent state instead of a growing KV cache. TTFT crosses over at 32K and reaches 0.74× (26% faster) at 128K.
+
+ ## Usage
+
+ ### With vLLM (recommended)
+
+ Install the [Hybrid Model Factory vLLM plugin](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md#docker-recommended) in your local environment, then serve:
+
+ ```bash
+ vllm serve amazon/Mamba2-primed-HQwen3-8B-Instruct \
+     --enable-prefix-caching \
+     --mamba-cache-mode align \
+     --mamba-cache-dtype float32 \
+     --mamba-ssm-cache-dtype float32
+ ```
+
+ Query the server:
+
+ ```bash
+ curl http://localhost:8000/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{
+         "model": "amazon/Mamba2-primed-HQwen3-8B-Instruct",
+         "messages": [
+             {"role": "system", "content": "You are a helpful assistant."},
+             {"role": "user", "content": "What is Linear Attention in the context of LLMs?"}
+         ]
+     }'
+ ```
+
+ > **Tip:** The `--mamba-cache-dtype float32` and `--mamba-ssm-cache-dtype float32` flags are important for accurate long-context generation. See the [Inference guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md#recommended-flags-for-hybrid-models) for details on all recommended flags.
+
+ ### With HuggingFace Transformers
+
+ See the [Inference guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/Inference.md#recommended-flags-for-hybrid-models) for details on when we recommend the HuggingFace Transformers implementation over the highly optimized vLLM one.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import hmf.model.hybrid_zoo.models.model_register  # Register Hybrid models
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "amazon/Mamba2-primed-HQwen3-8B-Instruct", trust_remote_code=True
+ ).to("cuda")
+ tokenizer = AutoTokenizer.from_pretrained("amazon/Mamba2-primed-HQwen3-8B-Instruct")
+
+ messages = [{"role": "user", "content": "What is linear attention in the context of LLMs?"}]
+ prompt = tokenizer.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+ outputs = model.generate(**inputs, max_new_tokens=256)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ### Training-Free Context Extension
+
+ This model supports training-free context extension to 2–4× its native context via an extension of [PICASO cache composition](https://arxiv.org/abs/2502.17605) to Hybrid models. See the [State Composition guide](https://github.com/awslabs/hybrid-model-factory/blob/main/docs/StateComposition.md) for usage. Note: this is currently supported only in the HuggingFace Transformers implementation.
+
+ ## Training data
+
+ These models were produced through the multi-stage Priming pipeline from [Hybrid Model Factory](https://github.com/awslabs/hybrid-model-factory). Training data spans web documents, mathematics, long-context documents, and instruction-following and reasoning examples, each targeting a different capability axis. This diversity is critical: it allows the Priming procedure to convert a base Transformer into a more memory- and compute-efficient Hybrid architecture at nearly the same level of performance, using < 0.5% of the base Transformer's pre-training token budget.
+
+ ## Responsible AI Considerations
+
+ At Amazon, we are committed to developing AI responsibly and take a people-centric approach that prioritizes education, science, and our customers, to integrate responsible AI across the end-to-end AI lifecycle. We believe the use of AI must respect the rule of law and human rights, and we encourage the safe and responsible development of AI. When downloaded or used in accordance with the [AWS Responsible AI Policy](https://aws.amazon.com/ai/responsible-ai/policy/), developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+ Please report model quality, risk, security vulnerabilities, or Amazon AI concerns [here](https://pages.awscloud.com/global-ln-gc-400-ai-service-cards-contact-us-registration.html).
+
+ ## Citation
+
+ ```bibtex
+ @software{hybrid_model_factory,
+   title = {Hybrid Model Factory},
+   year = {2026},
+   url = {https://github.com/awslabs/hybrid-model-factory}
+ }
+
+ @misc{dao2024transformersssmsgeneralizedmodels,
+   title = {Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality},
+   author = {Tri Dao and Albert Gu},
+   year = {2024},
+   eprint = {2405.21060},
+   archivePrefix = {arXiv},
+   primaryClass = {cs.LG},
+   url = {https://arxiv.org/abs/2405.21060}
+ }
+ ```
+
+ ## License
+
+ This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
chat_template.jinja ADDED
@@ -0,0 +1,87 @@
+ {%- if tools %}
+ {{- '<|im_start|>system\n' }}
+ {%- if messages[0].role == 'system' %}
+ {{- messages[0].content + '\n\n' }}
+ {%- endif %}
+ {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+ {%- for tool in tools %}
+ {{- "\n" }}
+ {{- tool | tojson }}
+ {%- endfor %}
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+ {%- else %}
+ {%- if messages[0].role == 'system' %}
+ {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
+ {%- endif %}
+ {%- endif %}
+ {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+ {%- for message in messages[::-1] %}
+ {%- set index = (messages|length - 1) - loop.index0 %}
+ {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
+ {%- set ns.multi_step_tool = false %}
+ {%- set ns.last_query_index = index %}
+ {%- endif %}
+ {%- endfor %}
+ {%- for message in messages %}
+ {%- if message.content is string %}
+ {%- set content = message.content %}
+ {%- else %}
+ {%- set content = '' %}
+ {%- endif %}
+ {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
+ {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
+ {%- elif message.role == "assistant" %}
+ {%- set reasoning_content = '' %}
+ {%- if message.reasoning_content is string %}
+ {%- set reasoning_content = message.reasoning_content %}
+ {%- else %}
+ {%- if '</think>' in content %}
+ {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
+ {%- set content = content.split('</think>')[-1].lstrip('\n') %}
+ {%- endif %}
+ {%- endif %}
+ {%- if loop.index0 > ns.last_query_index %}
+ {%- if loop.last or (not loop.last and reasoning_content) %}
+ {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
+ {%- else %}
+ {{- '<|im_start|>' + message.role + '\n' + content }}
+ {%- endif %}
+ {%- else %}
+ {{- '<|im_start|>' + message.role + '\n' + content }}
+ {%- endif %}
+ {%- if message.tool_calls %}
+ {%- for tool_call in message.tool_calls %}
+ {%- if (loop.first and content) or (not loop.first) %}
+ {{- '\n' }}
+ {%- endif %}
+ {%- if tool_call.function %}
+ {%- set tool_call = tool_call.function %}
+ {%- endif %}
+ {{- '<tool_call>\n{"name": "' }}
+ {{- tool_call.name }}
+ {{- '", "arguments": ' }}
+ {%- if tool_call.arguments is string %}
+ {{- tool_call.arguments }}
+ {%- else %}
+ {{- tool_call.arguments | tojson }}
+ {%- endif %}
+ {{- '}\n</tool_call>' }}
+ {%- endfor %}
+ {%- endif %}
+ {{- '<|im_end|>\n' }}
+ {%- elif message.role == "tool" %}
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+ {{- '<|im_start|>user' }}
+ {%- endif %}
+ {{- '\n<tool_response>\n' }}
+ {{- content }}
+ {{- '\n</tool_response>' }}
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+ {{- '<|im_end|>\n' }}
+ {%- endif %}
+ {%- endif %}
+ {%- endfor %}
+ {%- if add_generation_prompt %}
+ {{- '<|im_start|>assistant\n' }}
+ {{- '<think>\n\n</think>\n\n' }}
+ {%- endif %}
config.json ADDED
@@ -0,0 +1,88 @@
+ {
+   "architectures": [
+     "HybridQwen3ForCausalLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 151643,
+   "dtype": "bfloat16",
+   "eos_token_id": 151645,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_size": 4096,
+   "hybrid_override_pattern": "*-M2-M2-M2-M2-M2-M2-*-*-M2-M2-M2-*-*-*-*-*-*-*-*-*-*-*-*-*-*-M2-M2-M2-*-M2-M2-M2-M2-M2-M2",
+   "initializer_range": 0.02,
+   "intermediate_size": 12288,
+   "layer_types": [
+     "*",
+     "M2",
+     "M2",
+     "M2",
+     "M2",
+     "M2",
+     "M2",
+     "*",
+     "*",
+     "M2",
+     "M2",
+     "M2",
+     "*",
+     "*",
+     "*",
+     "*",
+     "*",
+     "*",
+     "*",
+     "*",
+     "*",
+     "*",
+     "*",
+     "*",
+     "*",
+     "*",
+     "M2",
+     "M2",
+     "M2",
+     "*",
+     "M2",
+     "M2",
+     "M2",
+     "M2",
+     "M2",
+     "M2"
+   ],
+   "mamba2_config": {
+     "d_inner": 4096,
+     "d_model": 4096,
+     "d_xb": 1024,
+     "hidden_act": "silu",
+     "intermediate_size": 12288,
+     "n_layer": 36,
+     "rms_norm_eps": 1e-06,
+     "ssm_cfg": {
+       "d_state": 128,
+       "expand": 1,
+       "ngroups": 32
+     },
+     "use_pos_emb": false,
+     "use_qk_norm": true
+   },
+   "max_position_embeddings": 131072,
+   "max_window_layers": 36,
+   "model_type": "hybrid_qwen3",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 36,
+   "num_key_value_heads": 8,
+   "pad_token_id": null,
+   "rms_norm_eps": 1e-06,
+   "rope_parameters": {
+     "rope_theta": 5000000,
+     "rope_type": "default"
+   },
+   "sliding_window": null,
+   "tie_word_embeddings": false,
+   "transformers_version": "5.3.0",
+   "use_cache": false,
+   "use_sliding_window": false,
+   "vocab_size": 151936
+ }
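The `hybrid_override_pattern` string encodes the layer layout: `*` marks an Attention layer and `M2` a Mamba-2 layer, separated by `-`. A quick check that it matches the advertised 50% ratio (18 + 18 out of 36):

```python
# Parse the layer-layout pattern from config.json and count layer types.
pattern = ("*-M2-M2-M2-M2-M2-M2-*-*-M2-M2-M2-*-*-*-*-*-*-*-*-*-*-*-*-*-*-"
           "M2-M2-M2-*-M2-M2-M2-M2-M2-M2")
layers = pattern.split("-")
print(len(layers), layers.count("M2"), layers.count("*"))  # 36 18 18
```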
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 151643,
+   "eos_token_id": 151645,
+   "transformers_version": "4.51.3"
+ }
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3d599114db34a8209c3a4db4131abab5f022729e05e7f20272b0d5844db7581d
+ size 4921044208
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b46584cc12533ae662dc38547db55f0329c0575e2acd47188a86987d0c612b51
+ size 4916923920
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:94efbe687ff50d4c3114b617279674ae9b86d51ff8b9a557e53c640cb5c17c17
+ size 4951506888
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:509f87004edbb1ebdb914a344edcaf25f0fa0c4fefd780736f6a806faadee9fb
+ size 2202003496
model.safetensors.index.json ADDED
@@ -0,0 +1,478 @@
+ {
+   "metadata": {
+     "total_size": 16991425920
+   },
+   "weight_map": {
+     "lm_head.weight": "model-00004-of-00004.safetensors",
+     "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.k_norm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.q_norm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mamba.A_log": "model-00001-of-00004.safetensors",
+     "model.layers.1.mamba.B_norm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mamba.C_norm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mamba.D": "model-00001-of-00004.safetensors",
+     "model.layers.1.mamba.conv1d.bias": "model-00001-of-00004.safetensors",
+     "model.layers.1.mamba.conv1d.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mamba.dt_bias": "model-00001-of-00004.safetensors",
+     "model.layers.1.mamba.in_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mamba.norm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mamba.out_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+     "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+     "model.layers.10.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.mamba.A_log": "model-00002-of-00004.safetensors",
+     "model.layers.10.mamba.B_norm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.mamba.C_norm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.mamba.D": "model-00002-of-00004.safetensors",
+     "model.layers.10.mamba.conv1d.bias": "model-00002-of-00004.safetensors",
+     "model.layers.10.mamba.conv1d.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.mamba.dt_bias": "model-00002-of-00004.safetensors",
+     "model.layers.10.mamba.in_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.mamba.norm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.mamba.out_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.10.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mamba.A_log": "model-00002-of-00004.safetensors",
+     "model.layers.11.mamba.B_norm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mamba.C_norm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mamba.D": "model-00002-of-00004.safetensors",
+     "model.layers.11.mamba.conv1d.bias": "model-00002-of-00004.safetensors",
+     "model.layers.11.mamba.conv1d.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mamba.dt_bias": "model-00002-of-00004.safetensors",
+     "model.layers.11.mamba.in_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mamba.norm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mamba.out_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.13.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+     "model.layers.15.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
103
+ "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
104
+ "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
105
+ "model.layers.15.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
106
+ "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
107
+ "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
108
+ "model.layers.16.input_layernorm.weight": "model-00002-of-00004.safetensors",
109
+ "model.layers.16.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
110
+ "model.layers.16.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
111
+ "model.layers.16.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
112
+ "model.layers.16.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
113
+ "model.layers.16.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
114
+ "model.layers.16.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
115
+ "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
116
+ "model.layers.16.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
117
+ "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
118
+ "model.layers.16.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
119
+ "model.layers.17.input_layernorm.weight": "model-00002-of-00004.safetensors",
120
+ "model.layers.17.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
121
+ "model.layers.17.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
122
+ "model.layers.17.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
123
+ "model.layers.17.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
124
+ "model.layers.17.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
125
+ "model.layers.17.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
126
+ "model.layers.17.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
127
+ "model.layers.17.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
128
+ "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
129
+ "model.layers.17.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
130
+ "model.layers.18.input_layernorm.weight": "model-00002-of-00004.safetensors",
131
+ "model.layers.18.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
132
+ "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
133
+ "model.layers.18.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
134
+ "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
135
+ "model.layers.18.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
136
+ "model.layers.18.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
137
+ "model.layers.18.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
138
+ "model.layers.18.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
139
+ "model.layers.18.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
140
+ "model.layers.18.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
141
+ "model.layers.19.input_layernorm.weight": "model-00002-of-00004.safetensors",
142
+ "model.layers.19.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
143
+ "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
144
+ "model.layers.19.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
145
+ "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
146
+ "model.layers.19.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
147
+ "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
148
+ "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
149
+ "model.layers.19.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
150
+ "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
151
+ "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
152
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.2.mamba.A_log": "model-00001-of-00004.safetensors",
+ "model.layers.2.mamba.B_norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.2.mamba.C_norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.2.mamba.D": "model-00001-of-00004.safetensors",
+ "model.layers.2.mamba.conv1d.bias": "model-00001-of-00004.safetensors",
+ "model.layers.2.mamba.conv1d.weight": "model-00001-of-00004.safetensors",
+ "model.layers.2.mamba.dt_bias": "model-00001-of-00004.safetensors",
+ "model.layers.2.mamba.in_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.2.mamba.norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.2.mamba.out_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.20.input_layernorm.weight": "model-00002-of-00004.safetensors",
+ "model.layers.20.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.20.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+ "model.layers.20.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
+ "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.20.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
+ "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.21.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.21.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.21.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.21.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
+ "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.21.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
+ "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
189
+ "model.layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.22.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.22.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.22.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.22.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.22.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.22.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.23.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.23.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.23.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.23.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.23.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.23.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.23.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.23.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.23.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.24.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.24.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.24.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.24.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.24.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.24.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.24.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.24.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.24.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.24.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.25.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.25.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.25.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.25.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.25.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.25.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.25.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.25.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.25.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.25.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.25.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.26.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.26.mamba.A_log": "model-00003-of-00004.safetensors",
+ "model.layers.26.mamba.B_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.26.mamba.C_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.26.mamba.D": "model-00003-of-00004.safetensors",
+ "model.layers.26.mamba.conv1d.bias": "model-00003-of-00004.safetensors",
+ "model.layers.26.mamba.conv1d.weight": "model-00003-of-00004.safetensors",
+ "model.layers.26.mamba.dt_bias": "model-00003-of-00004.safetensors",
+ "model.layers.26.mamba.in_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.26.mamba.norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.26.mamba.out_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.26.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.26.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.26.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.26.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.27.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.27.mamba.A_log": "model-00003-of-00004.safetensors",
+ "model.layers.27.mamba.B_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.27.mamba.C_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.27.mamba.D": "model-00003-of-00004.safetensors",
+ "model.layers.27.mamba.conv1d.bias": "model-00003-of-00004.safetensors",
+ "model.layers.27.mamba.conv1d.weight": "model-00003-of-00004.safetensors",
+ "model.layers.27.mamba.dt_bias": "model-00003-of-00004.safetensors",
+ "model.layers.27.mamba.in_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.27.mamba.norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.27.mamba.out_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.27.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.27.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.27.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.27.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.28.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.28.mamba.A_log": "model-00003-of-00004.safetensors",
+ "model.layers.28.mamba.B_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.28.mamba.C_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.28.mamba.D": "model-00003-of-00004.safetensors",
+ "model.layers.28.mamba.conv1d.bias": "model-00003-of-00004.safetensors",
+ "model.layers.28.mamba.conv1d.weight": "model-00003-of-00004.safetensors",
+ "model.layers.28.mamba.dt_bias": "model-00003-of-00004.safetensors",
+ "model.layers.28.mamba.in_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.28.mamba.norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.28.mamba.out_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.28.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.28.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.28.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.28.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.29.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.29.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.29.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.29.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.29.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.29.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.29.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.29.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.29.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.29.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.29.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
289
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.3.mamba.A_log": "model-00001-of-00004.safetensors",
+ "model.layers.3.mamba.B_norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.3.mamba.C_norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.3.mamba.D": "model-00001-of-00004.safetensors",
+ "model.layers.3.mamba.conv1d.bias": "model-00001-of-00004.safetensors",
+ "model.layers.3.mamba.conv1d.weight": "model-00001-of-00004.safetensors",
+ "model.layers.3.mamba.dt_bias": "model-00001-of-00004.safetensors",
+ "model.layers.3.mamba.in_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.3.mamba.norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.3.mamba.out_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.30.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.30.mamba.A_log": "model-00003-of-00004.safetensors",
+ "model.layers.30.mamba.B_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.30.mamba.C_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.30.mamba.D": "model-00003-of-00004.safetensors",
+ "model.layers.30.mamba.conv1d.bias": "model-00003-of-00004.safetensors",
+ "model.layers.30.mamba.conv1d.weight": "model-00003-of-00004.safetensors",
+ "model.layers.30.mamba.dt_bias": "model-00003-of-00004.safetensors",
+ "model.layers.30.mamba.in_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.30.mamba.norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.30.mamba.out_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.30.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.30.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.30.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.30.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.31.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.31.mamba.A_log": "model-00003-of-00004.safetensors",
+ "model.layers.31.mamba.B_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.31.mamba.C_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.31.mamba.D": "model-00003-of-00004.safetensors",
+ "model.layers.31.mamba.conv1d.bias": "model-00003-of-00004.safetensors",
+ "model.layers.31.mamba.conv1d.weight": "model-00003-of-00004.safetensors",
+ "model.layers.31.mamba.dt_bias": "model-00003-of-00004.safetensors",
+ "model.layers.31.mamba.in_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.31.mamba.norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.31.mamba.out_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.31.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.31.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.31.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.31.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.32.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.32.mamba.A_log": "model-00003-of-00004.safetensors",
+ "model.layers.32.mamba.B_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.32.mamba.C_norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.32.mamba.D": "model-00003-of-00004.safetensors",
+ "model.layers.32.mamba.conv1d.bias": "model-00003-of-00004.safetensors",
+ "model.layers.32.mamba.conv1d.weight": "model-00003-of-00004.safetensors",
+ "model.layers.32.mamba.dt_bias": "model-00003-of-00004.safetensors",
+ "model.layers.32.mamba.in_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.32.mamba.norm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.32.mamba.out_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.32.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.32.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.32.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.32.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.33.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "model.layers.33.mamba.A_log": "model-00003-of-00004.safetensors",
+ "model.layers.33.mamba.B_norm.weight": "model-00004-of-00004.safetensors",
+ "model.layers.33.mamba.C_norm.weight": "model-00004-of-00004.safetensors",
+ "model.layers.33.mamba.D": "model-00003-of-00004.safetensors",
+ "model.layers.33.mamba.conv1d.bias": "model-00004-of-00004.safetensors",
+ "model.layers.33.mamba.conv1d.weight": "model-00004-of-00004.safetensors",
+ "model.layers.33.mamba.dt_bias": "model-00003-of-00004.safetensors",
+ "model.layers.33.mamba.in_proj.weight": "model-00004-of-00004.safetensors",
+ "model.layers.33.mamba.norm.weight": "model-00004-of-00004.safetensors",
+ "model.layers.33.mamba.out_proj.weight": "model-00004-of-00004.safetensors",
+ "model.layers.33.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.33.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.33.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "model.layers.33.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
364
+ "model.layers.34.input_layernorm.weight": "model-00004-of-00004.safetensors",
+ "model.layers.34.mamba.A_log": "model-00004-of-00004.safetensors",
+ "model.layers.34.mamba.B_norm.weight": "model-00004-of-00004.safetensors",
+ "model.layers.34.mamba.C_norm.weight": "model-00004-of-00004.safetensors",
+ "model.layers.34.mamba.D": "model-00004-of-00004.safetensors",
+ "model.layers.34.mamba.conv1d.bias": "model-00004-of-00004.safetensors",
+ "model.layers.34.mamba.conv1d.weight": "model-00004-of-00004.safetensors",
+ "model.layers.34.mamba.dt_bias": "model-00004-of-00004.safetensors",
+ "model.layers.34.mamba.in_proj.weight": "model-00004-of-00004.safetensors",
+ "model.layers.34.mamba.norm.weight": "model-00004-of-00004.safetensors",
+ "model.layers.34.mamba.out_proj.weight": "model-00004-of-00004.safetensors",
+ "model.layers.34.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+ "model.layers.34.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+ "model.layers.34.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+ "model.layers.34.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+ "model.layers.35.input_layernorm.weight": "model-00004-of-00004.safetensors",
+ "model.layers.35.mamba.A_log": "model-00004-of-00004.safetensors",
+ "model.layers.35.mamba.B_norm.weight": "model-00004-of-00004.safetensors",
+ "model.layers.35.mamba.C_norm.weight": "model-00004-of-00004.safetensors",
+ "model.layers.35.mamba.D": "model-00004-of-00004.safetensors",
+ "model.layers.35.mamba.conv1d.bias": "model-00004-of-00004.safetensors",
+ "model.layers.35.mamba.conv1d.weight": "model-00004-of-00004.safetensors",
+ "model.layers.35.mamba.dt_bias": "model-00004-of-00004.safetensors",
+ "model.layers.35.mamba.in_proj.weight": "model-00004-of-00004.safetensors",
+ "model.layers.35.mamba.norm.weight": "model-00004-of-00004.safetensors",
+ "model.layers.35.mamba.out_proj.weight": "model-00004-of-00004.safetensors",
+ "model.layers.35.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+ "model.layers.35.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
+ "model.layers.35.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
+ "model.layers.35.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.4.mamba.A_log": "model-00001-of-00004.safetensors",
+ "model.layers.4.mamba.B_norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.4.mamba.C_norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.4.mamba.D": "model-00001-of-00004.safetensors",
+ "model.layers.4.mamba.conv1d.bias": "model-00001-of-00004.safetensors",
+ "model.layers.4.mamba.conv1d.weight": "model-00001-of-00004.safetensors",
+ "model.layers.4.mamba.dt_bias": "model-00001-of-00004.safetensors",
+ "model.layers.4.mamba.in_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.4.mamba.norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.4.mamba.out_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.5.mamba.A_log": "model-00001-of-00004.safetensors",
+ "model.layers.5.mamba.B_norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.5.mamba.C_norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.5.mamba.D": "model-00001-of-00004.safetensors",
+ "model.layers.5.mamba.conv1d.bias": "model-00001-of-00004.safetensors",
+ "model.layers.5.mamba.conv1d.weight": "model-00001-of-00004.safetensors",
+ "model.layers.5.mamba.dt_bias": "model-00001-of-00004.safetensors",
+ "model.layers.5.mamba.in_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.5.mamba.norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.5.mamba.out_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
424
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.6.mamba.A_log": "model-00001-of-00004.safetensors",
+ "model.layers.6.mamba.B_norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.6.mamba.C_norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.6.mamba.D": "model-00001-of-00004.safetensors",
+ "model.layers.6.mamba.conv1d.bias": "model-00001-of-00004.safetensors",
+ "model.layers.6.mamba.conv1d.weight": "model-00001-of-00004.safetensors",
+ "model.layers.6.mamba.dt_bias": "model-00001-of-00004.safetensors",
+ "model.layers.6.mamba.in_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.6.mamba.norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.6.mamba.out_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.7.self_attn.k_norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.7.self_attn.q_norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.8.self_attn.k_norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.8.self_attn.q_norm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "model.layers.9.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.layers.9.mamba.A_log": "model-00002-of-00004.safetensors",
+ "model.layers.9.mamba.B_norm.weight": "model-00002-of-00004.safetensors",
+ "model.layers.9.mamba.C_norm.weight": "model-00002-of-00004.safetensors",
+ "model.layers.9.mamba.D": "model-00002-of-00004.safetensors",
+ "model.layers.9.mamba.conv1d.bias": "model-00002-of-00004.safetensors",
+ "model.layers.9.mamba.conv1d.weight": "model-00002-of-00004.safetensors",
+ "model.layers.9.mamba.dt_bias": "model-00002-of-00004.safetensors",
+ "model.layers.9.mamba.in_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.9.mamba.norm.weight": "model-00002-of-00004.safetensors",
+ "model.layers.9.mamba.out_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.9.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.9.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+ "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+ "model.norm.weight": "model-00004-of-00004.safetensors"
+ }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:be75606093db2094d7cd20f3c2f385c212750648bd6ea4fb2bf507a6a4c55506
+ size 11422650
tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
+ {
+ "add_prefix_space": false,
+ "backend": "tokenizers",
+ "bos_token": null,
+ "clean_up_tokenization_spaces": false,
+ "eos_token": "<|im_end|>",
+ "errors": "replace",
+ "is_local": true,
+ "model_max_length": 131072,
+ "pad_token": "<|endoftext|>",
+ "padding_side": "right",
+ "split_special_tokens": false,
+ "tokenizer_class": "Qwen2Tokenizer",
+ "unk_token": null
+ }