Text Generation · Transformers · PyTorch · English · llama · text-generation-inference

Jashan887 committed · verified
Commit 307d30b · 1 Parent(s): 71ee21f

Upload folder using huggingface_hub
README.md ADDED
@@ -0,0 +1,150 @@
---
license: llama2
language:
- en
library_name: transformers
datasets:
- togethercomputer/llama-instruct
---

# Llama-2-7B-32K-Instruct

## Model Description

Llama-2-7B-32K-Instruct is an open-source, long-context chat model finetuned from [Llama-2-7B-32K](https://huggingface.co/togethercomputer/Llama-2-7B-32K) on high-quality instruction and chat data.
We built Llama-2-7B-32K-Instruct with less than 200 lines of Python using the [Together API](https://together.ai/blog/api-announcement), and we also make the [recipe fully available](https://github.com/togethercomputer/Llama-2-7B-32K-Instruct).
We hope this enables everyone to finetune their own version of [Llama-2-7B-32K](https://huggingface.co/togethercomputer/Llama-2-7B-32K) — play with the [Together API](https://together.ai/blog/api-announcement) and give us feedback!

## Data Collection Details

Llama-2-7B-32K-Instruct is fine-tuned over a combination of two parts:
1. **19K single- and multi-round conversations generated by human instructions and [Llama-2-70B-Chat](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) outputs**.
We collected the dataset following the distillation paradigm used by Alpaca, Vicuna, WizardLM, and Orca — producing instructions by querying a powerful LLM (in this case, [Llama-2-70B-Chat](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)).
The complete dataset is also released [here](https://huggingface.co/datasets/togethercomputer/llama-instruct).
We also share the complete recipe for the data collection process [here](https://github.com/togethercomputer/Llama-2-7B-32K-Instruct).

2. **Long-context Summarization and Long-context QA**.
We follow the recipe of [Llama-2-7B-32K](https://together.ai/blog/Llama-2-7B-32K) and train our model with the [BookSum dataset](https://huggingface.co/datasets/togethercomputer/Long-Data-Collections) and [Multi-document Question Answering](https://arxiv.org/abs/2307.03172).

The final data mixture used for model finetuning is: 19K instruction (50%) + BookSum (25%) + MQA (25%).

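As an illustration only (this is not the released training recipe), an example-level 50/25/25 mixture could be drawn roughly as follows; the three pools stand in for the instruction, BookSum, and MQA data described above:

```python
import random

# Placeholder pools standing in for the three data sources; sizes are illustrative only.
instruct_pool = [{"source": "llama-instruct", "id": i} for i in range(19_000)]
booksum_pool = [{"source": "booksum", "id": i} for i in range(5_000)]
mqa_pool = [{"source": "mqa", "id": i} for i in range(5_000)]

pools = [instruct_pool, booksum_pool, mqa_pool]
weights = [0.50, 0.25, 0.25]  # 19K instruction (50%) + BookSum (25%) + MQA (25%)

random.seed(0)
# Draw a mixed training stream that matches the target proportions in expectation.
mixture = [random.choice(pools[i]) for i in random.choices(range(3), weights=weights, k=10_000)]
```

The same proportions can also be obtained with `datasets.interleave_datasets(..., probabilities=[0.5, 0.25, 0.25])` when working with Hugging Face datasets.
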
## Model Usage

We encourage you to try out this model using the [Together API](https://together.ai/blog/api-announcement). The updated inference stack allows for efficient inference.

To run the model locally, we strongly recommend installing Flash Attention V2, which is necessary to obtain the best performance:
```
# Please update the path of `CUDA_HOME`
export CUDA_HOME=/usr/local/cuda-11.8
pip install transformers==4.31.0
pip install sentencepiece
pip install ninja
pip install flash-attn --no-build-isolation
pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
```
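
If the kernels built correctly, a quick check such as the following (a small sketch, not part of the original card) should import cleanly; these are the same modules that `modeling_flash_llama.py` pulls in:

```python
# Sanity check: confirm the Flash Attention v2 and rotary kernels used by
# modeling_flash_llama.py are importable after the install steps above.
import flash_attn
from flash_attn.flash_attn_interface import flash_attn_kvpacked_func
from flash_attn.layers.rotary import apply_rotary_emb_func

print("flash-attn version:", flash_attn.__version__)
```
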
You can load the model directly from the Hugging Face model hub using:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-2-7B-32K-Instruct")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/Llama-2-7B-32K-Instruct",
                                             trust_remote_code=True, torch_dtype=torch.float16)
input_ids = tokenizer.encode("[INST]\nWrite a poem about cats\n[/INST]\n\n", return_tensors="pt")
output = model.generate(input_ids, max_length=128, do_sample=True,
                        temperature=0.7, repetition_penalty=1.1, top_p=0.7, top_k=50)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
```
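
On a CUDA machine you will usually also want the fp16 weights and the inputs on the GPU (the Flash Attention kernels used by the bundled `modeling_flash_llama.py` only run on CUDA); a small, assumed addition to the block above:

```python
# Move the model and the prompt to the GPU for fp16 inference (assumes a CUDA device is available).
model = model.cuda().eval()
input_ids = input_ids.cuda()
output = model.generate(input_ids, max_length=128, do_sample=True,
                        temperature=0.7, repetition_penalty=1.1, top_p=0.7, top_k=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```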

The model is also hosted on [Together Playground](https://api.together.xyz/playground). You can simply play with the model by using a prompt formatted as:

```
[INST]\n<your instruction here>\n[/INST]\n\n
```
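
As a small convenience (not part of the original recipe), the snippet below wraps an instruction in this template, reuses the `tokenizer` and `model` loaded earlier, and decodes only the newly generated tokens:

```python
# Hypothetical helper: wrap an instruction in the [INST] ... [/INST] template shown above.
def format_prompt(instruction: str) -> str:
    return f"[INST]\n{instruction}\n[/INST]\n\n"

input_ids = tokenizer.encode(format_prompt("Write a poem about cats"),
                             return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=True,
                        temperature=0.7, repetition_penalty=1.1, top_p=0.7, top_k=50)
# Strip the prompt tokens so only the completion is printed.
completion = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print(completion)
```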

For example, if we query the model with

```
[INST]\nWrite a poem about cats\n[/INST]\n\n
```

the model will return
```
[INST]
Write a poem about cats
[/INST]

Cats are mysterious creatures, with their aloof and independent nature. They're also incredibly beautiful, with their sleek fur and piercing eyes. Here's a poem that captures the essence of cats:

Cats, oh cats, how can I describe you?
Your beauty is beyond compare, it seems.
You're graceful and elegant, like a ballerina's dance,
But don't let your charm fool you, for you're not easily tamed.

With your soft purring and playful meows,
You draw us in with your enchanting powers.
We love to watch you play, your tail twirling 'round,
As if you're dancing on air, with no sound.

But don't be fooled by your sweetness, my friend,
For beneath that gentle exterior, lies a fierce defender.
When danger lurks, you'll spring into action,
Protecting those you hold dear, without question.

So let us admire you, from afar,
For in your own way, you're truly unique, a star.
And though we may never fully understand,
The depths of your soul, we'll always stand, hand in paw, as one.

This poem captures the essence of cats, highlighting their beauty, independence, and protective nature. It also celebrates the special bond between humans and cats, recognizing their unique qualities and the joy they bring to our lives.
```

## Model Evaluation

We evaluate the model from three aspects: 1) [Alpaca Eval](https://tatsu-lab.github.io/alpaca_eval/);
2) [Rouge score over BookSum](https://together.ai/blog/Llama-2-7B-32K); and
3) [Accuracy over Multi-document Question Answering (MQA)](https://together.ai/blog/Llama-2-7B-32K).
We compare with models including
[GPT-3.5-Turbo-16K](https://platform.openai.com/docs/models/gpt-3-5),
[Llama-2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf),
[Longchat-7b-16k](https://huggingface.co/lmsys/longchat-7b-16k),
and [Longchat-7b-v1.5-32k](https://huggingface.co/lmsys/longchat-7b-v1.5-32k).
We summarize the results below:

* Alpaca Eval

| Model | win_rate | standard_error | n_total | avg_length |
| -------- | ------- | ------- | ------- | ------- |
| Llama-2-7B-Chat-hf | 71.37 | 1.59 | 805 | 1479 |
| Llama-2-7B-32K-Instruct | 70.36 | 1.61 | 803 | 1885 |
| oasst-rlhf-llama-33b | 66.52 | 1.66 | 805 | 1079 |
| text_davinci_003 | 50.00 | 0.00 | 805 | 307 |
| falcon-40b-instruct | 45.71 | 1.75 | 805 | 662 |
| alpaca-farm-ppo-human | 41.24 | 1.73 | 805 | 803 |
| alpaca-7b | 26.46 | 1.54 | 805 | 396 |
| text_davinci_001 | 15.17 | 1.24 | 804 | 296 |

* Rouge Score over BookSum

| Model | R1 | R2 | RL |
| -------- | ------- | ------- | ------- |
| Llama-2-7B-Chat-hf | 0.055 | 0.008 | 0.046 |
| Longchat-7b-16k | 0.303 | 0.055 | 0.160 |
| Longchat-7b-v1.5-32k | 0.308 | 0.057 | 0.163 |
| GPT-3.5-Turbo-16K | 0.324 | 0.066 | 0.178 |
| Llama-2-7B-32K-Instruct (ours) | 0.336 | 0.076 | 0.184 |

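For reference, R1/R2/RL numbers of this kind can be computed with the Hugging Face `evaluate` library; the sketch below assumes `pip install evaluate rouge_score` and uses toy strings in place of the actual BookSum generations and references:

```python
import evaluate

rouge = evaluate.load("rouge")
predictions = ["a model-generated summary of a book chapter ..."]   # toy placeholder
references = ["the human-written BookSum reference summary ..."]    # toy placeholder

scores = rouge.compute(predictions=predictions, references=references)
print(scores["rouge1"], scores["rouge2"], scores["rougeL"])  # correspond to R1 / R2 / RL
```
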
* Accuracy over MQA

| Model | 20 docs (Avg 2.9K tokens) | 30 docs (Avg 4.4K tokens) | 50 docs (Avg 7.4K tokens) |
| -------- | ------- | ------- | ------- |
| Llama-2-7B-Chat-hf | 0.448 | 0.421 | 0.354 |
| Longchat-7b-16k | 0.510 | 0.473 | 0.428 |
| Longchat-7b-v1.5-32k | 0.534 | 0.516 | 0.479 |
| GPT-3.5-Turbo-16K | 0.622 | 0.609 | 0.577 |
| Llama-2-7B-32K-Instruct (ours) | 0.622 | 0.604 | 0.589 |

## Limitations and Bias

As with all language models, Llama-2-7B-32K-Instruct may generate incorrect or biased content. It's important to keep this in mind when using the model.

## Community

Join us on [Together Discord](https://discord.gg/6ZVDU8tTD4).
added_tokens.json ADDED
@@ -0,0 +1,2 @@
{
}
config.json ADDED
@@ -0,0 +1,28 @@
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 32768,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "type": "linear"
  },
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 32000
}
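
The `rope_scaling` entry above is linear position interpolation: position indices are divided by `factor` before the rotary frequencies are computed, which is how a model pretrained with 4K positions reaches `max_position_embeddings = 32768` (4096 × 8). A minimal sketch of that computation, mirroring `_update_cos_sin_cache` in the `modeling_flash_llama.py` file below:

```python
import torch

# Linear RoPE scaling: divide positions by the scaling factor before building the cos/sin tables.
dim, base, scaling_factor, seqlen = 128, 10000.0, 8.0, 32768  # head_dim = 4096 / 32 = 128
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
t = torch.arange(seqlen, dtype=torch.float32) / scaling_factor  # positions 0..32767 -> 0..4095.875
freqs = torch.outer(t, inv_freq)
cos, sin = torch.cos(freqs), torch.sin(freqs)  # tables applied to q/k by the rotary kernel
```
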
generation_config.json ADDED
@@ -0,0 +1,7 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "transformers_version": "4.31.0"
}
modeling_flash_llama.py ADDED
@@ -0,0 +1,1012 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ PyTorch LLaMA model."""
21
+ import math
22
+ from typing import List, Optional, Tuple, Union
23
+
24
+ import torch
25
+ import torch.nn.functional as F
26
+ import torch.utils.checkpoint
27
+ from torch import nn
28
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
29
+
30
+ from transformers.activations import ACT2FN
31
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
32
+ from transformers.modeling_utils import PreTrainedModel
33
+ from transformers.utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
34
+ from transformers.models.llama.configuration_llama import LlamaConfig
35
+
36
+
37
+ try:
38
+ from flash_attn.flash_attn_interface import (
39
+ flash_attn_func,
40
+ flash_attn_kvpacked_func,
41
+ flash_attn_qkvpacked_func,
42
+ flash_attn_varlen_kvpacked_func,
43
+ )
44
+ from flash_attn.bert_padding import unpad_input, pad_input
45
+ flash_attn_v2_installed = True
46
+ print('>>>> Flash Attention installed')
47
+ except ImportError:
48
+ flash_attn_v2_installed = False
49
+ raise ImportError('Please install Flash Attention: `pip install flash-attn --no-build-isolation`')
50
+
51
+ try:
52
+ from flash_attn.layers.rotary import apply_rotary_emb_func
53
+ flash_rope_installed = True
54
+ print('>>>> Flash RoPE installed')
55
+ except ImportError:
56
+ flash_rope_installed = False
57
+ raise ImportError('Please install RoPE kernels: `pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary`')
58
+
59
+
60
+ logger = logging.get_logger(__name__)
61
+
62
+ _CONFIG_FOR_DOC = "LlamaConfig"
63
+
64
+
65
+ # @torch.jit.script
66
+ def rmsnorm_func(hidden_states, weight, variance_epsilon):
67
+ input_dtype = hidden_states.dtype
68
+ hidden_states = hidden_states.to(torch.float32)
69
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
70
+ hidden_states = hidden_states * torch.rsqrt(variance + variance_epsilon)
71
+ return (weight * hidden_states).to(input_dtype)
72
+
73
+
74
+ class LlamaRMSNorm(nn.Module):
75
+ def __init__(self, hidden_size, eps=1e-6):
76
+ """
77
+ LlamaRMSNorm is equivalent to T5LayerNorm
78
+ """
79
+ super().__init__()
80
+ self.weight = nn.Parameter(torch.ones(hidden_size))
81
+ self.register_buffer(
82
+ "variance_epsilon",
83
+ torch.tensor(eps),
84
+ persistent=False,
85
+ )
86
+
87
+ def forward(self, hidden_states):
88
+ return rmsnorm_func(hidden_states, self.weight, self.variance_epsilon)
89
+
90
+
91
+ class FlashRotaryEmbedding(torch.nn.Module):
92
+ """
93
+ The rotary position embeddings from RoFormer_ (Su et. al).
94
+ A crucial insight from the method is that the query and keys are
95
+ transformed by rotation matrices which depend on the relative positions.
96
+
97
+ Other implementations are available in the Rotary Transformer repo_ and in
98
+ GPT-NeoX_, GPT-NeoX was an inspiration
99
+
100
+ .. _RoFormer: https://arxiv.org/abs/2104.09864
101
+ .. _repo: https://github.com/ZhuiyiTechnology/roformer
102
+ .. _GPT-NeoX: https://github.com/EleutherAI/gpt-neox
103
+
104
+ If scale_base is not None, this implements XPos (Sun et al., https://arxiv.org/abs/2212.10554).
105
+ A recommended value for scale_base is 512: https://github.com/HazyResearch/flash-attention/issues/96
106
+ Reference: https://github.com/sunyt32/torchscale/blob/main/torchscale/component/xpos_relative_position.py
107
+ """
108
+
109
+ def __init__(self, dim: int, base=10000.0, interleaved=False, scale_base=None,
110
+ scaling_factor=1.0, pos_idx_in_fp32=True, device=None):
111
+ """
112
+ interleaved: if True, rotate pairs of even and odd dimensions (GPT-J style) instead
113
+ of 1st half and 2nd half (GPT-NeoX style).
114
+ pos_idx_in_fp32: if True, the position indices [0.0, ..., seqlen - 1] are in fp32,
115
+ otherwise they might be in lower precision.
116
+ This option was added because previously (before 2023-07-02), when we construct
117
+ the position indices, we use the dtype of self.inv_freq. In most cases this would
118
+ be fp32, but if the model is trained in pure bf16 (not mixed precision), then
119
+ self.inv_freq would be bf16, and the position indices are also in bf16.
120
+ Because of the limited precision of bf16 (e.g. 1995.0 is rounded to 2000.0), the
121
+ embeddings for some positions will coincide.
122
+ To maintain compatibility with models previously trained in pure bf16,
123
+ we add this option.
124
+ scaling_factor: RotaryEmbedding extended with linear scaling.
125
+ """
126
+ super().__init__()
127
+ self.dim = dim
128
+ self.base = float(base)
129
+ self.pos_idx_in_fp32 = pos_idx_in_fp32
130
+ # Generate and save the inverse frequency buffer (non trainable)
131
+ inv_freq = self._compute_inv_freq(device)
132
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
133
+ self.interleaved = interleaved
134
+ self.scale_base = scale_base
135
+ self.scaling_factor = scaling_factor
136
+ scale = ((torch.arange(0, dim, 2, device=device, dtype=torch.float32) + 0.4 * dim)
137
+ / (1.4 * dim) if scale_base is not None else None)
138
+ self.register_buffer("scale", scale)
139
+
140
+ self._seq_len_cached = 0
141
+ self._cos_cached = None
142
+ self._sin_cached = None
143
+ self._cos_k_cached = None
144
+ self._sin_k_cached = None
145
+
146
+ def _compute_inv_freq(self, device=None):
147
+ return 1 / (self.base ** (torch.arange(0, self.dim, 2, device=device,
148
+ dtype=torch.float32) / self.dim))
149
+
150
+
151
+ def _update_cos_sin_cache(self, seqlen, device=None, dtype=None):
152
+ # Reset the tables if the sequence length has changed,
153
+ # if we're on a new device (possibly due to tracing for instance),
154
+ # or if we're switching from inference mode to training
155
+ if (seqlen > self._seq_len_cached or self._cos_cached.device != device
156
+ or self._cos_cached.dtype != dtype
157
+ or (self.training and self._cos_cached.is_inference())):
158
+ self._seq_len_cached = seqlen
159
+ # We want fp32 here, not self.inv_freq.dtype, since the model could be loaded in bf16
160
+ # And the output of arange can be quite large, so bf16 would lose a lot of precision.
161
+ # However, for compatibility reason, we add an option to use the dtype of self.inv_freq.
162
+ if self.pos_idx_in_fp32:
163
+ t = torch.arange(seqlen, device=device, dtype=torch.float32)
164
+ t /= self.scaling_factor
165
+ # We want fp32 here as well since inv_freq will be multiplied with t, and the output
166
+ # will be large. Having it in bf16 will lose a lot of precision and cause the
167
+ # cos & sin output to change significantly.
168
+ # We want to recompute self.inv_freq if it was not loaded in fp32
169
+ if self.inv_freq.dtype != torch.float32:
170
+ inv_freq = self.inv_freq.to(torch.float32)
171
+ else:
172
+ inv_freq = self.inv_freq
173
+ else:
174
+ t = torch.arange(seqlen, device=device, dtype=self.inv_freq.dtype)
175
+ t /= self.scaling_factor
176
+ inv_freq = self.inv_freq
177
+ # Don't do einsum, it converts fp32 to fp16 under AMP
178
+ # freqs = torch.einsum("i,j->ij", t, self.inv_freq)
179
+ freqs = torch.outer(t, inv_freq)
180
+ if self.scale is None:
181
+ self._cos_cached = torch.cos(freqs).to(dtype)
182
+ self._sin_cached = torch.sin(freqs).to(dtype)
183
+ else:
184
+ power = ((torch.arange(seqlen, dtype=self.scale.dtype, device=self.scale.device)
185
+ - seqlen // 2) / self.scale_base)
186
+ scale = self.scale.to(device=power.device) ** power.unsqueeze(-1)
187
+ # We want the multiplication by scale to happen in fp32
188
+ self._cos_cached = (torch.cos(freqs) * scale).to(dtype)
189
+ self._sin_cached = (torch.sin(freqs) * scale).to(dtype)
190
+ self._cos_k_cached = (torch.cos(freqs) / scale).to(dtype)
191
+ self._sin_k_cached = (torch.sin(freqs) / scale).to(dtype)
192
+
193
+ def forward(self, q: torch.Tensor, k: torch.Tensor, seqlen_offset: int = 0) -> Tuple[torch.Tensor, torch.Tensor]:
194
+ """
195
+ q: (batch, seqlen, nheads, headdim)
196
+ k: (batch, seqlen, nheads, headdim)
197
+ seqlen_offset: can be used in generation where the qkv being passed in is only the last
198
+ token in the batch.
199
+ """
200
+ self._update_cos_sin_cache(q.shape[1] + seqlen_offset, device=q.device, dtype=q.dtype)
201
+ if self.scale is None:
202
+ return apply_rotary_emb_func(
203
+ q, self._cos_cached[seqlen_offset:], self._sin_cached[seqlen_offset:],
204
+ self.interleaved, True # inplace=True
205
+ ), apply_rotary_emb_func(
206
+ k, self._cos_cached[seqlen_offset:], self._sin_cached[seqlen_offset:],
207
+ self.interleaved, True # inplace=True
208
+ )
209
+ else:
210
+ assert False
211
+
212
+ class LlamaMLP(nn.Module):
213
+ def __init__(self, config):
214
+ super().__init__()
215
+ self.config = config
216
+ self.hidden_size = config.hidden_size
217
+ self.intermediate_size = config.intermediate_size
218
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
219
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
220
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
221
+ self.act_fn = ACT2FN[config.hidden_act]
222
+
223
+ def forward(self, x):
224
+ if self.config.pretraining_tp > 1:
225
+ slice = self.intermediate_size // self.config.pretraining_tp
226
+ gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
227
+ up_proj_slices = self.up_proj.weight.split(slice, dim=0)
228
+ down_proj_slices = self.down_proj.weight.split(slice, dim=1)
229
+
230
+ gate_proj = torch.cat(
231
+ [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
232
+ )
233
+ up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
234
+
235
+ intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
236
+ down_proj = [
237
+ F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
238
+ ]
239
+ down_proj = sum(down_proj)
240
+ else:
241
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
242
+
243
+ return down_proj
244
+
245
+ @torch.jit.script
246
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
247
+ """
248
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
249
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
250
+ """
251
+ batch, slen, _, num_key_value_heads, head_dim = hidden_states.shape
252
+ if n_rep == 1:
253
+ return hidden_states
254
+ hidden_states = hidden_states[:, :, :, :, None, :].expand(batch, slen, 2, num_key_value_heads, n_rep, head_dim)
255
+ return hidden_states.reshape(batch, slen, 2, num_key_value_heads * n_rep, head_dim)
256
+
257
+
258
+ class LlamaAttention(nn.Module):
259
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
260
+
261
+ def __init__(self, config: LlamaConfig):
262
+ super().__init__()
263
+ self.config = config
264
+ self.hidden_size = config.hidden_size
265
+ self.num_heads = config.num_attention_heads
266
+ self.head_dim = self.hidden_size // self.num_heads
267
+ self.num_key_value_heads = config.num_key_value_heads
268
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
269
+ self.max_position_embeddings = config.max_position_embeddings
270
+
271
+ if (self.head_dim * self.num_heads) != self.hidden_size:
272
+ raise ValueError(
273
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
274
+ f" and `num_heads`: {self.num_heads})."
275
+ )
276
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
277
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
278
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
279
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
280
+
281
+ self.register_buffer(
282
+ "norm_factor",
283
+ torch.sqrt(torch.tensor(self.head_dim, dtype=torch.float32)).to(torch.get_default_dtype()),
284
+ persistent=False,
285
+ )
286
+
287
+ if self.config.rope_scaling is None:
288
+ scaling_factor = 1
289
+ else:
290
+ scaling_type = self.config.rope_scaling["type"]
291
+ scaling_factor = self.config.rope_scaling["factor"]
292
+ assert scaling_type == 'linear'
293
+
294
+ self.rotary_emb = FlashRotaryEmbedding(
295
+ self.head_dim, base=10000, interleaved=False, scaling_factor=scaling_factor,
296
+ )
297
+
298
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
299
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
300
+
301
+ def forward(
302
+ self,
303
+ hidden_states: torch.Tensor,
304
+ attention_mask: Optional[torch.Tensor] = None,
305
+ position_ids: Optional[torch.LongTensor] = None,
306
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
307
+ output_attentions: bool = False,
308
+ use_cache: bool = False,
309
+ is_padded_inputs: Optional[bool] = False,
310
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
311
+ bsz, q_len, h_size = hidden_states.size()
312
+
313
+ has_layer_past = past_key_value is not None
314
+
315
+ if has_layer_past:
316
+ past_kv = past_key_value[0]
317
+ past_len = past_key_value[1]
318
+ else:
319
+ past_len = 0
320
+
321
+ if self.config.pretraining_tp > 1:
322
+ key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
323
+ query_slices = self.q_proj.weight.split(
324
+ (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
325
+ )
326
+ key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
327
+ value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
328
+
329
+ q = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
330
+ q = torch.cat(q, dim=-1)
331
+
332
+ k = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
333
+ k = torch.cat(k, dim=-1)
334
+
335
+ v = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
336
+ v = torch.cat(v, dim=-1)
337
+
338
+ else:
339
+ q = self.q_proj(hidden_states)
340
+ k = self.k_proj(hidden_states)
341
+ v = self.v_proj(hidden_states)
342
+
343
+ q = q.view(bsz, q_len, self.num_heads, self.head_dim)
344
+ k = k.view(bsz, q_len, self.num_key_value_heads, self.head_dim)
345
+ v = v.view(bsz, q_len, self.num_key_value_heads, self.head_dim)
346
+
347
+ q, k = self.rotary_emb(q, k, past_len)
348
+
349
+ kv = torch.stack([k, v], 2)
350
+ kv = repeat_kv(kv, self.num_key_value_groups)
351
+
352
+ # Cache QKV values
353
+ if has_layer_past:
354
+ new_len = past_len+q.size(1)
355
+ if new_len > past_kv.size(1):
356
+ past_kv = torch.cat([past_kv, torch.empty(bsz, 256, 2, kv.size(3), kv.size(4), dtype=kv.dtype, device=kv.device)], 1)
357
+ past_kv[:, past_len:new_len] = kv
358
+ kv = past_kv[:, :new_len]
359
+ else:
360
+ past_kv = kv
361
+
362
+ past_key_value = (past_kv, past_len+q.size(1)) if use_cache else None
363
+
364
+ if is_padded_inputs:
365
+
366
+ # varlen, ignore padding tokens, efficient for large batch with many paddings
367
+
368
+ assert attention_mask is not None
369
+
370
+ unpadded_kv, indices_k, cu_seqlens_k, max_seqlen_k = unpad_input(kv, attention_mask)
371
+ unpadded_q, indices_q, cu_seqlens_q, max_seqlen_q = unpad_input(q, attention_mask[:, -q.size(1):])
372
+ attn_outputs = flash_attn_varlen_kvpacked_func(
373
+ unpadded_q, unpadded_kv, cu_seqlens_q, cu_seqlens_k,
374
+ max_seqlen_q, max_seqlen_k,
375
+ dropout_p=0.0, softmax_scale=1.0/self.norm_factor,
376
+ causal=(not has_layer_past), return_attn_probs=output_attentions
377
+ )
378
+
379
+ attn_output = attn_outputs[0] if output_attentions else attn_outputs
380
+ attn_output = pad_input(
381
+ attn_output, indices_q, bsz, max_seqlen_q
382
+ ).reshape(bsz, q_len, h_size)
383
+ attn_weights = attn_outputs[2] if output_attentions else None
384
+
385
+ else:
386
+
387
+ # no padding tokens, more efficient
388
+
389
+ attn_outputs = flash_attn_kvpacked_func(
390
+ q, kv, dropout_p=0.0, softmax_scale=1.0/self.norm_factor, causal=(not has_layer_past), return_attn_probs=output_attentions)
391
+
392
+ attn_output = attn_outputs[0] if output_attentions else attn_outputs
393
+ attn_output = attn_output.reshape(bsz, q_len, h_size)
394
+ attn_weights = attn_outputs[2] if output_attentions else None
395
+
396
+ if self.config.pretraining_tp > 1:
397
+ attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
398
+ o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
399
+ attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
400
+ else:
401
+ attn_output = self.o_proj(attn_output)
402
+
403
+ if not output_attentions:
404
+ attn_weights = None
405
+
406
+ return attn_output, attn_weights, past_key_value
407
+
408
+
409
+ class LlamaDecoderLayer(nn.Module):
410
+ def __init__(self, config: LlamaConfig):
411
+ super().__init__()
412
+ self.hidden_size = config.hidden_size
413
+ self.self_attn = LlamaAttention(config=config)
414
+ self.mlp = LlamaMLP(config)
415
+ self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
416
+ self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
417
+
418
+ def forward(
419
+ self,
420
+ hidden_states: torch.Tensor,
421
+ attention_mask: Optional[torch.Tensor] = None,
422
+ position_ids: Optional[torch.LongTensor] = None,
423
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
424
+ is_padded_inputs: Optional[bool] = False,
425
+ output_attentions: Optional[bool] = False,
426
+ use_cache: Optional[bool] = False,
427
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
428
+ """
429
+ Args:
430
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
431
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
432
+ `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
433
+ output_attentions (`bool`, *optional*):
434
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
435
+ returned tensors for more detail.
436
+ use_cache (`bool`, *optional*):
437
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
438
+ (see `past_key_values`).
439
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
440
+ """
441
+
442
+ residual = hidden_states
443
+
444
+ hidden_states = self.input_layernorm(hidden_states)
445
+
446
+ # Self Attention
447
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
448
+ hidden_states=hidden_states,
449
+ attention_mask=attention_mask,
450
+ position_ids=position_ids,
451
+ past_key_value=past_key_value,
452
+ output_attentions=output_attentions,
453
+ use_cache=use_cache,
454
+ is_padded_inputs=is_padded_inputs,
455
+ )
456
+ hidden_states = residual + hidden_states
457
+
458
+ # Fully Connected
459
+ residual = hidden_states
460
+ hidden_states = self.post_attention_layernorm(hidden_states)
461
+ hidden_states = self.mlp(hidden_states)
462
+ hidden_states = residual + hidden_states
463
+
464
+ outputs = (hidden_states,)
465
+
466
+ if output_attentions:
467
+ outputs += (self_attn_weights,)
468
+
469
+ if use_cache:
470
+ outputs += (present_key_value,)
471
+
472
+ return outputs
473
+
474
+
475
+ LLAMA_START_DOCSTRING = r"""
476
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
477
+ library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
478
+ etc.)
479
+
480
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
481
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
482
+ and behavior.
483
+
484
+ Parameters:
485
+ config ([`LlamaConfig`]):
486
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
487
+ load the weights associated with the model, only the configuration. Check out the
488
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
489
+ """
490
+
491
+
492
+ @add_start_docstrings(
493
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
494
+ LLAMA_START_DOCSTRING,
495
+ )
496
+ class LlamaPreTrainedModel(PreTrainedModel):
497
+ config_class = LlamaConfig
498
+ base_model_prefix = "model"
499
+ supports_gradient_checkpointing = True
500
+ _no_split_modules = ["LlamaDecoderLayer"]
501
+ _skip_keys_device_placement = "past_key_values"
502
+ _supports_flash_attn_2 = True
503
+
504
+ def _init_weights(self, module):
505
+ std = self.config.initializer_range
506
+ if isinstance(module, nn.Linear):
507
+ module.weight.data.normal_(mean=0.0, std=std)
508
+ if module.bias is not None:
509
+ module.bias.data.zero_()
510
+ elif isinstance(module, nn.Embedding):
511
+ module.weight.data.normal_(mean=0.0, std=std)
512
+ if module.padding_idx is not None:
513
+ module.weight.data[module.padding_idx].zero_()
514
+
515
+ def _set_gradient_checkpointing(self, module, value=False):
516
+ if isinstance(module, LlamaModel):
517
+ module.gradient_checkpointing = value
518
+
519
+
520
+ LLAMA_INPUTS_DOCSTRING = r"""
521
+ Args:
522
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
523
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
524
+ it.
525
+
526
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
527
+ [`PreTrainedTokenizer.__call__`] for details.
528
+
529
+ [What are input IDs?](../glossary#input-ids)
530
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
531
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
532
+
533
+ - 1 for tokens that are **not masked**,
534
+ - 0 for tokens that are **masked**.
535
+
536
+ [What are attention masks?](../glossary#attention-mask)
537
+
538
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
539
+ [`PreTrainedTokenizer.__call__`] for details.
540
+
541
+ If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
542
+ `past_key_values`).
543
+
544
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
545
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
546
+ information on the default strategy.
547
+
548
+ - 1 indicates the head is **not masked**,
549
+ - 0 indicates the head is **masked**.
550
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
551
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
552
+ config.n_positions - 1]`.
553
+
554
+ [What are position IDs?](../glossary#position-ids)
555
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
556
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
557
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
558
+ `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
559
+
560
+ Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
561
+ blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
562
+
563
+ If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
564
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
565
+ `decoder_input_ids` of shape `(batch_size, sequence_length)`.
566
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
567
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
568
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
569
+ model's internal embedding lookup matrix.
570
+ use_cache (`bool`, *optional*):
571
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
572
+ `past_key_values`).
573
+ output_attentions (`bool`, *optional*):
574
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
575
+ tensors for more detail.
576
+ output_hidden_states (`bool`, *optional*):
577
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
578
+ more detail.
579
+ return_dict (`bool`, *optional*):
580
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
581
+ """
582
+
583
+
584
+ @add_start_docstrings(
585
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
586
+ LLAMA_START_DOCSTRING,
587
+ )
588
+ class LlamaModel(LlamaPreTrainedModel):
589
+ """
590
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`LlamaDecoderLayer`]
591
+
592
+ Args:
593
+ config: LlamaConfig
594
+ """
595
+
596
+ def __init__(self, config: LlamaConfig):
597
+ super().__init__(config)
598
+ self.padding_idx = config.pad_token_id
599
+ self.vocab_size = config.vocab_size
600
+
601
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
602
+ self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
603
+ self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
604
+
605
+ self.gradient_checkpointing = False
606
+ # Initialize weights and apply final processing
607
+ self.post_init()
608
+
609
+ def get_input_embeddings(self):
610
+ return self.embed_tokens
611
+
612
+ def set_input_embeddings(self, value):
613
+ self.embed_tokens = value
614
+
615
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
616
+ def forward(
617
+ self,
618
+ input_ids: torch.LongTensor = None,
619
+ attention_mask: Optional[torch.Tensor] = None,
620
+ position_ids: Optional[torch.LongTensor] = None,
621
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
622
+ inputs_embeds: Optional[torch.FloatTensor] = None,
623
+ use_cache: Optional[bool] = None,
624
+ output_attentions: Optional[bool] = None,
625
+ output_hidden_states: Optional[bool] = None,
626
+ return_dict: Optional[bool] = None,
627
+ is_padded_inputs: Optional[bool] = False,
628
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
629
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
630
+ output_hidden_states = (
631
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
632
+ )
633
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
634
+
635
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
636
+
637
+ # retrieve input_ids and inputs_embeds
638
+ if input_ids is not None and inputs_embeds is not None:
639
+ raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
640
+ elif input_ids is not None:
641
+ batch_size, seq_length = input_ids.shape
642
+ elif inputs_embeds is not None:
643
+ batch_size, seq_length, _ = inputs_embeds.shape
644
+ else:
645
+ raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
646
+
647
+ seq_length_with_past = seq_length
648
+ past_key_values_length = 0
649
+
650
+ if past_key_values is not None:
651
+ past_key_values_length = past_key_values[0][0].shape[2]
652
+ seq_length_with_past = seq_length_with_past + past_key_values_length
653
+
654
+ position_ids = None
655
+
656
+ if inputs_embeds is None:
657
+ inputs_embeds = self.embed_tokens(input_ids)
658
+
659
+ hidden_states = inputs_embeds
660
+
661
+ if self.gradient_checkpointing and self.training:
662
+ if use_cache:
663
+ logger.warning_once(
664
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
665
+ )
666
+ use_cache = False
667
+
668
+ # decoder layers
669
+ all_hidden_states = () if output_hidden_states else None
670
+ all_self_attns = () if output_attentions else None
671
+ next_decoder_cache = () if use_cache else None
672
+
673
+ for idx, decoder_layer in enumerate(self.layers):
674
+ if output_hidden_states:
675
+ all_hidden_states += (hidden_states,)
676
+
677
+ past_key_value = past_key_values[idx] if past_key_values is not None else None
678
+
679
+ if self.gradient_checkpointing and self.training:
680
+
681
+ def create_custom_forward(module):
682
+ def custom_forward(*inputs):
683
+ # None for past_key_value
684
+ return module(*inputs, output_attentions, None)
685
+
686
+ return custom_forward
687
+
688
+ layer_outputs = torch.utils.checkpoint.checkpoint(
689
+ create_custom_forward(decoder_layer),
690
+ hidden_states,
691
+ attention_mask,
692
+ position_ids,
693
+ None,
694
+ is_padded_inputs
695
+ )
696
+ else:
697
+ layer_outputs = decoder_layer(
698
+ hidden_states,
699
+ attention_mask=attention_mask,
700
+ position_ids=position_ids,
701
+ past_key_value=past_key_value,
702
+ output_attentions=output_attentions,
703
+ use_cache=use_cache,
704
+ is_padded_inputs=is_padded_inputs,
705
+ )
706
+
707
+ hidden_states = layer_outputs[0]
708
+
709
+ if use_cache:
710
+ next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
711
+
712
+ if output_attentions:
713
+ all_self_attns += (layer_outputs[1],)
714
+
715
+ hidden_states = self.norm(hidden_states)
716
+
717
+ # add hidden states from the last decoder layer
718
+ if output_hidden_states:
719
+ all_hidden_states += (hidden_states,)
720
+
721
+ next_cache = next_decoder_cache if use_cache else None
722
+ if not return_dict:
723
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
724
+ return BaseModelOutputWithPast(
725
+ last_hidden_state=hidden_states,
726
+ past_key_values=next_cache,
727
+ hidden_states=all_hidden_states,
728
+ attentions=all_self_attns,
729
+ )
730
+
731
+
732
+ class LlamaForCausalLM(LlamaPreTrainedModel):
733
+ _tied_weights_keys = ["lm_head.weight"]
734
+
735
+ def __init__(self, config):
736
+ super().__init__(config)
737
+ self.model = LlamaModel(config)
738
+ self.vocab_size = config.vocab_size
739
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
740
+
741
+ # Initialize weights and apply final processing
742
+ self.post_init()
743
+
744
+ def get_input_embeddings(self):
745
+ return self.model.embed_tokens
746
+
747
+ def set_input_embeddings(self, value):
748
+ self.model.embed_tokens = value
749
+
750
+ def get_output_embeddings(self):
751
+ return self.lm_head
752
+
753
+ def set_output_embeddings(self, new_embeddings):
754
+ self.lm_head = new_embeddings
755
+
756
+ def set_decoder(self, decoder):
757
+ self.model = decoder
758
+
759
+ def get_decoder(self):
760
+ return self.model
761
+
762
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
763
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
764
+ def forward(
765
+ self,
766
+ input_ids: torch.LongTensor = None,
767
+ attention_mask: Optional[torch.Tensor] = None,
768
+ position_ids: Optional[torch.LongTensor] = None,
769
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
770
+ inputs_embeds: Optional[torch.FloatTensor] = None,
771
+ labels: Optional[torch.LongTensor] = None,
772
+ use_cache: Optional[bool] = None,
773
+ output_attentions: Optional[bool] = None,
774
+ output_hidden_states: Optional[bool] = None,
775
+ return_dict: Optional[bool] = None,
776
+ is_padded_inputs: Optional[bool] = None,
777
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
778
+ r"""
779
+ Args:
780
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
781
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
782
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
783
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
784
+
785
+ Returns:
786
+
787
+ Example:
788
+
789
+ ```python
790
+ >>> from transformers import AutoTokenizer, LlamaForCausalLM
791
+
792
+ >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
793
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
794
+
795
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
796
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
797
+
798
+ >>> # Generate
799
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
800
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
801
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
802
+ ```"""
803
+
804
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
805
+ output_hidden_states = (
806
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
807
+ )
808
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
809
+
810
+ is_padded_inputs = ((attention_mask is not None) and (not attention_mask.all().item()))
811
+
812
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
813
+ outputs = self.model(
814
+ input_ids=input_ids,
815
+ attention_mask=attention_mask,
816
+ position_ids=position_ids,
817
+ past_key_values=past_key_values,
818
+ inputs_embeds=inputs_embeds,
819
+ use_cache=use_cache,
820
+ output_attentions=output_attentions,
821
+ output_hidden_states=output_hidden_states,
822
+ return_dict=return_dict,
823
+ is_padded_inputs=is_padded_inputs,
824
+ )
825
+
826
+ hidden_states = outputs[0]
827
+ if self.config.pretraining_tp > 1:
828
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
829
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
830
+ logits = torch.cat(logits, dim=-1)
831
+ else:
832
+ logits = self.lm_head(hidden_states)
833
+ logits = logits.float()
834
+
835
+ loss = None
836
+ if labels is not None:
837
+ # Shift so that tokens < n predict n
838
+ shift_logits = logits[..., :-1, :].contiguous()
839
+ shift_labels = labels[..., 1:].contiguous()
840
+ # Flatten the tokens
841
+ loss_fct = CrossEntropyLoss()
842
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
843
+ shift_labels = shift_labels.view(-1)
844
+ # Enable model parallelism
845
+ shift_labels = shift_labels.to(shift_logits.device)
846
+ loss = loss_fct(shift_logits, shift_labels)
847
+
848
+ if not return_dict:
849
+ output = (logits,) + outputs[1:]
850
+ return (loss,) + output if loss is not None else output
851
+
852
+ return CausalLMOutputWithPast(
853
+ loss=loss,
854
+ logits=logits,
855
+ past_key_values=outputs.past_key_values,
856
+ hidden_states=outputs.hidden_states,
857
+ attentions=outputs.attentions,
858
+ )
859
+
860
+ def prepare_inputs_for_generation(
861
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
862
+ ):
863
+ if past_key_values:
864
+ input_ids = input_ids[:, -1:]
865
+
866
+ position_ids = kwargs.get("position_ids", None)
867
+
868
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
869
+ if inputs_embeds is not None and past_key_values is None:
870
+ model_inputs = {"inputs_embeds": inputs_embeds}
871
+ else:
872
+ model_inputs = {"input_ids": input_ids}
873
+
874
+ model_inputs.update(
875
+ {
876
+ "position_ids": position_ids,
877
+ "past_key_values": past_key_values,
878
+ "use_cache": kwargs.get("use_cache"),
879
+ "attention_mask": attention_mask,
880
+ "is_padded_inputs": ((attention_mask is not None) and (not attention_mask.all().item()))
881
+ }
882
+ )
883
+ return model_inputs
884
+
885
+ @staticmethod
886
+ def _reorder_cache(past_key_values, beam_idx):
887
+ reordered_past = ()
888
+ for layer_past in past_key_values:
889
+ reordered_past += (
890
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
891
+ )
892
+ return reordered_past
893
+
894
+
895
+ @add_start_docstrings(
896
+ """
897
+ The LLaMa Model transformer with a sequence classification head on top (linear layer).
898
+
899
+ [`LlamaForSequenceClassification`] uses the last token in order to do the classification, as other causal models
900
+ (e.g. GPT-2) do.
901
+
902
+ Since it does classification on the last token, it requires to know the position of the last token. If a
903
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
904
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
905
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
906
+ each row of the batch).
907
+ """,
908
+ LLAMA_START_DOCSTRING,
909
+ )
910
+ class LlamaForSequenceClassification(LlamaPreTrainedModel):
911
+ def __init__(self, config):
912
+ super().__init__(config)
913
+ self.num_labels = config.num_labels
914
+ self.model = LlamaModel(config)
915
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
916
+
917
+ # Initialize weights and apply final processing
918
+ self.post_init()
919
+
920
+ def get_input_embeddings(self):
921
+ return self.model.embed_tokens
922
+
923
+ def set_input_embeddings(self, value):
924
+ self.model.embed_tokens = value
925
+
926
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
927
+ def forward(
928
+ self,
929
+ input_ids: torch.LongTensor = None,
930
+ attention_mask: Optional[torch.Tensor] = None,
931
+ position_ids: Optional[torch.LongTensor] = None,
932
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
933
+ inputs_embeds: Optional[torch.FloatTensor] = None,
934
+ labels: Optional[torch.LongTensor] = None,
935
+ use_cache: Optional[bool] = None,
936
+ output_attentions: Optional[bool] = None,
937
+ output_hidden_states: Optional[bool] = None,
938
+ return_dict: Optional[bool] = None,
939
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
940
+ r"""
941
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
942
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
943
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
944
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
945
+ """
946
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
947
+
948
+ transformer_outputs = self.model(
949
+ input_ids,
950
+ attention_mask=attention_mask,
951
+ position_ids=position_ids,
952
+ past_key_values=past_key_values,
953
+ inputs_embeds=inputs_embeds,
954
+ use_cache=use_cache,
955
+ output_attentions=output_attentions,
956
+ output_hidden_states=output_hidden_states,
957
+ return_dict=return_dict,
958
+ )
959
+ hidden_states = transformer_outputs[0]
960
+ logits = self.score(hidden_states)
961
+
962
+ if input_ids is not None:
963
+ batch_size = input_ids.shape[0]
964
+ else:
965
+ batch_size = inputs_embeds.shape[0]
966
+
967
+ if self.config.pad_token_id is None and batch_size != 1:
968
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
969
+ if self.config.pad_token_id is None:
970
+ sequence_lengths = -1
971
+ else:
972
+ if input_ids is not None:
973
+ sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device)
974
+ else:
975
+ sequence_lengths = -1
976
+
977
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
978
+
979
+ loss = None
980
+ if labels is not None:
981
+ labels = labels.to(logits.device)
982
+ if self.config.problem_type is None:
983
+ if self.num_labels == 1:
984
+ self.config.problem_type = "regression"
985
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
986
+ self.config.problem_type = "single_label_classification"
987
+ else:
988
+ self.config.problem_type = "multi_label_classification"
989
+
990
+ if self.config.problem_type == "regression":
991
+ loss_fct = MSELoss()
992
+ if self.num_labels == 1:
993
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
994
+ else:
995
+ loss = loss_fct(pooled_logits, labels)
996
+ elif self.config.problem_type == "single_label_classification":
997
+ loss_fct = CrossEntropyLoss()
998
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
999
+ elif self.config.problem_type == "multi_label_classification":
1000
+ loss_fct = BCEWithLogitsLoss()
1001
+ loss = loss_fct(pooled_logits, labels)
1002
+ if not return_dict:
1003
+ output = (pooled_logits,) + transformer_outputs[1:]
1004
+ return ((loss,) + output) if loss is not None else output
1005
+
1006
+ return SequenceClassifierOutputWithPast(
1007
+ loss=loss,
1008
+ logits=pooled_logits,
1009
+ past_key_values=transformer_outputs.past_key_values,
1010
+ hidden_states=transformer_outputs.hidden_states,
1011
+ attentions=transformer_outputs.attentions,
1012
+ )
pytorch_model-00001-of-00002.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3a9b9c55c94174d64d95c585d7f95cad27a68eb1a79a27ae725662f55c4185b6
3
+ size 9976631486
pytorch_model-00002-of-00002.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:aac653834c32be990b2c97bc4a5c1999db761789595335d979ee33ded4be6a7e
3
+ size 3500314451
pytorch_model.bin.index.json ADDED
@@ -0,0 +1,330 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 13476835328
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "pytorch_model-00002-of-00002.bin",
7
+ "model.embed_tokens.weight": "pytorch_model-00001-of-00002.bin",
8
+ "model.layers.0.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
9
+ "model.layers.0.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
10
+ "model.layers.0.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
11
+ "model.layers.0.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
12
+ "model.layers.0.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
13
+ "model.layers.0.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
14
+ "model.layers.0.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
15
+ "model.layers.0.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
16
+ "model.layers.0.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
17
+ "model.layers.0.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
18
+ "model.layers.1.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
19
+ "model.layers.1.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
20
+ "model.layers.1.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
21
+ "model.layers.1.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
22
+ "model.layers.1.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
23
+ "model.layers.1.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
24
+ "model.layers.1.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
25
+ "model.layers.1.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
26
+ "model.layers.1.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
27
+ "model.layers.1.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
28
+ "model.layers.10.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
29
+ "model.layers.10.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
30
+ "model.layers.10.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
31
+ "model.layers.10.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
32
+ "model.layers.10.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
33
+ "model.layers.10.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
34
+ "model.layers.10.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
35
+ "model.layers.10.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
36
+ "model.layers.10.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
37
+ "model.layers.10.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
38
+ "model.layers.11.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
39
+ "model.layers.11.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
40
+ "model.layers.11.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
41
+ "model.layers.11.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
42
+ "model.layers.11.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
43
+ "model.layers.11.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
44
+ "model.layers.11.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
45
+ "model.layers.11.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
46
+ "model.layers.11.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
47
+ "model.layers.11.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
48
+ "model.layers.12.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
49
+ "model.layers.12.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
50
+ "model.layers.12.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
51
+ "model.layers.12.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
52
+ "model.layers.12.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
53
+ "model.layers.12.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
54
+ "model.layers.12.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
55
+ "model.layers.12.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
56
+ "model.layers.12.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
57
+ "model.layers.12.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
58
+ "model.layers.13.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
59
+ "model.layers.13.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
60
+ "model.layers.13.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
61
+ "model.layers.13.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
62
+ "model.layers.13.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
63
+ "model.layers.13.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
64
+ "model.layers.13.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
65
+ "model.layers.13.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
66
+ "model.layers.13.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
67
+ "model.layers.13.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
68
+ "model.layers.14.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
69
+ "model.layers.14.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
70
+ "model.layers.14.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
71
+ "model.layers.14.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
72
+ "model.layers.14.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
73
+ "model.layers.14.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
74
+ "model.layers.14.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
75
+ "model.layers.14.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
76
+ "model.layers.14.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
77
+ "model.layers.14.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
78
+ "model.layers.15.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
79
+ "model.layers.15.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
80
+ "model.layers.15.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
81
+ "model.layers.15.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
82
+ "model.layers.15.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
83
+ "model.layers.15.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
84
+ "model.layers.15.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
85
+ "model.layers.15.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
86
+ "model.layers.15.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
87
+ "model.layers.15.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
88
+ "model.layers.16.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
89
+ "model.layers.16.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
90
+ "model.layers.16.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
91
+ "model.layers.16.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
92
+ "model.layers.16.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
93
+ "model.layers.16.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
94
+ "model.layers.16.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
95
+ "model.layers.16.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
96
+ "model.layers.16.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
97
+ "model.layers.16.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
98
+ "model.layers.17.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
99
+ "model.layers.17.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
100
+ "model.layers.17.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
101
+ "model.layers.17.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
102
+ "model.layers.17.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
103
+ "model.layers.17.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
104
+ "model.layers.17.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
105
+ "model.layers.17.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
106
+ "model.layers.17.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
107
+ "model.layers.17.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
108
+ "model.layers.18.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
109
+ "model.layers.18.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
110
+ "model.layers.18.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
111
+ "model.layers.18.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
112
+ "model.layers.18.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
113
+ "model.layers.18.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
114
+ "model.layers.18.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
115
+ "model.layers.18.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
116
+ "model.layers.18.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
117
+ "model.layers.18.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
118
+ "model.layers.19.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
119
+ "model.layers.19.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
120
+ "model.layers.19.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
121
+ "model.layers.19.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
122
+ "model.layers.19.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
123
+ "model.layers.19.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
124
+ "model.layers.19.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
125
+ "model.layers.19.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
126
+ "model.layers.19.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
127
+ "model.layers.19.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
128
+ "model.layers.2.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
129
+ "model.layers.2.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
130
+ "model.layers.2.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
131
+ "model.layers.2.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
132
+ "model.layers.2.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
133
+ "model.layers.2.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
134
+ "model.layers.2.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
135
+ "model.layers.2.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
136
+ "model.layers.2.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
137
+ "model.layers.2.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
138
+ "model.layers.20.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
139
+ "model.layers.20.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
140
+ "model.layers.20.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
141
+ "model.layers.20.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
142
+ "model.layers.20.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
143
+ "model.layers.20.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
144
+ "model.layers.20.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
145
+ "model.layers.20.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
146
+ "model.layers.20.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
147
+ "model.layers.20.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
148
+ "model.layers.21.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
149
+ "model.layers.21.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
150
+ "model.layers.21.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
151
+ "model.layers.21.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
152
+ "model.layers.21.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
153
+ "model.layers.21.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
154
+ "model.layers.21.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
155
+ "model.layers.21.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
156
+ "model.layers.21.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
157
+ "model.layers.21.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
158
+ "model.layers.22.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
159
+ "model.layers.22.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
160
+ "model.layers.22.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
161
+ "model.layers.22.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
162
+ "model.layers.22.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
163
+ "model.layers.22.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
164
+ "model.layers.22.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
165
+ "model.layers.22.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
166
+ "model.layers.22.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
167
+ "model.layers.22.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
168
+ "model.layers.23.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
169
+ "model.layers.23.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
170
+ "model.layers.23.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
171
+ "model.layers.23.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
172
+ "model.layers.23.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
173
+ "model.layers.23.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
174
+ "model.layers.23.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
175
+ "model.layers.23.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
176
+ "model.layers.23.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
177
+ "model.layers.23.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
178
+ "model.layers.24.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
179
+ "model.layers.24.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
180
+ "model.layers.24.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
181
+ "model.layers.24.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
182
+ "model.layers.24.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
183
+ "model.layers.24.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
184
+ "model.layers.24.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
185
+ "model.layers.24.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
186
+ "model.layers.24.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
187
+ "model.layers.24.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
188
+ "model.layers.25.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
189
+ "model.layers.25.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
190
+ "model.layers.25.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
191
+ "model.layers.25.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
192
+ "model.layers.25.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
193
+ "model.layers.25.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
194
+ "model.layers.25.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
195
+ "model.layers.25.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
196
+ "model.layers.25.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
197
+ "model.layers.25.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
198
+ "model.layers.26.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
199
+ "model.layers.26.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
200
+ "model.layers.26.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
201
+ "model.layers.26.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
202
+ "model.layers.26.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
203
+ "model.layers.26.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
204
+ "model.layers.26.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
205
+ "model.layers.26.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
206
+ "model.layers.26.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
207
+ "model.layers.26.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
208
+ "model.layers.27.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
209
+ "model.layers.27.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
210
+ "model.layers.27.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
211
+ "model.layers.27.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
212
+ "model.layers.27.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
213
+ "model.layers.27.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
214
+ "model.layers.27.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
215
+ "model.layers.27.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
216
+ "model.layers.27.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
217
+ "model.layers.27.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
218
+ "model.layers.28.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
219
+ "model.layers.28.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
220
+ "model.layers.28.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
221
+ "model.layers.28.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
222
+ "model.layers.28.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
223
+ "model.layers.28.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
224
+ "model.layers.28.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
225
+ "model.layers.28.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
226
+ "model.layers.28.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
227
+ "model.layers.28.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
228
+ "model.layers.29.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
229
+ "model.layers.29.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
230
+ "model.layers.29.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
231
+ "model.layers.29.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
232
+ "model.layers.29.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
233
+ "model.layers.29.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
234
+ "model.layers.29.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
235
+ "model.layers.29.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
236
+ "model.layers.29.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
237
+ "model.layers.29.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
238
+ "model.layers.3.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
239
+ "model.layers.3.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
240
+ "model.layers.3.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
241
+ "model.layers.3.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
242
+ "model.layers.3.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
243
+ "model.layers.3.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
244
+ "model.layers.3.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
245
+ "model.layers.3.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
246
+ "model.layers.3.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
247
+ "model.layers.3.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
248
+ "model.layers.30.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
249
+ "model.layers.30.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
250
+ "model.layers.30.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
251
+ "model.layers.30.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
252
+ "model.layers.30.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
253
+ "model.layers.30.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
254
+ "model.layers.30.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
255
+ "model.layers.30.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
256
+ "model.layers.30.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
257
+ "model.layers.30.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
258
+ "model.layers.31.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
259
+ "model.layers.31.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
260
+ "model.layers.31.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
261
+ "model.layers.31.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
262
+ "model.layers.31.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
263
+ "model.layers.31.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
264
+ "model.layers.31.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
265
+ "model.layers.31.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
266
+ "model.layers.31.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
267
+ "model.layers.31.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
268
+ "model.layers.4.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
269
+ "model.layers.4.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
270
+ "model.layers.4.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
271
+ "model.layers.4.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
272
+ "model.layers.4.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
273
+ "model.layers.4.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
274
+ "model.layers.4.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
275
+ "model.layers.4.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
276
+ "model.layers.4.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
277
+ "model.layers.4.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
278
+ "model.layers.5.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
279
+ "model.layers.5.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
280
+ "model.layers.5.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
281
+ "model.layers.5.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
282
+ "model.layers.5.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
283
+ "model.layers.5.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
284
+ "model.layers.5.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
285
+ "model.layers.5.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
286
+ "model.layers.5.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
287
+ "model.layers.5.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
288
+ "model.layers.6.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
289
+ "model.layers.6.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
290
+ "model.layers.6.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
291
+ "model.layers.6.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
292
+ "model.layers.6.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
293
+ "model.layers.6.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
294
+ "model.layers.6.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
295
+ "model.layers.6.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
296
+ "model.layers.6.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
297
+ "model.layers.6.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
298
+ "model.layers.7.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
299
+ "model.layers.7.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
300
+ "model.layers.7.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
301
+ "model.layers.7.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
302
+ "model.layers.7.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
303
+ "model.layers.7.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
304
+ "model.layers.7.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
305
+ "model.layers.7.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
306
+ "model.layers.7.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
307
+ "model.layers.7.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
308
+ "model.layers.8.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
309
+ "model.layers.8.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
310
+ "model.layers.8.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
311
+ "model.layers.8.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
312
+ "model.layers.8.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
313
+ "model.layers.8.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
314
+ "model.layers.8.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
315
+ "model.layers.8.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
316
+ "model.layers.8.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
317
+ "model.layers.8.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
318
+ "model.layers.9.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
319
+ "model.layers.9.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
320
+ "model.layers.9.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
321
+ "model.layers.9.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
322
+ "model.layers.9.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
323
+ "model.layers.9.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
324
+ "model.layers.9.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
325
+ "model.layers.9.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
326
+ "model.layers.9.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
327
+ "model.layers.9.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
328
+ "model.norm.weight": "pytorch_model-00002-of-00002.bin"
329
+ }
330
+ }
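The index file maps every parameter name to one of the two weight shards; `transformers` consumes it automatically when loading the sharded checkpoint. As an illustrative aside (not part of this diff), a minimal sketch of reading the index directly, assuming the files from this commit sit in the current directory:
```python
import json

# parameter name -> shard filename, plus total tensor bytes across both shards
with open("pytorch_model.bin.index.json") as f:
    index = json.load(f)

weight_map = index["weight_map"]
print(index["metadata"]["total_size"])             # 13476835328
print(weight_map["model.embed_tokens.weight"])     # pytorch_model-00001-of-00002.bin
print(weight_map["lm_head.weight"])                # pytorch_model-00002-of-00002.bin

# Group parameters by shard, e.g. to load one shard at a time with torch.load.
shards = {}
for name, shard in weight_map.items():
    shards.setdefault(shard, []).append(name)
for shard, names in sorted(shards.items()):
    print(shard, len(names), "tensors")
```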
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "bos_token": "<s>",
+   "eos_token": "</s>",
+   "pad_token": "<unk>",
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
+ size 499723
tokenizer_config.json ADDED
@@ -0,0 +1,50 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32000": {
+       "content": "<pad>",
+       "lstrip": true,
+       "normalized": true,
+       "rstrip": true,
+       "single_word": false,
+       "special": false
+     }
+   },
+   "additional_special_tokens": [],
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "legacy": false,
+   "model_max_length": 32768,
+   "pad_token": "<unk>",
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": true
+ }
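The tokenizer config pins `model_max_length` to 32768 (matching the 32K context window) and reuses `<unk>` as the padding token. As an illustrative aside (not part of this diff), a minimal sketch that loads the tokenizer from the files in this commit (assumed to live in the current directory) and checks those settings:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".")
print(tokenizer.model_max_length)                 # 32768
print(tokenizer.pad_token)                        # <unk> (reused as padding token)
print(tokenizer.bos_token, tokenizer.eos_token)   # <s> </s>

# add_bos_token is true, so encoded inputs start with the BOS id.
ids = tokenizer("Hello, world!", return_tensors="pt")
print(ids["input_ids"])
```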