alexmarques committed
Commit f83d577 · verified · 1 Parent(s): 7a95cff

Update README.md

Files changed (1)
  1. README.md +29 -53
README.md CHANGED
@@ -33,8 +33,10 @@ Weight quantization also reduces disk size requirements by approximately 50%.
 Only weights and activations of the linear operators within transformers blocks are quantized.
 Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension.
 Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations.
-The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
-GPTQ used a 1% damping factor and 256 sequences of 8,192 random tokens.
+Linear scaling factors are computed by minimizing the mean squared error (MSE).
+The [SmoothQuant](https://arxiv.org/abs/2211.10438) algorithm is used to alleviate outliers in the activations, whereas the [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization.
+Both algorithms are implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
+GPTQ used a 1% damping factor and 512 sequences taken from Neural Magic's [LLM compression calibration dataset](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration).
 
 
 ## Deployment
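For intuition, the two schemes described in the updated text boil down to the following PyTorch sketch. This is illustrative only, not the llm-compressor implementation, and it uses a simple absolute-maximum scale for clarity; the actual recipe selects scales by minimizing MSE.

```python
import torch

def quantize_weights_per_channel(w: torch.Tensor):
    # Symmetric static per-channel: one fixed scale per output channel (row),
    # computed once from the weights themselves.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale  # dequantize as q.float() * scale

def quantize_activations_per_token(x: torch.Tensor):
    # Symmetric dynamic per-token: one scale per token (row of the
    # activation matrix), recomputed at runtime for every input.
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale
```

Static per-channel weight scales cost nothing at inference time because they are fixed, while dynamic per-token activation scales adapt to each input at a small runtime cost.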
@@ -69,47 +71,8 @@ generated_text = outputs[0].outputs[0].text
 print(generated_text)
 ```
 
-
 vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
-### Use with transformers
-
-The following example shows how the model can be deployed with Transformers using the `generate()` function.
-
-```python
-from transformers import AutoTokenizer, AutoModelForCausalLM
-
-model_id = "neuralmagic/Phi-3-medium-128k-instruct-quantized.w8a8"
-
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    torch_dtype="auto",
-    device_map="auto",
-    trust_remote_code=True,
-)
-
-messages = [
-    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
-    {"role": "user", "content": "Who are you?"},
-]
-
-input_ids = tokenizer.apply_chat_template(
-    messages,
-    add_generation_prompt=True,
-    return_tensors="pt"
-).to(model.device)
-
-outputs = model.generate(
-    input_ids,
-    max_new_tokens=256,
-    do_sample=True,
-    temperature=0.6,
-    top_p=0.9,
-)
-response = outputs[0][input_ids.shape[-1]:]
-print(tokenizer.decode(response, skip_special_tokens=True))
-```
 
 ## Creation
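As a usage sketch for the OpenAI-compatible mode kept in the context line above: after starting a server with `vllm serve neuralmagic/Phi-3-medium-128k-instruct-quantized.w8a8`, any OpenAI client can talk to it. The sketch below assumes the server runs locally on vLLM's default port.

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint, by default at port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="neuralmagic/Phi-3-medium-128k-instruct-quantized.w8a8",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(completion.choices[0].message.content)
```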
 
@@ -124,22 +87,35 @@ import random
 
 model_id = "microsoft/Phi-3-medium-128k-instruct"
 
-num_samples = 256
+num_samples = 512
 max_seq_len = 8192
 
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
-max_token_id = len(tokenizer.get_vocab()) - 1
-input_ids = [[random.randint(0, max_token_id) for _ in range(max_seq_len)] for _ in range(num_samples)]
-attention_mask = num_samples * [max_seq_len * [1]]
-ds = Dataset.from_dict({"input_ids": input_ids, "attention_mask": attention_mask})
-
-recipe = GPTQModifier(
-    targets="Linear",
-    scheme="W8A8",
-    ignore=["lm_head"],
-    dampening_frac=0.01,
-)
+def preprocess_fn(example):
+    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}
+
+ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
+ds = ds.shuffle().select(range(num_samples))
+ds = ds.map(preprocess_fn)
+
+recipe = [
+    SmoothQuantModifier(
+        smoothing_strength=0.8,
+        mappings=[
+            [["re:.*qkv_proj"], "re:.*input_layernorm"],
+            [["re:.*gate_up_proj"], "re:.*post_attention_layernorm"],
+        ],
+    ),
+    GPTQModifier(
+        sequential=True,
+        targets="Linear",
+        scheme="W8A8",
+        ignore=["lm_head"],
+        dampening_frac=0.01,
+        observer="mse",
+    )
+]
 
 model = SparseAutoModelForCausalLM.from_pretrained(
     model_id,
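The `SmoothQuantModifier` entry added to the recipe implements the scale-migration trick from the SmoothQuant paper: per-channel activation outliers are folded into the weights of the following linear layer before quantization. Below is a simplified sketch of the idea, not the library's code; `alpha` corresponds to `smoothing_strength`.

```python
import torch

def smooth(x_absmax: torch.Tensor, w: torch.Tensor, alpha: float = 0.8):
    """Migrate activation outliers into weights (SmoothQuant idea).

    x_absmax: per-input-channel max |activation|, collected on calibration data.
    w: weight of the following linear layer, shape (out_features, in_features).
    """
    w_absmax = w.abs().amax(dim=0)                       # per input channel
    s = x_absmax.pow(alpha) / w_absmax.pow(1.0 - alpha)  # smoothing factors
    # The layer output is unchanged because activations are divided by s
    # (folded into the preceding LayerNorm, per the mappings in the recipe)
    # while the weight columns are multiplied by s.
    return w * s
```

With `smoothing_strength=0.8`, most of the outlier magnitude moves from the activations into the weights, which the static per-channel weight scheme absorbs more gracefully than the per-token activation scheme could.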
 
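On `observer="mse"`: rather than deriving each scale from the raw absolute maximum, an MSE observer searches over candidate clipping ranges and keeps the one whose INT8 round trip has the lowest mean squared error. A rough sketch of that search follows; this is a hypothetical helper illustrating the general technique, not the llm-compressor observer.

```python
import torch

def mse_scale(w: torch.Tensor, grid: int = 100) -> torch.Tensor:
    """Pick the symmetric INT8 scale that minimizes round-trip MSE."""
    absmax = w.abs().amax()
    best_scale, best_err = absmax / 127.0, float("inf")
    for i in range(1, grid + 1):
        # Candidate clipping range: a fraction i/grid of the absolute maximum.
        scale = (absmax * i / grid) / 127.0
        q = torch.clamp(torch.round(w / scale), -128, 127)
        err = ((q * scale - w) ** 2).mean().item()
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```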