Weight quantization also reduces disk size requirements by approximately 50%.

Only weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating-point representations for each output channel dimension.
Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating-point representations.
Linear scaling factors are computed by minimizing the mean squared error (MSE).
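
To make the scheme concrete, the sketch below fake-quantizes a weight matrix per output channel and an activation matrix per token. This is an illustration only: it uses simple absolute-max scales rather than the MSE-minimizing scales described above, and the function names are ours, not part of any library.

```python
import torch

def quantize_weight_per_channel(w: torch.Tensor):
    # Static symmetric per-channel: one fixed scale per output channel (row).
    scales = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scales), -128, 127).to(torch.int8)
    return q, scales                                   # scales: [out_channels, 1]

def quantize_activation_per_token(x: torch.Tensor):
    # Dynamic symmetric per-token: one scale per token (row), computed at runtime.
    scales = (x.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(x / scales), -128, 127).to(torch.int8)
    return q, scales                                   # scales: [num_tokens, 1]

w = torch.randn(8, 16)   # linear weight: [out_channels, in_channels]
x = torch.randn(4, 16)   # activations:   [num_tokens, hidden_size]
qw, sw = quantize_weight_per_channel(w)
qx, sx = quantize_activation_per_token(x)

# INT8 matmul plus rescaling approximates the floating-point result:
y_int8 = (qx.float() @ qw.float().T) * (sx * sw.T)
print((y_int8 - x @ w.T).abs().max())                  # small quantization error
```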

The [SmoothQuant](https://arxiv.org/abs/2211.10438) algorithm is used to alleviate outliers in the activations, while the [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization.
Both algorithms are implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
GPTQ used a 1% damping factor and 512 sequences taken from Neural Magic's [LLM compression calibration dataset](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration).
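
The core idea of SmoothQuant is that per-channel scaling factors can migrate activation outliers into the weights while leaving the layer's output mathematically unchanged. The sketch below is our illustration of that identity, not the llm-compressor implementation (which folds the smoothing into the preceding layernorm, per the `mappings` shown in the Creation section); `alpha` plays the role of the `smoothing_strength=0.8` setting used in the recipe below.

```python
import torch

def smoothquant_fold(x: torch.Tensor, w: torch.Tensor, alpha: float = 0.8):
    # Per-input-channel smoothing factors (Xiao et al., 2022):
    #   s_j = max|x_j|^alpha / max|w_j|^(1 - alpha)
    act_max = x.abs().amax(dim=0)             # [in_channels]
    w_max = w.abs().amax(dim=1)               # [in_channels]
    s = act_max.pow(alpha) / w_max.pow(1 - alpha)
    return x / s, w * s.unsqueeze(1)          # smoothed activations and weights

x = torch.randn(4, 16) * torch.linspace(0.1, 10.0, 16)  # outlier-heavy channels
w = torch.randn(16, 8)                                   # weight: [in_channels, out_channels]
x_s, w_s = smoothquant_fold(x, w)

assert torch.allclose(x @ w, x_s @ w_s, atol=1e-4)   # layer output is unchanged
print(x.abs().max().item(), x_s.abs().max().item())  # activation range is tamed
```
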
## Deployment

```python
# ... (the beginning of the vLLM generation example is truncated in this excerpt)
generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
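
For example, assuming the model has been served with a recent vLLM build (e.g. `vllm serve neuralmagic/Phi-3-medium-128k-instruct-quantized.w8a8`) listening on the default port 8000, it can be queried with the standard OpenAI client. This sketch is ours, not part of the original card:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not check the API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="neuralmagic/Phi-3-medium-128k-instruct-quantized.w8a8",
    messages=[{"role": "user", "content": "Who are you?"}],
    max_tokens=256,
)
print(completion.choices[0].message.content)
```
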
### Use with transformers

The following example shows how the model can be deployed with Transformers using the `generate()` function.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "neuralmagic/Phi-3-medium-128k-instruct-quantized.w8a8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```

## Creation

The model was created with the [llm-compressor](https://github.com/vllm-project/llm-compressor) library; the snippet below is excerpted from the creation script (imports reconstructed from the classes it uses).

```python
import random

from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

model_id = "microsoft/Phi-3-medium-128k-instruct"

num_samples = 512
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

# Calibration data: 512 shuffled samples from the dataset referenced above.
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

# SmoothQuant (outlier smoothing) followed by GPTQ (W8A8 quantization).
recipe = [
    SmoothQuantModifier(
        smoothing_strength=0.8,
        mappings=[
            [["re:.*qkv_proj"], "re:.*input_layernorm"],
            [["re:.*gate_up_proj"], "re:.*post_attention_layernorm"],
        ],
    ),
    GPTQModifier(
        sequential=True,
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        dampening_frac=0.01,  # 1% damping factor
        observer="mse",       # MSE-minimizing scales
    ),
]

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    # ... (the excerpt ends here)
```