# Phi-3 Medium 128k Instruct (GGUF)
This repository provides GGUF quantized models for Microsoft's Phi-3 Medium 128k Instruct.
Quantization was performed using llama.cpp with Importance Matrix (IQ) techniques to optimize VRAM usage, inference speed, and model quality.
## 📊 Benchmarks (NVIDIA T4)
Benchmarks were collected on Google Colab (Tesla T4) using llama-bench and llama-perplexity.
| Quant | Size | Speed | Perplexity (WikiText-2) | Notes |
|---|---|---|---|---|
| IQ4_XS | 7.02 GB | 24.40 t/s | 4.64 | ✅ Best overall |
| Q5_K_M | 9.38 GB | 13.04 t/s | 4.60 | High VRAM |
| IQ3_M | 6.03 GB | 14.77 t/s | 6.36 | Low VRAM |
| Q2_K | 4.79 GB | 17.73 t/s | 76.01 | ❌ Unusable |
Notes
- IQ4_XS delivers ~87% higher throughput than Q5_K_M with negligible quality loss.
- 2-bit quantization causes severe degradation on Phi-3 Medium.
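The throughput comparison can be checked directly against the numbers in the table above:

```python
# Throughput figures from the benchmark table above (tokens/sec on a T4)
iq4_xs_tps = 24.40
q5_k_m_tps = 13.04

gain_pct = (iq4_xs_tps / q5_k_m_tps - 1) * 100
print(f"IQ4_XS is ~{gain_pct:.0f}% faster than Q5_K_M")  # ~87%
```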
## 📦 Available Files
| File | Quant | Size | Est. RAM | Use Case |
|---|---|---|---|---|
| Phi-3-medium-128k-instruct-Q5_K_M.gguf | Q5_K_M | 10.0 GB | ~12 GB | Max quality |
| Phi-3-medium-128k-instruct-IQ4_XS.gguf | IQ4_XS | 8.0 GB | ~10 GB | Recommended |
| Phi-3-medium-128k-instruct-IQ3_M.gguf | IQ3_M | 6.5 GB | ~8 GB | Low VRAM |
| Phi-3-medium-128k-instruct-Q2_K.gguf | Q2_K | 4.8 GB | ~6 GB | Testing only |
RAM estimates include KV-cache overhead.
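As a rough sketch of where the KV-cache overhead comes from, assuming Phi-3 Medium's published config (40 layers, 10 KV heads via grouped-query attention, head dimension 128) and an fp16 cache; the helper function name is illustrative:

```python
def kv_cache_bytes(n_ctx: int, n_layers: int = 40, n_kv_heads: int = 10,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: a K and a V tensor per layer, fp16."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

print(kv_cache_bytes(4096) / 2**30)  # ~0.78 GiB at a 4k context
```

At the full 128k context the fp16 cache alone would be roughly 25 GiB, which is why these estimates assume modest context sizes.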
## 💡 Quantization Info
Importance Matrix (IQ) quantization uses calibration data to preserve the most important weights, significantly reducing quality loss compared to standard K-quant methods at similar sizes.
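One way to sanity-check a quant's file size is its effective bits per weight. A minimal sketch, assuming Phi-3 Medium's ~14B parameters and treating GB as 10^9 bytes (the helper name is illustrative):

```python
def bits_per_weight(file_size_gb: float, n_params_billions: float = 14.0) -> float:
    # Bits on disk divided by parameter count (in billions)
    return file_size_gb * 8 / n_params_billions

print(f"{bits_per_weight(7.02):.2f}")  # IQ4_XS lands near 4 bits/weight
```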
**Recommended: IQ4_XS**
Fits in ~8GB VRAM and sustains 20+ tokens/sec on consumer GPUs (RTX 3060 / 3070 / 4060).
## 🚀 Usage
> ⚠️ Choose **one** of the following options; they are alternatives, not sequential steps.
### Option 1: CLI (llama.cpp)
Use this if you want an interactive terminal session.
```shell
./llama-cli \
  -m Phi-3-medium-128k-instruct-IQ4_XS.gguf \
  -n -1 \
  --color \
  -cnv
```
### Option 2: Python (llama-cpp-python)
```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU
llm = Llama(
    model_path="./Phi-3-medium-128k-instruct-IQ4_XS.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    verbose=True,
)

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum physics."},
    ]
)
print(output["choices"][0]["message"]["content"])
```
Base model: microsoft/Phi-3-medium-128k-instruct