# Phi-3 Medium 128k Instruct (GGUF)
This repository provides GGUF quantized models for Microsoft's Phi-3 Medium 128k Instruct.
Quantization was performed using llama.cpp with Importance Matrix (IQ) techniques to optimize VRAM usage, inference speed, and model quality.
## 📊 Benchmarks (NVIDIA T4)
Benchmarks were collected on Google Colab (Tesla T4) using llama-bench and llama-perplexity.
| Quant | Size | Speed | Perplexity (WikiText-2) | Notes |
|---|---|---|---|---|
| IQ4_XS | 7.02 GB | 24.40 t/s | 4.64 | ✅ Best overall |
| Q5_K_M | 9.38 GB | 13.04 t/s | 4.60 | High VRAM |
| IQ3_M | 6.03 GB | 14.77 t/s | 6.36 | Low VRAM |
| Q2_K | 4.79 GB | 17.73 t/s | 76.01 | ❌ Unusable |
Notes
- IQ4_XS delivers ~87% higher throughput than Q5_K_M with negligible quality loss.
- 2-bit quantization causes severe degradation on Phi-3 Medium.
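The throughput comparison can be checked directly against the numbers in the table above:

```python
# Throughput figures from the benchmark table above (tokens/sec on a T4)
iq4_xs_tps = 24.40
q5_k_m_tps = 13.04

gain_pct = (iq4_xs_tps / q5_k_m_tps - 1) * 100
print(f"IQ4_XS is ~{gain_pct:.0f}% faster than Q5_K_M")  # ~87%
```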
## 📦 Available Files
| File | Quant | Size | Est. RAM | Use Case |
|---|---|---|---|---|
| Phi-3-medium-128k-instruct-Q5_K_M.gguf | Q5_K_M | 10.0 GB | ~12 GB | Max quality |
| Phi-3-medium-128k-instruct-IQ4_XS.gguf | IQ4_XS | 8.0 GB | ~10 GB | Recommended |
| Phi-3-medium-128k-instruct-IQ3_M.gguf | IQ3_M | 6.5 GB | ~8 GB | Low VRAM |
| Phi-3-medium-128k-instruct-Q2_K.gguf | Q2_K | 4.8 GB | ~6 GB | Testing only |
RAM estimates include KV-cache overhead.
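As a rough sketch of where the KV-cache overhead comes from, assuming Phi-3 Medium's published config (40 layers, 10 KV heads via grouped-query attention, head dimension 128) and an fp16 cache; the helper function name is illustrative:

```python
def kv_cache_bytes(n_ctx: int, n_layers: int = 40, n_kv_heads: int = 10,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: a K and a V tensor per layer, fp16."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

print(kv_cache_bytes(4096) / 2**30)  # ~0.78 GiB at a 4k context
```

At the full 128k context the fp16 cache alone would be roughly 25 GiB, which is why these estimates assume modest context sizes.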
## 💡 Quantization Info
Importance Matrix (IQ) quantization uses calibration data to preserve the most important weights, significantly reducing quality loss compared to standard K-quant methods at similar sizes.
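One way to sanity-check a quant's file size is its effective bits per weight. A minimal sketch, assuming Phi-3 Medium's ~14B parameters and treating GB as 10^9 bytes (the helper name is illustrative):

```python
def bits_per_weight(file_size_gb: float, n_params_billions: float = 14.0) -> float:
    # Bits on disk divided by parameter count (in billions)
    return file_size_gb * 8 / n_params_billions

print(f"{bits_per_weight(7.02):.2f}")  # IQ4_XS lands near 4 bits/weight
```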
**Recommended: IQ4_XS**
Fits in ~8GB VRAM and sustains 20+ tokens/sec on consumer GPUs (RTX 3060 / 3070 / 4060).
## 🚀 Usage
> ⚠️ Choose **one** of the following options; they are alternatives, not sequential steps.
### Option 1: CLI (llama.cpp)
Use this if you want an interactive terminal session.
```shell
./llama-cli \
  -m Phi-3-medium-128k-instruct-IQ4_XS.gguf \
  -n -1 \
  --color \
  -cnv
```
### Option 2: Python (llama-cpp-python)
```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU
llm = Llama(
    model_path="./Phi-3-medium-128k-instruct-IQ4_XS.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    verbose=True,
)

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum physics."},
    ]
)
print(output["choices"][0]["message"]["content"])
```
Base model: microsoft/Phi-3-medium-128k-instruct