Phi-3 Medium 128k Instruct (GGUF)


This repository provides GGUF quantized models for Microsoft Phi-3 Medium 128k Instruct (14B parameters, `phi3` architecture).

Quantization was performed using llama.cpp with Importance Matrix (IQ) techniques to optimize VRAM usage, inference speed, and model quality.


πŸ“Š Benchmarks (NVIDIA T4)

Benchmarks were collected on Google Colab (Tesla T4) using llama-bench and llama-perplexity.

| Quant  | Size    | Speed     | Perplexity (WT2) | Notes           |
|--------|---------|-----------|------------------|-----------------|
| IQ4_XS | 7.02 GB | 24.40 t/s | 4.64             | βœ… Best overall |
| Q5_K_M | 9.38 GB | 13.04 t/s | 4.60             | High VRAM       |
| IQ3_M  | 6.03 GB | 14.77 t/s | 6.36             | Low VRAM        |
| Q2_K   | 4.79 GB | 17.73 t/s | 76.01            | ❌ Unusable     |

Notes

  • IQ4_XS delivers ~87% higher throughput than Q5_K_M with negligible quality loss (perplexity 4.64 vs. 4.60).
  • 2-bit quantization causes severe degradation on Phi-3 Medium.
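The throughput claim above can be sanity-checked directly from the benchmark table; a quick sketch:

```python
# Benchmark figures from the table above (Tesla T4, llama-bench)
iq4_xs_tps = 24.40  # tokens/sec, IQ4_XS
q5_k_m_tps = 13.04  # tokens/sec, Q5_K_M

# Relative throughput gain of IQ4_XS over Q5_K_M
gain = iq4_xs_tps / q5_k_m_tps - 1.0
print(f"IQ4_XS is ~{gain:.0%} faster than Q5_K_M")  # ~87%
```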

πŸ“¦ Available Files

| File | Quant | Size | Est. RAM | Use Case |
|------|-------|------|----------|----------|
| Phi-3-medium-128k-instruct-Q5_K_M.gguf | Q5_K_M | 10.0 GB | ~12 GB | Max quality |
| Phi-3-medium-128k-instruct-IQ4_XS.gguf | IQ4_XS | 8.0 GB  | ~10 GB | Recommended |
| Phi-3-medium-128k-instruct-IQ3_M.gguf  | IQ3_M  | 6.5 GB  | ~8 GB  | Low VRAM    |
| Phi-3-medium-128k-instruct-Q2_K.gguf   | Q2_K   | 4.8 GB  | ~6 GB  | Testing only |

RAM estimates include KV-cache overhead.
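The KV-cache portion of those estimates can be approximated from the model's layout. A minimal sketch, assuming the architecture numbers from the published Phi-3 Medium config (40 layers, 10 KV heads via grouped-query attention, head dim 128) and llama.cpp's default fp16 cache:

```python
# Assumed Phi-3 Medium architecture (from the published model config):
N_LAYERS, N_KV_HEADS, HEAD_DIM = 40, 10, 128
BYTES_PER_ELEM = 2  # fp16 K/V cache (llama.cpp default)

def kv_cache_bytes(n_ctx: int) -> int:
    """K and V tensors for every layer at full context length."""
    return 2 * N_LAYERS * n_ctx * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM

print(f"{kv_cache_bytes(4096) / 2**30:.2f} GiB at n_ctx=4096")  # 0.78 GiB
```

The remainder of the gap between file size and the RAM estimates is compute buffers and framework overhead; the cache grows linearly with `n_ctx`, which matters at 128k contexts.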


πŸ’‘ Quantization Info

Importance Matrix (IQ) quantization uses calibration data to preserve the most important weights, significantly reducing quality loss compared to standard K-quant methods at similar sizes.
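The pipeline looks roughly like the following sketch using llama.cpp's `llama-imatrix` and `llama-quantize` tools (file names and the calibration text are placeholders, not the actual files used for this repo):

```shell
# 1. Compute an importance matrix from calibration text (placeholder paths)
./llama-imatrix -m Phi-3-medium-128k-instruct-f16.gguf \
    -f calibration.txt -o imatrix.dat

# 2. Quantize, weighting important tensors via the matrix
./llama-quantize --imatrix imatrix.dat \
    Phi-3-medium-128k-instruct-f16.gguf \
    Phi-3-medium-128k-instruct-IQ4_XS.gguf IQ4_XS
```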

Recommended: IQ4_XS
Fits in ~8GB VRAM and sustains 20+ tokens/sec on consumer GPUs (RTX 3060 / 3070 / 4060).


πŸš€ Usage

⚠️ Choose ONE of the following options. Do NOT run both.


Option 1: CLI (llama.cpp)

Use this if you want an interactive terminal session.

```shell
# -n -1: generate until the model stops; -cnv: conversation (chat) mode
./llama-cli \
  -m Phi-3-medium-128k-instruct-IQ4_XS.gguf \
  -n -1 \
  --color \
  -cnv
```

Option 2: Python (llama-cpp-python)

```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU
llm = Llama(
    model_path="./Phi-3-medium-128k-instruct-IQ4_XS.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    verbose=True,
)

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum physics."},
    ]
)

print(output["choices"][0]["message"]["content"])
```
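`create_chat_completion` applies the chat template embedded in the GGUF automatically. If you instead use raw `create_completion` calls, the messages must be flattened by hand; this helper is a sketch assuming the standard Phi-3 `<|role|> … <|end|>` template:

```python
def format_phi3_prompt(messages):
    """Flatten chat messages into the Phi-3 prompt template
    (assumed: '<|role|>\\n{content}<|end|>\\n' per turn,
    closed by '<|assistant|>\\n' to cue the model's reply)."""
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    parts.append("<|assistant|>\n")
    return "".join(parts)

prompt = format_phi3_prompt(
    [{"role": "user", "content": "Explain quantum physics."}]
)
# Pass the result to llm.create_completion(prompt, ...)
```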