Phi-tiny-MoE-instruct GGUF

GGUF quantized version of microsoft/Phi-tiny-MoE-instruct for local inference with llama.cpp, Ollama, LM Studio, and GPT4All.

Phi-tiny-MoE is Microsoft's efficient Mixture-of-Experts language model (16 experts, 2 active per token), delivering strong instruction-following performance in a compact, fast package.

Available Quantizations

File                             Quant   Size     RAM Needed   Quality
Phi-tiny-MoE-instruct-Q8_0.gguf  Q8_0    4.0 GB   ~6 GB        Near-lossless

How to Use

With llama.cpp

./llama-cli -m Phi-tiny-MoE-instruct-Q8_0.gguf -p "Explain quantum computing in simple terms" -n 512
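For longer sessions, llama.cpp can also serve the model over an OpenAI-compatible HTTP API with llama-server (the port below is arbitrary; adjust paths to where you downloaded the file):

```shell
# Serve the GGUF locally; -c sets the context window (the model supports 4096 tokens).
./llama-server -m Phi-tiny-MoE-instruct-Q8_0.gguf -c 4096 --port 8080

# Then, from another terminal, query the OpenAI-compatible chat endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}]}'
```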

With Ollama

echo 'FROM ./Phi-tiny-MoE-instruct-Q8_0.gguf' > Modelfile
ollama create phi-tiny-moe -f Modelfile
ollama run phi-tiny-moe
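If you want to pin sampling settings in the Modelfile rather than at run time, a slightly fuller sketch looks like this (the parameter values are illustrative, not tuned for this model):

```shell
# Write a Modelfile with explicit generation parameters, then build the model.
cat > Modelfile <<'EOF'
FROM ./Phi-tiny-MoE-instruct-Q8_0.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
ollama create phi-tiny-moe -f Modelfile
```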

With LM Studio

  1. Download the Q8_0 file
  2. Open LM Studio → Load Model → Select the file
  3. Start chatting

Model Details

  • Architecture: PhiMoE (Mixture of Experts)
  • Total Parameters: ~4B
  • Total Experts: 16
  • Active Experts per Token: 2
  • Hidden Size: 4096
  • Layers: 32
  • Attention Heads: 16
  • Context Length: 4096 tokens
  • License: MIT
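The ~6 GB RAM figure in the table above can be sanity-checked from these details: the Q8_0 file itself is 4.0 GB, and a full-context fp16 KV cache adds roughly 2 GiB more. A minimal sketch of that arithmetic, assuming full multi-head attention (if the model uses grouped-query attention, the cache would be smaller):

```python
# Rough RAM estimate for running the Q8_0 GGUF at full context.
# Figures come from the model card: 32 layers, hidden size 4096,
# 4096-token context, 4.0 GB file. Assumes an fp16 KV cache (2 bytes
# per element) with no grouped-query sharing.
GIB = 1024 ** 3

model_file_gb = 4.0   # Q8_0 file size
layers = 32
hidden = 4096
ctx = 4096
kv_bytes = 2          # fp16

# K and V each store (layers x ctx x hidden) elements.
kv_cache_gb = 2 * layers * ctx * hidden * kv_bytes / GIB

total_gb = model_file_gb + kv_cache_gb
print(f"KV cache: {kv_cache_gb:.1f} GiB, total: ~{total_gb:.1f} GB")
# → KV cache: 2.0 GiB, total: ~6.0 GB
```

This ignores activation buffers and runtime overhead, so treat it as a floor rather than an exact requirement.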

Original Model

Built by Microsoft Research. See the original at microsoft/Phi-tiny-MoE-instruct.

Quantized by

Shaswata Tripathy | GitHub | Medium | LinkedIn | Hugging Face
