Phi-tiny-MoE-instruct GGUF

GGUF quantized version of microsoft/Phi-tiny-MoE-instruct for local inference with llama.cpp, Ollama, LM Studio, and GPT4All.

Phi-tiny-MoE is Microsoft's efficient Mixture-of-Experts language model (16 experts, 2 active per token), delivering strong instruction-following performance in a compact, fast package.

Available Quantizations

File                             Quant   Size     RAM Needed   Quality
Phi-tiny-MoE-instruct-Q8_0.gguf  Q8_0    4.0 GB   ~6 GB        Near-lossless

How to Use

With llama.cpp

./llama-cli -m Phi-tiny-MoE-instruct-Q8_0.gguf -p "Explain quantum computing in simple terms" -n 512
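For longer sessions, llama.cpp can also serve the model over an OpenAI-compatible HTTP API with llama-server (the port below is arbitrary; adjust paths to where you downloaded the file):

```shell
# Serve the GGUF locally; -c sets the context window (the model supports 4096 tokens).
./llama-server -m Phi-tiny-MoE-instruct-Q8_0.gguf -c 4096 --port 8080

# Then, from another terminal, query the OpenAI-compatible chat endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}]}'
```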

With Ollama

echo 'FROM ./Phi-tiny-MoE-instruct-Q8_0.gguf' > Modelfile
ollama create phi-tiny-moe -f Modelfile
ollama run phi-tiny-moe
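If you want to pin sampling settings in the Modelfile rather than at run time, a slightly fuller sketch looks like this (the parameter values are illustrative, not tuned for this model):

```shell
# Write a Modelfile with explicit generation parameters, then build the model.
cat > Modelfile <<'EOF'
FROM ./Phi-tiny-MoE-instruct-Q8_0.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
ollama create phi-tiny-moe -f Modelfile
```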

With LM Studio

  1. Download the Q8_0 file
  2. Open LM Studio → Load Model → Select the file
  3. Start chatting

Model Details

  • Architecture: PhiMoE (Mixture of Experts)
  • Total Parameters: ~4B
  • Total Experts: 16
  • Active Experts per Token: 2
  • Hidden Size: 4096
  • Layers: 32
  • Attention Heads: 16
  • Context Length: 4096 tokens
  • License: MIT
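The ~6 GB RAM figure in the table above can be sanity-checked from these details: the Q8_0 file itself is 4.0 GB, and a full-context fp16 KV cache adds roughly 2 GiB more. A minimal sketch of that arithmetic, assuming full multi-head attention (if the model uses grouped-query attention, the cache would be smaller):

```python
# Rough RAM estimate for running the Q8_0 GGUF at full context.
# Figures come from the model card: 32 layers, hidden size 4096,
# 4096-token context, 4.0 GB file. Assumes an fp16 KV cache (2 bytes
# per element) with no grouped-query sharing.
GIB = 1024 ** 3

model_file_gb = 4.0   # Q8_0 file size
layers = 32
hidden = 4096
ctx = 4096
kv_bytes = 2          # fp16

# K and V each store (layers x ctx x hidden) elements.
kv_cache_gb = 2 * layers * ctx * hidden * kv_bytes / GIB

total_gb = model_file_gb + kv_cache_gb
print(f"KV cache: {kv_cache_gb:.1f} GiB, total: ~{total_gb:.1f} GB")
# → KV cache: 2.0 GiB, total: ~6.0 GB
```

This ignores activation buffers and runtime overhead, so treat it as a floor rather than an exact requirement.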

Original Model

Built by Microsoft Research. See the original at microsoft/Phi-tiny-MoE-instruct.

Quantized by

Shaswata Tripathy | GitHub | Medium | LinkedIn | Hugging Face
