# Phi-tiny-MoE-instruct GGUF
GGUF quantized version of microsoft/Phi-tiny-MoE-instruct for local inference with llama.cpp, Ollama, LM Studio, and GPT4All.
Phi-tiny-MoE is Microsoft's efficient Mixture-of-Experts language model: 16 experts with 2 active per token, delivering strong instruction-following performance in a compact, fast package.
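The "2 active of 16 experts" routing can be sketched roughly as follows. This is a minimal illustration of generic top-2 MoE gating, not Microsoft's actual router implementation; in the real model, routing happens per layer over hidden states, and the router logits come from a learned projection.

```python
import math

def top2_route(logits):
    """Pick the two highest-scoring experts and softmax-normalize
    their weights over just those two (generic top-2 gating sketch)."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top = ranked[:2]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router scores for one token over 16 experts;
# only the two best-scoring experts would run for this token.
scores = [0.1] * 16
scores[3], scores[11] = 2.0, 1.0
print(top2_route(scores))  # experts 3 and 11, with expert 3 weighted ~0.73
```

Because only 2 of 16 expert FFNs run per token, the per-token compute is far below what the total parameter count suggests, which is why MoE models of this kind feel fast at inference time.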
## Available Quantizations
| File | Quant | Size | RAM needed | Quality |
|---|---|---|---|---|
| Phi-tiny-MoE-instruct-Q8_0.gguf | Q8_0 | 4.0 GB | ~6 GB | Near-lossless |
## How to Use
### With llama.cpp

```shell
./llama-cli -m Phi-tiny-MoE-instruct-Q8_0.gguf -p "Explain quantum computing in simple terms" -n 512
```
### With Ollama

```shell
echo 'FROM ./Phi-tiny-MoE-instruct-Q8_0.gguf' > Modelfile
ollama create phi-tiny-moe -f Modelfile
ollama run phi-tiny-moe
```
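A Modelfile can also carry inference parameters and a system prompt. The values below are illustrative defaults, not tuned recommendations for this model:

```
FROM ./Phi-tiny-MoE-instruct-Q8_0.gguf

# Match the model's 4096-token context window
PARAMETER num_ctx 4096
# Illustrative sampling settings; adjust to taste
PARAMETER temperature 0.7
PARAMETER top_p 0.9

SYSTEM You are a helpful assistant.
```

After editing the Modelfile, rebuild with `ollama create phi-tiny-moe -f Modelfile`.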
### With LM Studio

1. Download the Q8_0 file
2. Open LM Studio → Load Model → select the file
3. Start chatting
## Model Details
- Architecture: PhiMoE (Mixture of Experts)
- Total Experts: 16
- Active Experts per Token: 2
- Hidden Size: 4096
- Layers: 32
- Attention Heads: 16
- Context Length: 4096 tokens
- License: MIT
## Original Model
Built by Microsoft Research. See the original at microsoft/Phi-tiny-MoE-instruct.
## Quantized by
Shaswata Tripathy | GitHub | Medium | LinkedIn | Hugging Face