# Kimi K2.6 GGUF – Quantized by BatiAI
IQ3_XXS / IQ4_XS / Q5_K_M quantizations of moonshotai/Kimi-K2.6 (1T total / 32B active MoE), quantized directly from the official Moonshot FP8 weights by BatiAI.
## Why Kimi K2.6?
- 1T parameters (32B active) – frontier-class open-weight model
- SWE-Bench Pro 58.6 – beats GPT-5.4 xhigh (57.7), Claude Opus 4.6 max (53.4), Gemini 3.1 Pro (54.2)
- HLE 36.4% (no tools) / 55.5% (with tools) – Humanity's Last Exam frontier tier
- Agent swarm architecture – 300 sub-agents, 4,000 coordinated steps
- 256K native context (262,144 tokens) via YaRN scaling
- Native tool calling – search, code-interpreter, web-browsing (see the sketch after this list)
- Modified-MIT license – redistribution and fine-tuning allowed
- Released 2026-04-20 by Moonshot AI
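Tool calling rides on the standard OpenAI-style chat schema when the GGUF is served through llama.cpp bindings. A minimal sketch with llama-cpp-python, assuming a local IQ4_XS file; the `get_weather` schema is a made-up illustration, not something shipped with the model:

```python
# Hedged sketch: OpenAI-style tool calling via llama-cpp-python.
# "get_weather" is an illustrative schema, not part of the model.
from llama_cpp import Llama

llm = Llama(model_path="Kimi-K2.6-IQ4_XS.gguf", n_ctx=65536)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Weather in Paris right now?"}],
    tools=tools,
    tool_choice="auto",
)
# If the model elects to call the tool, the structured call shows up here:
print(response["choices"][0]["message"].get("tool_calls"))
```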
## Quick Start
```bash
# IQ4_XS (recommended balance, 546GB, M3 Ultra 512GB+)
ollama pull batiai/kimi-k2.6:iq4

# IQ3_XXS (smaller, 394GB, 384GB+ RAM)
ollama pull batiai/kimi-k2.6:iq3

# Q5_K_M (highest quality, 728GB, needs 768GB+ RAM)
ollama pull batiai/kimi-k2.6:q5
```
## Available Quantizations
| Quant | Size | Min RAM | Target Hardware | Notes |
|---|---|---|---|---|
| IQ3_XXS | 394GB | 384GB | M3 Ultra 512GB / H100 node | aggressive compression, imatrix-calibrated |
| IQ4_XS | 546GB | 512GB | M3 Ultra 512GB / 8× A100 80GB | recommended balance |
| Q5_K_M | 728GB | 768GB | 2× M3 Ultra / 8× A100 80GB / H100 node | highest quality, near-original |
⚠️ **Not for consumer Mac** – this is a workstation / server / frontier research model. 16-128GB Macs should use `batiai/qwen3.6-35b` or `batiai/minimax-m2.7` instead (see the comparison table below).
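As a rule of thumb, the minimum-RAM column reduces to a simple lookup. A convenience sketch using only the figures from the table above; `pick_quant` is a hypothetical helper, not an official BatiAI tool:

```python
# Convenience lookup built from the card's minimum-RAM column;
# pick_quant is a hypothetical helper, not an official BatiAI tool.
QUANTS = [           # (name, file size GB, minimum RAM GB)
    ("Q5_K_M", 728, 768),
    ("IQ4_XS", 546, 512),
    ("IQ3_XXS", 394, 384),
]

def pick_quant(ram_gb: int) -> str | None:
    """Highest-quality quant whose minimum RAM fits, else None."""
    for name, _size_gb, min_ram_gb in QUANTS:
        if ram_gb >= min_ram_gb:
            return name
    return None      # under 384GB: use a smaller BatiAI model instead

print(pick_quant(512))   # -> IQ4_XS
print(pick_quant(128))   # -> None
```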
## Hardware Reality Check
| Your System | IQ3 (394GB) | IQ4 (546GB) | Q5 (728GB) |
|---|---|---|---|
| Mac 128GB | ❌ Won't fit | ❌ | ❌ |
| Mac 192GB | ❌ Won't fit | ❌ | ❌ |
| Mac 256GB | ⚠️ Heavy swap (unusable) | ❌ | ❌ |
| Mac 384GB | ⚠️ Tight | ❌ | ❌ |
| Mac M3 Ultra 512GB | ✅ Comfortable | ✅ Usable (tight) | ❌ |
| 2× M3 Ultra (cluster) | ✅ | ✅ | ✅ |
| 8× A100 80GB (640GB total) | ✅ | ✅ Fast | ❌ |
| H100 node (640GB+) | ✅ Fast | ✅ Fast | ✅ Fast |
Numbers are based on MoE activation patterns – 32B active params × 4 bytes of buffer ≈ 130GB of runtime memory even after quantization, plus shard headers and the KV cache (at 256K context, the cache alone is 30-80GB).
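The same arithmetic, spelled out (a sketch reproducing the card's own figures, not a profiler measurement):

```python
# Back-of-the-envelope memory estimate from the figures above;
# a sketch, not a profiler measurement.
active_params = 32e9                 # 32B active params per token (MoE)
buffer_bytes_per_param = 4           # buffer width assumed in the text
runtime_buffer_gb = active_params * buffer_bytes_per_param / 1e9
print(f"runtime buffer ~= {runtime_buffer_gb:.0f} GB")  # ~128 ("~130GB")

# On top of that: quantized weights (394-728GB on disk), shard headers,
# and the KV cache, which the card puts at 30-80GB at full 256K context.
kv_cache_gb = (30, 80)
print(f"KV cache at 262,144 tokens: {kv_cache_gb[0]}-{kv_cache_gb[1]} GB")
```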
## What BatiAI's Quantization Delivers
| | BatiAI | unsloth / ubergarm |
|---|---|---|
| Source | Direct from official Moonshot FP8 weights | Same (major providers) |
| Quantization flow | FP8 → Q8_0 → IQ3_XXS/IQ4_XS with imatrix (wikitext-2 calibration, 200 chunks) | Similar |
| imatrix | ✅ 200 chunks (quality saturation point) | Varies |
| Tool-calling preservation | ✅ Native template preserved | ❌ |
| Korean validation | ✅ (pending benchmark on target hardware) | ❌ |
| BatiAI signature | ✅ `general.author=BatiAI`, `general.url=https://flow.bati.ai` | ❌ |
| Pipeline | Open source – `docs/202604-large-moe-quantization.md` | Internal |
## Model Comparison – BatiAI Model Lineup
Kimi K2.6 is for frontier workstation users. For everyone else:
| Your Hardware | Best BatiAI Model | Size |
|---|---|---|
| 16GB Mac | `batiai/gemma4-e4b:q4` | 4.9GB |
| 24GB Mac | `batiai/gemma4-26b:iq4` | 15GB |
| 48GB Mac | `batiai/qwen3.5-35b:iq4` | 22GB |
| 96GB Mac | `batiai/qwen3.6-35b:iq4` | 22GB |
| 128GB Mac | `batiai/minimax-m2.7:iq3` | 82GB |
| M3 Ultra 512GB / H100 | `batiai/kimi-k2.6:iq4` | 546GB |
## Benchmarks (source model)
Benchmark numbers are from Moonshot AI's official report – validating that aggressive quantization preserves these capabilities is still pending on our end (`bench.sh` on M3 Ultra / H100 target hardware).
| Benchmark | Kimi K2.6 | Comparison |
|---|---|---|
| SWE-Bench Pro | 58.6 | GPT-5.4 xhigh 57.7, Opus 4.6 max 53.4 |
| HLE (no tools) | 36.4% | frontier tier |
| HLE (w/ tools) | 55.5% | frontier tier |
| Context | 256K | YARN scaling |
| Native tool use | ✅ | search, code, web |
## Technical Details
- Original Model: moonshotai/Kimi-K2.6
- Architecture: Mixture of Experts – 1T total / 32B active, 61 layers, 384 experts (8 selected + 1 shared), MLA attention
- Original storage: FP8 / INT4 hybrid QAT (555GB)
- License: Modified-MIT
- Quantized with: llama.cpp
- Calibration: wikitext-2-raw, 200 chunks (quality saturation; see the pipeline sketch after this list)
- Quantized by: BatiAI
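The calibration recipe above maps onto llama.cpp's standard imatrix tools roughly as follows. This is a sketch based on the flow stated in this card, not BatiAI's exact invocation; file names are placeholders, and the real pipeline lives in the linked docs.

```python
# Hedged sketch of the FP8 -> Q8_0 -> IQ4_XS imatrix flow using the
# standard llama.cpp tools; paths and file names are placeholders.
import subprocess

# 1. High-precision intermediate (Q8_0) converted from the FP8 release.
#    (The safetensors -> GGUF step uses llama.cpp's convert_hf_to_gguf.py,
#    omitted here.)
q8_model = "Kimi-K2.6-Q8_0.gguf"

# 2. Importance matrix from wikitext-2, 200 chunks (the card's stated
#    quality-saturation point).
subprocess.run([
    "./llama-imatrix",
    "-m", q8_model,
    "-f", "wikitext-2-raw/wiki.train.raw",
    "-o", "kimi-k2.6.imatrix",
    "--chunks", "200",
], check=True)

# 3. Final low-bit quantization guided by the imatrix.
subprocess.run([
    "./llama-quantize",
    "--imatrix", "kimi-k2.6.imatrix",
    q8_model,
    "Kimi-K2.6-IQ4_XS.gguf",
    "IQ4_XS",
], check=True)
```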
## Usage

### llama.cpp
```bash
./llama-cli -m Kimi-K2.6-IQ4_XS.gguf \
  -p "Your prompt" \
  --ctx-size 65536 \
  --n-gpu-layers 99
```
### Ollama

```bash
ollama run batiai/kimi-k2.6:iq4
```
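To drive the same tag from code, the official `ollama` Python client works as-is (a minimal sketch; it assumes a local Ollama server and a completed `ollama pull`):

```python
# Minimal sketch using the official `ollama` Python client;
# assumes `ollama pull batiai/kimi-k2.6:iq4` has completed.
import ollama

response = ollama.chat(
    model="batiai/kimi-k2.6:iq4",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response["message"]["content"])
```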
### vLLM / TGI

Not directly compatible – vLLM and TGI serve FP8/BF16 safetensors, not GGUF. Use the original moonshotai/Kimi-K2.6 for vLLM.
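If you do need vLLM, load the original checkpoint directly. A sketch of the offline API; `tensor_parallel_size` and `trust_remote_code` are assumptions that depend on your cluster and the upstream repo:

```python
# GGUF files are not served here; load the original FP8 release instead.
# Sketch only: tensor_parallel_size must match your actual GPU count,
# and trust_remote_code is assumed for the custom architecture.
from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-K2.6",
    tensor_parallel_size=8,
    trust_remote_code=True,
)
out = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```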
## About BatiAI
BatiAI quantizes frontier open-weight models with validated quality and transparent provenance. We built BatiFlow, a free, on-device AI automation tool for Mac, and we open-source our full quantization pipeline.
The Kimi K2.6 release demonstrates that our pipeline handles 1T+ MoE models (most quantization providers stop at 70B). See our Kimi K2.6 quantization notes for the engineering trade-offs.
## License
Quantized from moonshotai/Kimi-K2.6. License: Modified-MIT – commercial use and redistribution allowed.
## Use with llama-cpp-python

The Hugging Face quick-start snippet, reassembled; pass one of the GGUF file names from the quantization table as `filename`:

```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="batiai/Kimi-K2.6-GGUF",
    filename="",  # e.g. "Kimi-K2.6-IQ4_XS.gguf"
)

llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)
```