# Kimi K2.6 — JANGTQ_3L (vMLX / Apple Silicon)
3-bit MXTQ quantized build of Moonshot AI's Kimi K2.6 for the vMLX inference engine on Apple Silicon. Routed MoE experts are quantized to 3-bit MXTQ (rotation + Lloyd-Max codebook + per-row packed indices + fp16 norms); attention, routers, shared experts, embeddings, and `lm_head` remain at fp16.
For the upstream Moonshot model card (architecture, license, intended use, evaluations), see `README_UPSTREAM.md`.
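MXTQ's internals live in `jang_tools` and are not documented here, but the Lloyd-Max codebook step is classical scalar quantization: alternate between assigning each value to its nearest level and moving each level to the mean of its cell. A minimal NumPy sketch of that idea for a 3-bit (8-level) codebook — `lloyd_max_codebook` is an illustrative name, and the real pipeline also applies a rotation first and keeps fp16 per-row norms:

```python
import numpy as np

def lloyd_max_codebook(w: np.ndarray, bits: int = 3, iters: int = 20):
    """Fit a 2**bits-level Lloyd-Max codebook to one weight row (sketch)."""
    levels = 2 ** bits
    # Start the levels at evenly spaced quantiles of the data.
    codebook = np.quantile(w, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        # Partition: nearest-level assignment (thresholds = level midpoints).
        idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
        # Update: move each level to the conditional mean of its cell.
        for k in range(levels):
            if np.any(idx == k):
                codebook[k] = w[idx == k].mean()
    idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx  # idx values fit in 3 bits

row = np.random.randn(7168).astype(np.float32)
codebook, idx = lloyd_max_codebook(row)
dequant = codebook[idx]  # reconstruction used at inference time
```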
## Bundle facts

| Field | Value |
| --- | --- |
| Format | JANGTQ_3L (vMLX) |
| Routed experts | 3-bit MXTQ, per-row packed |
| Other weights | fp16 |
| Pruning | REAP-30 (routing-aware, ~30% of routed experts dropped) |
| Total size | ~288 GB (304 safetensors shards) |
| Source size | ~554 GB (FP8) |
| Size vs. source | ~52% (288 / 554 GB) |
## Hardware requirements
- Apple Silicon Mac with ≥ 512 GB unified memory (tested on M3 Ultra 512 GB).
- Smaller memory configs will not load this bundle.
## Build provenance

Produced by `build_kimi_jangtq3l.sh` (included in this directory). Pipeline:
- `jangreap.py` — streaming layer-by-layer REAP saliency profile (~25 GB peak; replaces `profile.py`, which OOMs on the FP8 source).
- `prune.py` — drops the lowest-saliency 30% of routed experts per layer.
- `convert_kimi_jangtq --profile 3L` — quantizes the pruned model to MXTQ 3-bit (per-row pack handles `in_feat % vals_per_u32 != 0`, e.g. 7168 / 2048 with 3 bits; a packing sketch follows the patch list under Reproducing).
- Tokenizer + `chat_template` finalization.
- `generation_config.json` patch — Kimi turn-boundary IDs (sketched below this list):
  - `<|im_end|>` = 163586
  - `<|im_user|>` = 163587
  - `<|im_assistant|>` = 163588
  - `eos_token_id = [163586, 163587, 163588]`
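The ID patch itself is mechanical. A minimal sketch of what that final step amounts to, assuming the file layout of this bundle (the script's actual implementation may differ):

```python
import json
from pathlib import Path

cfg_path = Path("generation_config.json")  # in the bundle root
cfg = json.loads(cfg_path.read_text())

# Kimi turn-boundary token IDs listed above.
cfg["eos_token_id"] = [163586, 163587, 163588]  # <|im_end|>, <|im_user|>, <|im_assistant|>

cfg_path.write_text(json.dumps(cfg, indent=2))
```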
## Reproducing

```bash
./build_kimi_jangtq3l.sh apply-patches   # idempotent jang_tools patches
./build_kimi_jangtq3l.sh all-pruned      # download → prune → convert → finalize → patch
```
The script applies three required patches to the bundled `jang_tools` install:

- `turboquant/linear.py` — per-row pack in `tq_quantize_weight` (sketched below)
- `turboquant/linear.py` — per-row unpack in `TurboQuantSwitchLinear._dequant_experts`
- `load_jangtq.py` — read `in_features`/`input_dims` from the existing module (avoids overshoot when the per-row pad is used: 7168 → 7170)

The marker comment `# JANG3L_PATCH_v1` makes patch application idempotent; `rollback-patches` restores the `.jang3l.bak` backups.
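The per-row pad these patches handle is easy to see: a `uint32` holds `32 // 3 = 10` three-bit indices, so a 7168-wide row pads to the next multiple of 10 (7170) before packing. A hedged NumPy sketch of that arithmetic — `pack_row`/`unpack_row` are illustrative names, not the `jang_tools` API:

```python
import numpy as np

BITS = 3
VALS_PER_U32 = 32 // BITS  # 10 three-bit indices fit in one uint32

def pack_row(idx: np.ndarray) -> np.ndarray:
    """Pack one row of 3-bit codebook indices into uint32 words."""
    pad = (-len(idx)) % VALS_PER_U32           # 7168 -> 2 pad values -> 7170
    idx = np.pad(idx, (0, pad)).astype(np.uint32)
    words = idx.reshape(-1, VALS_PER_U32)      # one uint32 per group of 10
    shifts = np.arange(VALS_PER_U32, dtype=np.uint32) * BITS
    return np.bitwise_or.reduce(words << shifts, axis=1)

def unpack_row(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse: expand uint32 words to n indices, dropping the pad."""
    shifts = np.arange(VALS_PER_U32, dtype=np.uint32) * BITS
    vals = (packed[:, None] >> shifts) & ((1 << BITS) - 1)
    return vals.reshape(-1)[:n]                # 7170 -> first 7168 survive

row = np.random.randint(0, 8, size=7168)       # 3-bit indices
assert np.array_equal(unpack_row(pack_row(row), 7168), row)
```

This is also why the `load_jangtq.py` patch reads `in_features` from the existing module: deriving it from the packed width would overshoot to 7170.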
## Serving

```bash
./build_kimi_jangtq3l.sh serve

# or directly:
python -m vmlx_engine.cli serve \
  --model ~/.cache/huggingface/hub/deviad/Kimi-K2.6-JANGTQ_3L \
  --port 8012 \
  --max-tokens 4096 \
  --default-temperature 0.5 \
  --default-top-p 0.9 \
  --default-repetition-penalty 1.1 \
  --tool-call-parser moonshot \
  --enable-auto-tool-choice
```
OpenAI-compatible endpoints are served at `http://127.0.0.1:8012/v1/...`.
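A quick smoke test against the chat completions endpoint; the `model` value here is an assumption, so check `/v1/models` for the name the server actually reports:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8012/v1/chat/completions",
    json={
        "model": "Kimi-K2.6-JANGTQ_3L",  # assumption: verify via GET /v1/models
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```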
## Known caveats

- Paged KV cache is incompatible with this build. Kimi uses MLA attention with a `CacheList` layout that the paged-cache path does not handle, producing degenerate output (e.g. only `!` tokens). Do not pass `--use-paged-cache`, `--enable-block-disk-cache`, `--paged-cache-block-size`, or `--max-cache-blocks`.
- 3-bit MXTQ + per-row pack is slower per token than 2-bit affine routes; quality is the tradeoff.
- Tokenizer is tiktoken-based (no `tokenizer.json`); `trust_remote_code=True` is required (see the loading sketch below).
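For loading the tokenizer outside vMLX, a minimal `transformers` sketch, assuming the default download location used in the serve command above:

```python
from pathlib import Path
from transformers import AutoTokenizer

# trust_remote_code=True is required: the tokenizer is defined by the bundled
# kimi_k25_*.py files plus tiktoken.model, not by a tokenizer.json.
bundle = Path.home() / ".cache/huggingface/hub/deviad/Kimi-K2.6-JANGTQ_3L"
tok = AutoTokenizer.from_pretrained(bundle, trust_remote_code=True)
print(tok.encode("Hello, Kimi"))
```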
## Files

- `build_kimi_jangtq3l.sh` — self-contained build script (patches + pipeline + serve).
- `model-*.safetensors` (304 shards) — quantized weights.
- `config.json`, `generation_config.json`, `chat_template.jinja`, `kimi_k25_*.py`, `tiktoken.model` — model + tokenizer config.
- `jang_config.json` — vMLX/jang_tools profile metadata.
- `README_UPSTREAM.md` — original Moonshot model card.
- `LICENSE` — modified MIT (inherited from upstream).
## License

Modified MIT, inherited from the upstream Moonshot Kimi K2.6 release. See `LICENSE` and `README_UPSTREAM.md`.
## Credits

- Upstream model: Moonshot AI — Kimi K2.6.
- Quantization toolchain: `jang_tools` / vMLX (JANGQ-AI).
- This bundle: produced locally on an M3 Ultra 512 GB by dvd.pugliese@gmail.com.