Kimi K2.6 — JANGTQ_3L (vMLX / Apple Silicon)

A 3-bit MXTQ-quantized build of Moonshot AI's Kimi K2.6 for the vMLX inference engine on Apple Silicon. Routed MoE experts are quantized to 3-bit MXTQ (rotation + Lloyd-Max codebook + per-row packed indices + fp16 norms); attention, routers, shared experts, embeddings, and lm_head remain at fp16.
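
For intuition, here is a minimal NumPy sketch of the dequant side of that scheme (the rotation step is omitted). The names (packed, codebook, norm) and the exact bit layout (10 three-bit indices per u32 word, low bits first) are illustrative assumptions, not jang_tools internals:

import numpy as np

BITS = 3
VALS_PER_U32 = 32 // BITS  # 10 three-bit indices per u32 word (2 bits unused)

def dequant_row(packed, codebook, norm, in_feat):
    """Unpack one row's u32 words into 3-bit indices, look up the
    Lloyd-Max centroids, and rescale by the fp16 row norm."""
    idx = np.empty(len(packed) * VALS_PER_U32, dtype=np.uint32)
    for slot in range(VALS_PER_U32):
        idx[slot::VALS_PER_U32] = (packed >> (BITS * slot)) & 0b111
    # Drop per-row padding slots (e.g. 7170 unpacked slots -> 7168 real features).
    return codebook[idx[:in_feat]].astype(np.float32) * norm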

For the upstream Moonshot model card (architecture, license, intended use, evaluations), see README_UPSTREAM.md.

Bundle facts

  • Format: JANGTQ_3L (vMLX)
  • Routed experts: 3-bit MXTQ, per-row packed
  • Other weights: fp16
  • Pruning: REAP-30 (routing-aware, ~30% of routed experts dropped)
  • Total size: ~288 GB (304 safetensors shards)
  • Source size: ~554 GB (FP8)
  • Compression vs. source: ~52% of source size (288 GB / 554 GB)

Hardware requirements

  • Apple Silicon Mac with ≥ 512 GB unified memory (tested on M3 Ultra 512 GB).
  • The bundle will not load on machines with less memory.

Build provenance

Produced by build_kimi_jangtq3l.sh (included in this directory). Pipeline:

  1. jangreap.py — streaming, layer-by-layer REAP saliency profiling (~25 GB peak memory; replaces profile.py, which OOMs on the FP8 source).
  2. prune.py — drops the lowest-saliency 30% of routed experts per layer.
  3. convert_kimi_jangtq --profile 3L — quantizes the pruned model to 3-bit MXTQ. The per-row pack handles in_feat % vals_per_u32 != 0 (with 3 bits, 10 index values fit in one u32, so widths such as 7168 and 2048 need padding).
  4. Tokenizer + chat_template finalization.
  5. generation_config.json patch — Kimi turn-boundary IDs (see the sketch after this list):
    • <|im_end|> = 163586
    • <|im_user|> = 163587
    • <|im_assistant|> = 163588
    • eos_token_id = [163586, 163587, 163588]
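
Step 5 amounts to a small JSON edit. A minimal sketch using the IDs above (which fields the real patch touches beyond eos_token_id is not documented here; the path is relative to the bundle directory):

import json

path = "generation_config.json"  # inside the bundle directory
with open(path) as f:
    cfg = json.load(f)
# Treat all three Kimi turn-boundary tokens as end-of-sequence.
cfg["eos_token_id"] = [163586, 163587, 163588]  # <|im_end|>, <|im_user|>, <|im_assistant|>
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)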

Reproducing

./build_kimi_jangtq3l.sh apply-patches    # idempotent jang_tools patches
./build_kimi_jangtq3l.sh all-pruned       # download → prune → convert → finalize → patch

The script applies three required patches to the bundled jang_tools install:

  • turboquant/linear.py — per-row pack in tq_quantize_weight
  • turboquant/linear.py — per-row unpack in TurboQuantSwitchLinear._dequant_experts
  • load_jangtq.py — read in_features / input_dims from the existing module (avoids overshoot when per-row pad is used: 7168 → 7170)

The marker comment # JANG3L_PATCH_v1 makes patch application idempotent; the rollback-patches subcommand restores the .jang3l.bak backups.
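
The overshoot the load_jangtq.py patch guards against is plain padding arithmetic; a worked example:

BITS = 3
VALS_PER_U32 = 32 // BITS            # 10 three-bit indices per u32 word

def padded_in_feat(in_feat: int) -> int:
    """Round in_features up to a whole number of packed u32 words."""
    words = -(-in_feat // VALS_PER_U32)  # ceiling division
    return words * VALS_PER_U32

assert padded_in_feat(7168) == 7170  # the 7168 -> 7170 case noted above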

Serving

./build_kimi_jangtq3l.sh serve
# or directly:
python -m vmlx_engine.cli serve \
  --model ~/.cache/huggingface/hub/deviad/Kimi-K2.6-JANGTQ_3L \
  --port 8012 \
  --max-tokens 4096 \
  --default-temperature 0.5 \
  --default-top-p 0.9 \
  --default-repetition-penalty 1.1 \
  --tool-call-parser moonshot \
  --enable-auto-tool-choice

OpenAI-compatible endpoints are served under http://127.0.0.1:8012/v1/…
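
Any OpenAI-compatible client works against that base URL. A minimal example with the openai Python package (the model id shown is an assumption; check GET /v1/models for the name the server actually registers):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8012/v1", api_key="unused")  # local server; key is ignored
resp = client.chat.completions.create(
    model="Kimi-K2.6-JANGTQ_3L",  # assumed model id
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)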

Known caveats

  • Paged KV cache is incompatible with this build. Kimi uses MLA attention with a CacheList layout that the paged-cache path does not handle, producing degenerate output (e.g. only ! tokens). Do not pass --use-paged-cache, --enable-block-disk-cache, --paged-cache-block-size, or --max-cache-blocks.
  • 3-bit MXTQ with per-row packing is slower per token than the 2-bit affine routes; the tradeoff is higher output quality.
  • Tokenizer is tiktoken-based (no tokenizer.json); trust_remote_code=True is required.
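
Loading the tokenizer outside the engine therefore looks like this (a sketch, assuming the standard transformers AutoTokenizer path over the bundled kimi_k25_*.py code):

import os
from transformers import AutoTokenizer

bundle = os.path.expanduser("~/.cache/huggingface/hub/deviad/Kimi-K2.6-JANGTQ_3L")
# The tiktoken-based tokenizer ships as Python code in the bundle, hence trust_remote_code.
tok = AutoTokenizer.from_pretrained(bundle, trust_remote_code=True)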

Files

  • build_kimi_jangtq3l.sh — self-contained build script (patches + pipeline + serve).
  • model-*.safetensors (304 shards) — quantized weights.
  • config.json, generation_config.json, chat_template.jinja, kimi_k25_*.py, tiktoken.model — model + tokenizer config.
  • jang_config.json — vMLX/jang_tools profile metadata.
  • README_UPSTREAM.md — original Moonshot model card.
  • LICENSE — modified MIT (inherited from upstream).

License

Modified MIT, inherited from the upstream Moonshot Kimi K2.6 release. See LICENSE and README_UPSTREAM.md.

Credits

  • Upstream model: Moonshot AI — Kimi K2.6.
  • Quantization toolchain: jang_tools / vMLX (JANGQ-AI).
  • This bundle: produced locally on an M3 Ultra (512 GB) by dvd.pugliese@gmail.com.