# Kimi K2.6 — JANGTQ_3L (vMLX / Apple Silicon)
3-bit MXTQ quantized build of Moonshot AI's Kimi K2.6 for the vMLX inference engine on Apple Silicon. Routed MoE experts are quantized to 3-bit MXTQ (rotation + Lloyd-Max codebook + per-row packed indices + fp16 norms); attention, routers, shared experts, embeddings, and `lm_head` remain at fp16.
For the upstream Moonshot model card (architecture, license, intended use, evaluations), see `README_UPSTREAM.md`.
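MXTQ's internals live in `jang_tools` and are not documented here, but the Lloyd-Max codebook step is classical scalar quantization: alternate between assigning each value to its nearest level and moving each level to the mean of its cell. A minimal NumPy sketch of that idea for a 3-bit (8-level) codebook — `lloyd_max_codebook` is an illustrative name, and the real pipeline also applies a rotation first and keeps fp16 per-row norms:

```python
import numpy as np

def lloyd_max_codebook(w: np.ndarray, bits: int = 3, iters: int = 20):
    """Fit a 2**bits-level Lloyd-Max codebook to one weight row (sketch)."""
    levels = 2 ** bits
    # Start the levels at evenly spaced quantiles of the data.
    codebook = np.quantile(w, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        # Partition: nearest-level assignment (thresholds = level midpoints).
        idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
        # Update: move each level to the conditional mean of its cell.
        for k in range(levels):
            if np.any(idx == k):
                codebook[k] = w[idx == k].mean()
    idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx  # idx values fit in 3 bits

row = np.random.randn(7168).astype(np.float32)
codebook, idx = lloyd_max_codebook(row)
dequant = codebook[idx]  # reconstruction used at inference time
```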
## Bundle facts

| Field | Value |
| --- | --- |
| Format | JANGTQ_3L (vMLX) |
| Routed experts | 3-bit MXTQ, per-row packed |
| Other weights | fp16 |
| Pruning | REAP-30 (routing-aware, ~30% of routed experts dropped) |
| Total size | ~288 GB (304 safetensors shards) |
| Source size | ~554 GB (FP8) |
| Size vs. source | ~52% (288 / 554 GB) |
## Hardware requirements
- Apple Silicon Mac with ≥ 512 GB unified memory (tested on M3 Ultra 512 GB).
- Smaller memory configs will not load this bundle.
## Build provenance

Produced by `build_kimi_jangtq3l.sh` (included in this directory). Pipeline:
- `jangreap.py` — streaming layer-by-layer REAP saliency profile (~25 GB peak; replaces `profile.py`, which OOMs on the FP8 source).
- `prune.py` — drops the lowest-saliency 30% of routed experts per layer.
- `convert_kimi_jangtq --profile 3L` — quantizes the pruned model to MXTQ 3-bit (per-row pack handles `in_feat % vals_per_u32 != 0`, e.g. 7168 / 2048 with 3 bits; a packing sketch follows the patch list under Reproducing).
- Tokenizer + `chat_template` finalization.
- `generation_config.json` patch — Kimi turn-boundary IDs (sketched below this list):
  - `<|im_end|>` = 163586
  - `<|im_user|>` = 163587
  - `<|im_assistant|>` = 163588
  - `eos_token_id = [163586, 163587, 163588]`
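The ID patch itself is mechanical. A minimal sketch of what that final step amounts to, assuming the file layout of this bundle (the script's actual implementation may differ):

```python
import json
from pathlib import Path

cfg_path = Path("generation_config.json")  # in the bundle root
cfg = json.loads(cfg_path.read_text())

# Kimi turn-boundary token IDs listed above.
cfg["eos_token_id"] = [163586, 163587, 163588]  # <|im_end|>, <|im_user|>, <|im_assistant|>

cfg_path.write_text(json.dumps(cfg, indent=2))
```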
## Reproducing

```bash
./build_kimi_jangtq3l.sh apply-patches   # idempotent jang_tools patches
./build_kimi_jangtq3l.sh all-pruned      # download → prune → convert → finalize → patch
```
The script applies three required patches to the bundled `jang_tools` install:

- `turboquant/linear.py` — per-row pack in `tq_quantize_weight` (sketched below)
- `turboquant/linear.py` — per-row unpack in `TurboQuantSwitchLinear._dequant_experts`
- `load_jangtq.py` — read `in_features`/`input_dims` from the existing module (avoids overshoot when the per-row pad is used: 7168 → 7170)

The marker comment `# JANG3L_PATCH_v1` makes patch application idempotent; `rollback-patches` restores the `.jang3l.bak` backups.
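The per-row pad these patches handle is easy to see: a `uint32` holds `32 // 3 = 10` three-bit indices, so a 7168-wide row pads to the next multiple of 10 (7170) before packing. A hedged NumPy sketch of that arithmetic — `pack_row`/`unpack_row` are illustrative names, not the `jang_tools` API:

```python
import numpy as np

BITS = 3
VALS_PER_U32 = 32 // BITS  # 10 three-bit indices fit in one uint32

def pack_row(idx: np.ndarray) -> np.ndarray:
    """Pack one row of 3-bit codebook indices into uint32 words."""
    pad = (-len(idx)) % VALS_PER_U32           # 7168 -> 2 pad values -> 7170
    idx = np.pad(idx, (0, pad)).astype(np.uint32)
    words = idx.reshape(-1, VALS_PER_U32)      # one uint32 per group of 10
    shifts = np.arange(VALS_PER_U32, dtype=np.uint32) * BITS
    return np.bitwise_or.reduce(words << shifts, axis=1)

def unpack_row(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse: expand uint32 words to n indices, dropping the pad."""
    shifts = np.arange(VALS_PER_U32, dtype=np.uint32) * BITS
    vals = (packed[:, None] >> shifts) & ((1 << BITS) - 1)
    return vals.reshape(-1)[:n]                # 7170 -> first 7168 survive

row = np.random.randint(0, 8, size=7168)       # 3-bit indices
assert np.array_equal(unpack_row(pack_row(row), 7168), row)
```

This is also why the `load_jangtq.py` patch reads `in_features` from the existing module: deriving it from the packed width would overshoot to 7170.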
## Serving

```bash
./build_kimi_jangtq3l.sh serve

# or directly:
python -m vmlx_engine.cli serve \
  --model ~/.cache/huggingface/hub/deviad/Kimi-K2.6-JANGTQ_3L \
  --port 8012 \
  --max-tokens 4096 \
  --default-temperature 0.5 \
  --default-top-p 0.9 \
  --default-repetition-penalty 1.1 \
  --tool-call-parser moonshot \
  --enable-auto-tool-choice
```
OpenAI-compatible endpoints are served at `http://127.0.0.1:8012/v1/...`.
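A quick smoke test against the chat completions endpoint; the `model` value here is an assumption, so check `/v1/models` for the name the server actually reports:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8012/v1/chat/completions",
    json={
        "model": "Kimi-K2.6-JANGTQ_3L",  # assumption: verify via GET /v1/models
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```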
## Known caveats

- Paged KV cache is incompatible with this build. Kimi uses MLA attention with a `CacheList` layout that the paged-cache path does not handle, producing degenerate output (e.g. only `!` tokens). Do not pass `--use-paged-cache`, `--enable-block-disk-cache`, `--paged-cache-block-size`, or `--max-cache-blocks`.
- 3-bit MXTQ + per-row pack is slower per token than 2-bit affine routes; quality is the tradeoff.
- Tokenizer is tiktoken-based (no `tokenizer.json`); `trust_remote_code=True` is required (see the loading sketch below).
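For loading the tokenizer outside vMLX, a minimal `transformers` sketch, assuming the default download location used in the serve command above:

```python
from pathlib import Path
from transformers import AutoTokenizer

# trust_remote_code=True is required: the tokenizer is defined by the bundled
# kimi_k25_*.py files plus tiktoken.model, not by a tokenizer.json.
bundle = Path.home() / ".cache/huggingface/hub/deviad/Kimi-K2.6-JANGTQ_3L"
tok = AutoTokenizer.from_pretrained(bundle, trust_remote_code=True)
print(tok.encode("Hello, Kimi"))
```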
## Files

- `build_kimi_jangtq3l.sh` — self-contained build script (patches + pipeline + serve).
- `model-*.safetensors` (304 shards) — quantized weights.
- `config.json`, `generation_config.json`, `chat_template.jinja`, `kimi_k25_*.py`, `tiktoken.model` — model + tokenizer config.
- `jang_config.json` — vMLX/jang_tools profile metadata.
- `README_UPSTREAM.md` — original Moonshot model card.
- `LICENSE` — modified MIT (inherited from upstream).
## License

Modified MIT, inherited from the upstream Moonshot Kimi K2.6 release. See `LICENSE` and `README_UPSTREAM.md`.
## Credits

- Upstream model: Moonshot AI — Kimi K2.6.
- Quantization toolchain: `jang_tools` / vMLX (JANGQ-AI).
- This bundle: produced locally on an M3 Ultra 512 GB by dvd.pugliese@gmail.com.