TurboMLX
TurboMLX is a companion package for mlx-lm.
Public GitHub release packaging for this preview uses the repository name `Qwen3.5-TurboQuant-MLX-LM`, while the Python package and CLI remain `turbomlx`.
TurboMLX v0.1 Research Preview currently targets Qwen3 / Qwen3.5 full-attention `KVCache` layers only.
Its public contract for the current preview is:
- paper-faithful key-path implementations of TurboQuant `mse` and TurboQuant `prod`
- TurboMLX-owned prompt-cache save/load wrappers for a TurboQuant KV backend
- reference math utilities, mixed-precision paper profiles, and eval helpers
Important limitations:
- values are dense by default
- end-to-end KV behavior is therefore not fully paper-equivalent unless value quantization is enabled
- runtime preview is Qwen-first and currently patches `qwen3_next` plus the shared `mlx_lm.models.base` dispatch symbol
- mixed-architecture Qwen stacks remain experimental as a whole; TurboQuant conversion applies only to full-attention `KVCache` layers and leaves linear-attention `ArraysCache` layers untouched
- rotating/sliding-window families remain unsupported in preview
- v0.1 Research Preview focuses on correctness and quality gates, not throughput leadership
- preview runtime scoring defaults to `oracle_preview`; a narrow `native_mlx` scorer preview now exists only for Qwen3 / Qwen3.5 full-attention `KVCache` with `mode=mse`, `bits_total=4`, and `values_mode=dense`
- `native_mlx` is a Stage A remediation path, not the final packed-index direct score-space scorer
- the supported public runtime entrypoints are `generate_with_backend`, `convert_prompt_cache`, `save_prompt_cache`, and `load_prompt_cache`
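The conversion-scope limitation above can be illustrated with a small sketch. The helper and the set names below are hypothetical, not the TurboMLX API; they only mirror the documented policy that full-attention `KVCache` layers are converted, linear-attention `ArraysCache` layers are left untouched, and rotating/sliding-window families are rejected in preview.

```python
# Hypothetical illustration of the preview conversion scope; not the real API.
CONVERTIBLE = {"KVCache"}          # full-attention layers: TurboQuant-converted
PASSTHROUGH = {"ArraysCache"}      # linear-attention layers: left untouched
# anything else (e.g. rotating/sliding-window caches) is unsupported in preview

def convert_layers(layer_cache_types):
    """Return (converted, untouched) layer indices under the preview policy."""
    converted, untouched = [], []
    for i, cache_type in enumerate(layer_cache_types):
        if cache_type in CONVERTIBLE:
            converted.append(i)
        elif cache_type in PASSTHROUGH:
            untouched.append(i)
        else:
            raise ValueError(f"layer {i}: {cache_type} unsupported in preview")
    return converted, untouched

# A mixed Qwen stack: only the full-attention layers are converted.
conv, skip = convert_layers(["KVCache", "ArraysCache", "KVCache"])
```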
Bit Semantics
- `bits_total` is the user-facing total per-channel bit budget
- `mode=mse`: all `bits_total` bits go to the Lloyd-Max main quantizer
- `mode=prod`: `bits_mse = bits_total - 1`, plus a 1-bit QJL residual
- `mode=prod` is supported only for `bits_total >= 2`
- default policy: 1-bit: `mse`, 2-bit: `prod`, 3/4-bit: `mse`
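A minimal sketch of these bit semantics, assuming hypothetical helper names (this is not the TurboMLX API, just the documented budget split expressed as code):

```python
# Hypothetical helpers mirroring the documented bit semantics.
def split_bit_budget(mode, bits_total):
    """Return (lloyd_max_bits, qjl_residual_bits) for a per-channel budget."""
    if mode == "mse":
        return bits_total, 0       # all bits go to the Lloyd-Max quantizer
    if mode == "prod":
        if bits_total < 2:
            raise ValueError("mode=prod requires bits_total >= 2")
        return bits_total - 1, 1   # reserve 1 bit for the QJL residual
    raise ValueError(f"unknown mode: {mode}")

def default_mode(bits_total):
    """Default policy: 1-bit -> mse, 2-bit -> prod, 3/4-bit -> mse."""
    return "prod" if bits_total == 2 else "mse"
```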
Mixed Precision Paper Profile
Paper-style non-integer effective settings such as 2.5 and 3.5 bits are
represented as explicit mixed-precision outlier profiles.
Supported profile knobs:
- `outlier_channels`
- `outlier_high_bits`
- `regular_bits`
- `outlier_selection_policy`
If mixed precision is disabled, all quality and memory claims are restricted to integer-bit configurations.
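As illustrative arithmetic only, here is how a paper-style fractional budget such as 2.5 bits maps onto an explicit outlier profile. The parameter names follow the profile knobs above; the function itself is a hypothetical sketch, not the shipped implementation:

```python
# Illustrative arithmetic: average per-channel bits for an outlier profile.
def effective_bits(num_channels, outlier_channels, outlier_high_bits, regular_bits):
    """Average per-channel bit budget for a mixed-precision outlier profile."""
    regular_channels = num_channels - outlier_channels
    total_bits = outlier_channels * outlier_high_bits + regular_channels * regular_bits
    return total_bits / num_channels

# 64 of 128 channels at 3 bits, the rest at 2 bits -> 2.5 effective bits
print(effective_bits(128, 64, 3, 2))
```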
Release Policy
v0.1 Research Preview:
- correctness
- serialization stability
- prompt-cache continuity
- long-context quality helpers
- honest benchmark reporting
v1.0 Stable:
- all of the above
- explicit `mlx_quant` decode parity target on the reference benchmark matrix
Scope
Official preview target:
- Qwen3 / Qwen3.5 full-attention layers that use the default non-rotating `KVCache` path
Experimental:
- mixed full-attention / linear-attention Qwen stacks as a whole
- value quantization on the dense-key preview fallback path
- rotating or sliding-window cache families
Preview Eval Surface
- `eval-needle` is a preview retrieval harness that inserts the needle only inside the haystack context
- `eval-jsonl` is a local JSONL exact/substring harness and is not a real LongBench-E implementation
- prompt-cache roundtrip support is provided by `turbomlx.save_prompt_cache()` and `turbomlx.load_prompt_cache()`, not by upstream `mlx-lm` loaders
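A minimal sketch of exact/substring scoring in the spirit of the `eval-jsonl` harness described above; the function and parameter names are illustrative assumptions, not the shipped implementation:

```python
# Illustrative exact/substring scorer; names are hypothetical.
def score_example(prediction, answers, match="substring"):
    """Return 1.0 if any reference answer matches the prediction, else 0.0."""
    pred = prediction.strip().lower()
    for answer in answers:
        ref = answer.strip().lower()
        if match == "exact" and pred == ref:
            return 1.0
        if match == "substring" and ref in pred:
            return 1.0
    return 0.0
```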
Prompt-Cache Policy
- prompt-cache files are trusted-local-only and currently remain pickle-backed
- schema `v2` is the current write format and includes `cache_type_id` metadata
- schema `v1` files still load in read-only compatibility mode via a deprecated `class_path` fallback
- if you load an older cache, re-save it to migrate to `v2`
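The schema policy above can be sketched as a small detection routine. The metadata shapes and the function name here are hypothetical, only illustrating that `v2` files carry `cache_type_id` while `v1` files are recognized via the deprecated `class_path` fallback:

```python
# Hypothetical illustration of the documented schema policy.
def detect_schema(meta):
    """Classify prompt-cache metadata as schema v2 or legacy v1."""
    if "cache_type_id" in meta:
        return "v2"                # current write format
    if "class_path" in meta:
        return "v1"                # read-only compatibility mode
    raise ValueError("unrecognized prompt-cache metadata")
```

To migrate a legacy file, load it with `turbomlx.load_prompt_cache()` and re-save it with `turbomlx.save_prompt_cache()`, as stated above.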
Qwen Preview Runtime
- Qwen3 / Qwen3.5 preview correctness uses a dense reconstructed key fallback on full-attention layers
- grouped-query attention math is delegated back to MLX native SDPA after reconstructing dense keys from the TurboQuant cache
- this path is correctness-first and intentionally preview-grade; it is not a throughput claim
- `--scorer-mode native_mlx` is a narrower preview path layered on top of the same Qwen-first contract
- supported `native_mlx` config:
  - Qwen3 / Qwen3.5
  - full-attention `KVCache`
  - `mode=mse`, `bits_total=4`, `values_mode=dense`
- unsupported `native_mlx` combinations emit a warning once per reason and fall back to the preview scorer path
- the current native scorer is still an intermediate on-device remediation step, not the final packed-index direct-score architecture
Verification
- unit and regression suite: `PYTHONPATH=src .venv/bin/python -m pytest -q`
- MLX smoke and benchmark authority: `.venv312`
- recommended smoke target: `TURBOMLX_SMOKE_QWEN_MODEL=/Users/alican/.lmstudio/models/mlx-community/Qwen3.5-9B-MLX-4bit`
- benchmark methodology for current preview work:
  - use at least 1 warmup run
  - use at least 3 measured repeats
  - report median results
  - inspect `scorer_route` in the output to verify whether `native_mlx` actually ran or fell back
  - inspect `timed_generation_tokens` and `native_working_set_bytes` in the output when comparing preview scorer routes
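The benchmark methodology above (discard warmup runs, take at least 3 measured repeats, report the median) can be sketched as a small harness; `run_fn` and the helper name are illustrative:

```python
import statistics

# Sketch of the methodology above; not a shipped benchmarking tool.
def benchmark(run_fn, warmup=1, repeats=3):
    """Run warmup iterations, then return the median of measured repeats."""
    if repeats < 3:
        raise ValueError("use at least 3 measured repeats")
    for _ in range(warmup):
        run_fn()                               # warmup runs are discarded
    samples = [run_fn() for _ in range(repeats)]
    return statistics.median(samples)
```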
Preview Bundle
- this source tree is a preview-candidate working tree, not a release-ready artifact
- use `scripts/export_preview_bundle.py` to create a clean shareable preview bundle under `dist/`
- use `python3 scripts/export_preview_bundle.py --output-dir /ABS/PATH/Qwen3.5-TurboQuant-MLX-LM` to create a clean public source tree
- distribute the exported preview bundle, not a raw workspace zip
- the exported bundle excludes local virtualenvs, cache directories, transient artifacts, and reference PDFs
Tested Runtime Stack
- `mlx==0.31.1`
- `mlx-lm==0.31.1`
Dates
- arXiv v1: 28 Apr 2025
- OpenReview / ICLR 2026 Poster entry: 26 Jan 2026
This repository started from a blank directory plus the TurboQuant paper, so the current implementation emphasizes clean interfaces and verifiable reference math first. MLX runtime hardening is intentionally staged behind the preview release boundary.