TurboMLX
TurboMLX is a companion package for mlx-lm.
Public GitHub release packaging for this preview uses the repository name `Qwen3.5-TurboQuant-MLX-LM`, while the Python package and CLI remain `turbomlx`.
TurboMLX v0.1 Research Preview currently targets Qwen3 / Qwen3.5 full-attention `KVCache` layers only.
Its public contract for the current preview is:
- paper-faithful key-path implementations of TurboQuant `mse` and TurboQuant `prod`
- TurboMLX-owned prompt-cache save/load wrappers for a TurboQuant KV backend
- reference math utilities, mixed-precision paper profiles, and eval helpers
Important limitations:
- values are dense by default
- end-to-end KV behavior is therefore not fully paper-equivalent unless value quantization is enabled
- runtime preview is Qwen-first and currently patches `qwen3_next` plus the shared `mlx_lm.models.base` dispatch symbol
- mixed-architecture Qwen stacks remain experimental as a whole; TurboQuant conversion applies only to full-attention `KVCache` layers and leaves linear-attention `ArraysCache` layers untouched
- rotating/sliding-window families remain unsupported in preview
- v0.1 Research Preview focuses on correctness and quality gates, not throughput leadership
- preview runtime scoring defaults to `oracle_preview`; a narrow `native_mlx` scorer preview now exists only for Qwen3 / Qwen3.5 full-attention `KVCache` with `mode=mse`, `bits_total=4`, and `values_mode=dense`
- `native_mlx` is a Stage A remediation path, not the final packed-index direct score-space scorer
- the supported public runtime entrypoints are `generate_with_backend`, `convert_prompt_cache`, `save_prompt_cache`, and `load_prompt_cache`
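The conversion-scope limitation above can be illustrated with a small sketch. The helper and the set names below are hypothetical, not the TurboMLX API; they only mirror the documented policy that full-attention `KVCache` layers are converted, linear-attention `ArraysCache` layers are left untouched, and rotating/sliding-window families are rejected in preview.

```python
# Hypothetical illustration of the preview conversion scope; not the real API.
CONVERTIBLE = {"KVCache"}          # full-attention layers: TurboQuant-converted
PASSTHROUGH = {"ArraysCache"}      # linear-attention layers: left untouched
# anything else (e.g. rotating/sliding-window caches) is unsupported in preview

def convert_layers(layer_cache_types):
    """Return (converted, untouched) layer indices under the preview policy."""
    converted, untouched = [], []
    for i, cache_type in enumerate(layer_cache_types):
        if cache_type in CONVERTIBLE:
            converted.append(i)
        elif cache_type in PASSTHROUGH:
            untouched.append(i)
        else:
            raise ValueError(f"layer {i}: {cache_type} unsupported in preview")
    return converted, untouched

# A mixed Qwen stack: only the full-attention layers are converted.
conv, skip = convert_layers(["KVCache", "ArraysCache", "KVCache"])
```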
Bit Semantics
- `bits_total` is the user-facing total per-channel bit budget
- `mode=mse`: all `bits_total` bits go to the Lloyd-Max main quantizer
- `mode=prod`: `bits_mse = bits_total - 1`, plus a 1-bit QJL residual
- `mode=prod` is supported only for `bits_total >= 2`
- default policy: 1-bit: `mse`, 2-bit: `prod`, 3/4-bit: `mse`
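A minimal sketch of these bit semantics, assuming hypothetical helper names (this is not the TurboMLX API, just the documented budget split expressed as code):

```python
# Hypothetical helpers mirroring the documented bit semantics.
def split_bit_budget(mode, bits_total):
    """Return (lloyd_max_bits, qjl_residual_bits) for a per-channel budget."""
    if mode == "mse":
        return bits_total, 0       # all bits go to the Lloyd-Max quantizer
    if mode == "prod":
        if bits_total < 2:
            raise ValueError("mode=prod requires bits_total >= 2")
        return bits_total - 1, 1   # reserve 1 bit for the QJL residual
    raise ValueError(f"unknown mode: {mode}")

def default_mode(bits_total):
    """Default policy: 1-bit -> mse, 2-bit -> prod, 3/4-bit -> mse."""
    return "prod" if bits_total == 2 else "mse"
```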
Mixed Precision Paper Profile
Paper-style non-integer effective settings such as 2.5 and 3.5 bits are
represented as explicit mixed-precision outlier profiles.
Supported profile knobs:
- `outlier_channels`
- `outlier_high_bits`
- `regular_bits`
- `outlier_selection_policy`
If mixed precision is disabled, all quality and memory claims are restricted to integer-bit configurations.
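As illustrative arithmetic only, here is how a paper-style fractional budget such as 2.5 bits maps onto an explicit outlier profile. The parameter names follow the profile knobs above; the function itself is a hypothetical sketch, not the shipped implementation:

```python
# Illustrative arithmetic: average per-channel bits for an outlier profile.
def effective_bits(num_channels, outlier_channels, outlier_high_bits, regular_bits):
    """Average per-channel bit budget for a mixed-precision outlier profile."""
    regular_channels = num_channels - outlier_channels
    total_bits = outlier_channels * outlier_high_bits + regular_channels * regular_bits
    return total_bits / num_channels

# 64 of 128 channels at 3 bits, the rest at 2 bits -> 2.5 effective bits
print(effective_bits(128, 64, 3, 2))
```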
Release Policy
v0.1 Research Preview:
- correctness
- serialization stability
- prompt-cache continuity
- long-context quality helpers
- honest benchmark reporting
v1.0 Stable:
- all of the above
- explicit `mlx_quant` decode parity target on the reference benchmark matrix
Scope
Official preview target:
- Qwen3 / Qwen3.5 full-attention layers that use the default non-rotating `KVCache` path
Experimental:
- mixed full-attention / linear-attention Qwen stacks as a whole
- value quantization on the dense-key preview fallback path
- rotating or sliding-window cache families
Preview Eval Surface
- `eval-needle` is a preview retrieval harness that inserts the needle only inside the haystack context
- `eval-jsonl` is a local JSONL exact/substring harness and is not a real LongBench-E implementation
- prompt-cache roundtrip support is provided by `turbomlx.save_prompt_cache()` and `turbomlx.load_prompt_cache()`, not by upstream `mlx-lm` loaders
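A minimal sketch of exact/substring scoring in the spirit of the `eval-jsonl` harness described above; the function and parameter names are illustrative assumptions, not the shipped implementation:

```python
# Illustrative exact/substring scorer; names are hypothetical.
def score_example(prediction, answers, match="substring"):
    """Return 1.0 if any reference answer matches the prediction, else 0.0."""
    pred = prediction.strip().lower()
    for answer in answers:
        ref = answer.strip().lower()
        if match == "exact" and pred == ref:
            return 1.0
        if match == "substring" and ref in pred:
            return 1.0
    return 0.0
```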
Prompt-Cache Policy
- prompt-cache files are trusted-local-only and currently remain pickle-backed
- schema `v2` is the current write format and includes `cache_type_id` metadata
- schema `v1` files still load in read-only compatibility mode via a deprecated `class_path` fallback
- if you load an older cache, re-save it to migrate to `v2`
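The schema policy above can be sketched as a small detection routine. The metadata shapes and the function name here are hypothetical, only illustrating that `v2` files carry `cache_type_id` while `v1` files are recognized via the deprecated `class_path` fallback:

```python
# Hypothetical illustration of the documented schema policy.
def detect_schema(meta):
    """Classify prompt-cache metadata as schema v2 or legacy v1."""
    if "cache_type_id" in meta:
        return "v2"                # current write format
    if "class_path" in meta:
        return "v1"                # read-only compatibility mode
    raise ValueError("unrecognized prompt-cache metadata")
```

To migrate a legacy file, load it with `turbomlx.load_prompt_cache()` and re-save it with `turbomlx.save_prompt_cache()`, as stated above.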
Qwen Preview Runtime
- Qwen3 / Qwen3.5 preview correctness uses a dense reconstructed key fallback on full-attention layers
- grouped-query attention math is delegated back to MLX native SDPA after reconstructing dense keys from the TurboQuant cache
- this path is correctness-first and intentionally preview-grade; it is not a throughput claim
- `--scorer-mode native_mlx` is a narrower preview path layered on top of the same Qwen-first contract
- supported `native_mlx` config:
  - Qwen3 / Qwen3.5
  - full-attention `KVCache`
  - `mode=mse`, `bits_total=4`, `values_mode=dense`
- unsupported `native_mlx` combinations emit a warning once per reason and fall back to the preview scorer path
- the current native scorer is still an intermediate on-device remediation step, not the final packed-index direct-score architecture
Verification
- unit and regression suite: `PYTHONPATH=src .venv/bin/python -m pytest -q`
- MLX smoke and benchmark authority: `.venv312`
- recommended smoke target: `TURBOMLX_SMOKE_QWEN_MODEL=/Users/alican/.lmstudio/models/mlx-community/Qwen3.5-9B-MLX-4bit`
- benchmark methodology for current preview work:
  - use at least 1 warmup run
  - use at least 3 measured repeats
  - report median results
  - inspect `scorer_route` in the output to verify whether `native_mlx` actually ran or fell back
  - inspect `timed_generation_tokens` and `native_working_set_bytes` in the output when comparing preview scorer routes
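The benchmark methodology above (discard warmup runs, take at least 3 measured repeats, report the median) can be sketched as a small harness; `run_fn` and the helper name are illustrative:

```python
import statistics

# Sketch of the methodology above; not a shipped benchmarking tool.
def benchmark(run_fn, warmup=1, repeats=3):
    """Run warmup iterations, then return the median of measured repeats."""
    if repeats < 3:
        raise ValueError("use at least 3 measured repeats")
    for _ in range(warmup):
        run_fn()                               # warmup runs are discarded
    samples = [run_fn() for _ in range(repeats)]
    return statistics.median(samples)
```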
Preview Bundle
- this source tree is a preview-candidate working tree, not a release-ready artifact
- use `scripts/export_preview_bundle.py` to create a clean shareable preview bundle under `dist/`
- use `python3 scripts/export_preview_bundle.py --output-dir /ABS/PATH/Qwen3.5-TurboQuant-MLX-LM` to create a clean public source tree
- distribute the exported preview bundle, not a raw workspace zip
- the exported bundle excludes local virtualenvs, cache directories, transient artifacts, and reference PDFs
Tested Runtime Stack
- `mlx==0.31.1`
- `mlx-lm==0.31.1`
Dates
- arXiv v1: 28 Apr 2025
- OpenReview / ICLR 2026 Poster entry: 26 Jan 2026
This repository started from a blank directory plus the TurboQuant paper, so the current implementation emphasizes clean interfaces and verifiable reference math first. MLX runtime hardening is intentionally staged behind the preview release boundary.