---
license: other
license_name: tencent-hy-community
license_link: LICENSE
library_name: mlx
tags:
- mlx
- jang
- jangtq
- jangtq-k
- mixed-precision
- hy3
- hunyuan
- hy_v3
- moe
- apple-silicon
- 295b
- osaurus
pipeline_tag: text-generation
base_model: tencent/Hy3-preview
base_model_relation: quantized
---

<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

# Hy3-preview-JANGTQ_K

**Tencent Hy3-preview — 102 GB on disk** (down from ~557 GB BF16 source) —
**mixed-bit JANGTQ_K** quantization on routed experts + 8-bit affine
elsewhere. ~30 % larger than `Hy3-preview-JANGTQ` (uniform 2-bit on routed
experts) in exchange for a measurable quality bump where `down_proj`
precision matters, especially long-output generation.

- **Source:** [tencent/Hy3-preview](https://huggingface.co/tencent/Hy3-preview)
  (Hy3 architecture, 295 B total / 21 B active, BF16 native, 256 K
  context, 80 transformer layers + 1 MTP, 192 routed experts top-8 + 1
  shared)
- **Quantization:** **mixed-bit JANGTQ_K** on routed experts (see the
  sketch after this list):
  - `down_proj`: **4-bit** (4096-out, residual-stream sensitive)
  - `gate_proj`: **2-bit** (SwiGLU gate branch)
  - `up_proj`: **2-bit** (multiplied with the gate)
  - attention / shared expert / dense layer-0 / embed / lm_head / MTP
    matmuls: 8-bit affine
  - RMSNorms / router gate / `expert_bias`: fp16 / fp32 passthrough
- **MTP:** layer 80 weights preserved (`mtp_mode=preserved_disabled`);
  decode is one token per forward until the accept/reject speculative
  loop ships.
- **Bundle size:** **102 GB on-disk** across 109 shards
- **Runs on:** M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+

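
As a rough illustration of how that split maps onto weight names, here is
a toy bit-width selector. The function and the key substrings it matches
(`experts`, `down_proj`, `router.gate`, ...) are illustrative assumptions,
not the converter's actual API; the table in the next section is the
authoritative mapping.

```python
def pick_bits(name):
    """Toy selector mirroring the JANGTQ_K split above (illustrative only)."""
    # Norms, the router gate, and expert_bias stay as fp16/fp32 passthrough.
    if any(key in name for key in ("norm", "router.gate", "expert_bias")):
        return None
    # Routed experts: 4-bit for down_proj, 2-bit for gate_proj / up_proj.
    if "experts" in name:
        return 4 if "down_proj" in name else 2
    # Everything else (attention, shared expert, dense layer-0,
    # embed_tokens, lm_head, MTP matmuls): 8-bit affine, group size 64.
    return 8
```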

## What's in the bundle

| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (192 × 3 mats × 79 sparse layers, per-expert layout) | BF16 | **JANGTQ_K**: down 4-bit, gate/up 2-bit |
| Attention q/k/v/o + q/k norms | BF16 | 8-bit affine g=64 |
| Shared expert (gate/up/down) | BF16 | 8-bit affine g=64 |
| Dense layer-0 MLP | BF16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | BF16 | 8-bit affine g=64 |
| MTP layer matmuls | BF16 | 8-bit affine g=64 (preserved_disabled) |
| RMSNorms / `router.gate.weight` / `expert_bias` | BF16 / F32 | fp16 passthrough |

A `jangtq_runtime.safetensors` sidecar (~22 KB) ships for Swift runtimes —
it carries the `(in=1536, bits=4)` and `(in=4096, bits=2)` codebooks plus
sign-flip vectors (Hy3 routed projections have asymmetric `[4096↔1536]` dims).
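
To sanity-check what the sidecar holds on your machine, `mlx.core.load`
reads `.safetensors` files directly; the snippet below just lists whatever
tensors the bundle ships and assumes nothing about their names.

```python
import mlx.core as mx

# List the codebook / sign-flip tensors in the runtime sidecar.
sidecar = mx.load("jangtq_runtime.safetensors")
for name, tensor in sidecar.items():
    print(f"{name}: shape={tensor.shape} dtype={tensor.dtype}")
```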

## Why mixed-bit?

Hy3 uses top-8 routing, so `JANGTQ` (uniform 2-bit) already averages
codebook noise across 8 experts per token and produces coherent output.
`JANGTQ_K` spends extra bits on `down_proj` — the projection whose output
enters the residual stream — to give long-output generation more headroom
before residual noise compounds. It is the same scheme ZAYA1-8B-JANGTQ_K
ships for a strictly harder top-1 routing setup.
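
To make the noise-averaging claim concrete, here is a synthetic sketch
(not measured on the model): if each active expert contributes independent
zero-mean codebook noise and the router mixes the top-k outputs with
roughly equal weights, the mixed noise shrinks by about 1/sqrt(k), so
top-8 starts from a much quieter baseline than top-1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 1024, 2000  # toy hidden size and sample count

def mean_noise_norm(k):
    # k active experts, each adding independent zero-mean "codebook" noise;
    # mixing them with ~1/k router weights averages the noise down.
    noise = rng.normal(scale=0.1, size=(trials, k, d)).mean(axis=1)
    return np.linalg.norm(noise, axis=1).mean()

print(mean_noise_norm(1) / mean_noise_norm(8))  # ~ sqrt(8) ≈ 2.83
```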

## Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/Hy3-preview-JANGTQ_K")

chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="no_think",
)
```
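
From there, generation should work with the stock `mlx_lm` helper
installed above. Treat this as a sketch: it assumes the JANGTQ-patched
model behaves like any other MLX-LM model under `generate`.

```python
from mlx_lm import generate  # installed via the pip line above

# Quick smoke test: feed the templated prompt back in and decode a few tokens.
# Assumes the JANGTQ_K model is drop-in compatible with mlx_lm's generate.
text = generate(model, tokenizer, prompt=chat, max_tokens=64, verbose=True)
print(text)
```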

`load_jangtq_model` auto-registers `model_type=hy_v3` via
`jang_tools.hy3` before building the MLX skeleton. The loader applies
the standard SwitchGLU fused gate+up, P15 router-compile, and P18
QKV-fusion patches automatically.

## Reasoning + tools

- **Reasoning parser:** `qwen3` (extracts `<think>...</think>` blocks)
- **Tool parser:** `hunyuan` (Tencent XML-like:
  `<tool_calls><tool_call>name<tool_sep><arg_key>k</arg_key><arg_value>v</arg_value></tool_call></tool_calls>`;
  a filled-in example follows this list)
- **Reasoning effort:** `no_think` (default) | `low` | `high` — pass via
  `apply_chat_template(..., reasoning_effort="…")`
- **Cache:** `kv` (standard GQA cache)
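
For orientation, here is a filled-in instance of that tool-call format.
The `get_weather` tool and its argument are hypothetical; only the tag
layout comes from the parser spec above.

```python
# Hypothetical tool call in the `hunyuan` tool-parser format described above.
# Only the tag structure follows the spec; the tool name and argument are made up.
example_tool_call = (
    "<tool_calls><tool_call>get_weather<tool_sep>"
    "<arg_key>city</arg_key><arg_value>Seoul</arg_value>"
    "</tool_call></tool_calls>"
)
print(example_tool_call)
```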

## Runtime support matrix

| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working — this README's load snippet |
| `vmlx-swift-lm` Swift | ✅ working — `Libraries/MLXLLM/Models/Hy3.swift` + JANGTQ dispatch. Same family path that ships ZAYA and Bailing/Ling. |
| `vmlx_engine` Python re-export | pending |
| MTP speculative decode | preserved-disabled — weights present in bundle, accept/reject loop not yet implemented |

## Credits

- **Quantization + MLX runtime:** Jinho Jang (eric@osaurus.ai)
- **Source model:** Tencent Hy3-preview team
- **License:** [Tencent Hy Community License](LICENSE) — non-commercial,
  EU/UK/South Korea excluded; consult the LICENSE for full terms