---
license: other
license_name: tencent-hy-community
license_link: LICENSE
library_name: mlx
tags:
- mlx
- jang
- jangtq
- hy3
- hunyuan
- hy_v3
- moe
- apple-silicon
- 2bit
- 295b
- osaurus
pipeline_tag: text-generation
base_model: tencent/Hy3-preview
base_model_relation: quantized
---

OsaurusAI

# Hy3-preview-JANGTQ

**Tencent Hy3-preview — 79 GB on disk** (down from the ~557 GB BF16 source) — 2-bit **JANGTQ** quantization on routed experts + 8-bit affine elsewhere.

- **Source:** [tencent/Hy3-preview](https://huggingface.co/tencent/Hy3-preview) (Hy3 architecture, 295B total / 21B active, BF16 native, 256K context, 80 transformer layers + 1 MTP, 192 routed experts top-8 + 1 shared)
- **Quantization:** **JANGTQ** — 2-bit MXTQ codebook (Hadamard-rotated, Lloyd-Max optimized) on routed-expert weights + 8-bit affine on attention / shared expert / dense layer-0 / embed / lm_head / MTP matmuls + fp16 passthrough on RMSNorms / router gate / `expert_bias`
- **MTP:** layer 80 weights preserved (`mtp_mode=preserved_disabled`); decode is one token per forward pass until the accept/reject speculative loop ships
- **Bundle size:** **79 GB on disk** across 85 shards
- **Runs on:** M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+

## What's in the bundle

| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (192 × 3 mats × 79 sparse layers, per-expert layout) | BF16 | **2-bit MXTQ** + sidecar codebook |
| Attention q/k/v/o + q/k norms | BF16 | 8-bit affine g=64 |
| Shared expert (gate/up/down) | BF16 | 8-bit affine g=64 |
| Dense layer-0 MLP | BF16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | BF16 | 8-bit affine g=64 |
| MTP layer matmuls | BF16 | 8-bit affine g=64 (preserved_disabled) |
| RMSNorms / `router.gate.weight` / `expert_bias` | BF16 / F32 | fp16 passthrough |

A `jangtq_runtime.safetensors` sidecar (~22 KB) ships for Swift runtimes — it covers the `(in_features={1536, 4096}, seed=42, bits=2)` codebooks + sign-flip vectors.

## Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/Hy3-preview-JANGTQ")

chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="no_think",
)
```

`load_jangtq_model` auto-registers `model_type=hy_v3` via `jang_tools.hy3` before building the MLX skeleton. The loader applies the standard SwitchGLU fused gate+up, P15 router compile, and P18 QKV fusion patches automatically.

Two Hy3-specific runtime fixes are baked in (sketched below):

1. **fp32 lm_head.** `enable_lm_head_fp32=True` in the bundle config — `Model.__call__` dequantizes the quantized lm_head and accumulates the 4096-dim contraction in fp32 (mirrors DSV4's pattern). bf16 accumulation drifts logits by ~0.5/elem and flips top-k token picks toward high-baseline-energy junk tokens.
2. **qk_norm under JANGTQ P18 QKV fusion.** JANGTQ's QKV-fusion patch replaces the attention `__call__`; `Hy3Attention` declares `use_qk_norm=True` and uses `Hy3HeadRMSNorm` to auto-reshape flat `[B, L, n_heads * head_dim]` input to per-head shape so RMSNorm normalizes over `head_dim`, not over the entire flat dimension.

Decode runs at ~15 tok/s greedy on an M5 Max 128 GB at `reasoning_effort=no_think`.
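For readers who want to see what those two fixes look like in code, here is a minimal MLX sketch, not the actual `jang_tools` implementation: the names `HeadwiseRMSNorm` and `lm_head_logits_fp32` are illustrative, and the fp32 path assumes an `mlx.nn.QuantizedLinear`-style module exposing `weight` / `scales` / `biases` / `group_size` / `bits`.

```python
import mlx.core as mx
import mlx.nn as nn


class HeadwiseRMSNorm(nn.Module):
    """Illustrative stand-in for the Hy3HeadRMSNorm behavior described above:
    if q/k arrive flat as [B, L, n_heads * head_dim] (as they do after the
    fused QKV projection), reshape to per-head before normalizing so the RMS
    statistics are taken over head_dim, not the whole flat dimension."""

    def __init__(self, head_dim: int, eps: float = 1e-6):
        super().__init__()
        self.head_dim = head_dim
        self.eps = eps
        self.weight = mx.ones((head_dim,))

    def __call__(self, x: mx.array) -> mx.array:
        flat = x.shape[-1] != self.head_dim                 # came in as [B, L, H * D]
        if flat:
            B, L, HD = x.shape
            x = x.reshape(B, L, HD // self.head_dim, self.head_dim)
        x = mx.fast.rms_norm(x, self.weight, self.eps)      # normalize over head_dim only
        return x.reshape(B, L, -1) if flat else x           # restore the flat layout


def lm_head_logits_fp32(hidden: mx.array, lm_head: nn.QuantizedLinear) -> mx.array:
    """Illustrative fp32 lm_head path: dequantize the 8-bit affine weight and run
    the final hidden-dim contraction in float32 so bf16 accumulation error cannot
    flip top-k token picks (the enable_lm_head_fp32=True behavior described above)."""
    w = mx.dequantize(
        lm_head.weight, lm_head.scales, lm_head.biases,
        group_size=lm_head.group_size, bits=lm_head.bits,
    ).astype(mx.float32)
    return hidden.astype(mx.float32) @ w.T
```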
## Reasoning + tools

- **Reasoning parser:** `qwen3` (extracts `<think>...</think>` blocks)
- **Tool parser:** `hunyuan` (Tencent XML-like: `namekv`)
- **Reasoning effort:** `no_think` (default) | `low` | `high` — pass via `apply_chat_template(..., reasoning_effort="…")`
- **Default rendering:** the template emits a closed `<think></think>` block for `no_think` mode; the runtime should NOT auto-open a reasoning prefix unless `low` or `high` is explicitly requested
- **Cache:** `kv` (standard GQA cache; no MLA, no SSM, no sliding window)

## Top-K runtime override

`JANGTQ_TOPK_OVERRIDE=4 python serve.py` lowers the per-token expert count from the trained 8 to 4 for a ~10% decode speedup. Coherence holds on short prompts in our smoke tests; long-form quality is not benchmarked. The patcher refuses to set K above the trained value and logs how many attributes it modified.

## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Source model:** Tencent Hy3-preview team
- **License:** [Tencent Hy Community License](LICENSE) — non-commercial, EU/UK/SK excluded; consult the LICENSE for full terms

## Validated runtime contract

- 80 layers materialize; 79 routed-expert SwitchGLU instances hydrate via TurboQuantLinear (2-bit MXTQ).
- Capabilities verify: `family=hy_v3`, `reasoning_parser=qwen3`, `tool_parser=hunyuan`, `think_in_template=False`, `supports_thinking=True`, `cache_type=kv`, `modality=text`.
- Coherence smoke (M5 Max 128 GB) — a minimal reproduction sketch follows the support matrix below:
  - "What is 2 + 2?" → `4<|hy_eos|>` (15.2 tok/s)
  - "The capital of France is" → top-1 ` Paris` (logit 19.13)
  - "def fibonacci(n):" → top-1 `\n`, top-3 includes ` return`
- Hard-prompt benchmark coverage (HumanEval, MMLU, long-context) is pending. This bundle ships on smoke evidence; treat results beyond short prompts as preview-quality until benchmarks land.

## Runtime support matrix

| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working — this README's load snippet |
| `vmlx-swift-lm` Swift | ✅ working — `Libraries/MLXLLM/Models/Hy3.swift` + JANGTQ codebook dispatch; same family path that ships ZAYA and Bailing/Ling |
| `vmlx_engine` Python via re-export | pending — `vmlx_engine.loaders.load_jangtq_hy3` re-export of `jang_tools.hy3.runtime.load_hy3_model` not yet wired |
| MTP speculative decode | preserved-disabled — weights present in the bundle, accept/reject loop not yet implemented in any JANG runtime |
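To reproduce the coherence smoke locally, the snippet below is a minimal sketch under two assumptions not stated elsewhere in this card: that `mlx_lm.generate` accepts the model/tokenizer pair returned by `load_jangtq_model`, and that its default sampling is greedy. The prompts are adapted from the smoke list above and the token budget is illustrative.

```python
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model, tokenizer = load_jangtq_model("OsaurusAI/Hy3-preview-JANGTQ")

# Short chat prompts adapted from the coherence smoke; expect "4" and "Paris".
for question in ("What is 2 + 2? Answer briefly.", "What is the capital of France?"):
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
        reasoning_effort="no_think",
    )
    print(question, "->", generate(model, tokenizer, prompt=prompt, max_tokens=16))
```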