---
license: other
license_name: tencent-hy-community
license_link: LICENSE
library_name: mlx
tags:
- mlx
- jang
- jangtq
- hy3
- hunyuan
- hy_v3
- moe
- apple-silicon
- 2bit
- 295b
- osaurus
pipeline_tag: text-generation
base_model: tencent/Hy3-preview
base_model_relation: quantized
---
<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>
# Hy3-preview-JANGTQ
**Tencent Hy3-preview — 79 GB on disk** (down from the ~557 GB BF16 source) —
2-bit **JANGTQ** quantization on routed experts + 8-bit affine elsewhere.
- **Source:** [tencent/Hy3-preview](https://huggingface.co/tencent/Hy3-preview)
(Hy3 architecture, 295B total / 21B active, BF16 native, 256K context,
80 transformer layers + 1 MTP, 192 routed experts top-8 + 1 shared)
- **Quantization:** **JANGTQ** — 2-bit MXTQ codebook (Hadamard-rotated,
Lloyd-Max optimized) on routed-expert weights + 8-bit affine on
attention / shared expert / dense layer-0 / embed / lm_head / MTP
matmuls + fp16 passthrough on RMSNorms / router gate / `expert_bias`
- **MTP:** layer 80 weights preserved (`mtp_mode=preserved_disabled`);
  decode is one-token-per-forward until the accept/reject speculative
  loop ships
- **Bundle size:** **79 GB on-disk** across 85 shards
- **Runs on:** M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+
## What's in the bundle
| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (192 × 3 mats × 79 sparse layers, per-expert layout) | BF16 | **2-bit MXTQ** + sidecar codebook |
| Attention q/k/v/o + q/k norms | BF16 | 8-bit affine g=64 |
| Shared expert (gate/up/down) | BF16 | 8-bit affine g=64 |
| Dense layer-0 MLP | BF16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | BF16 | 8-bit affine g=64 |
| MTP layer matmuls | BF16 | 8-bit affine g=64 (preserved_disabled) |
| RMSNorms / `router.gate.weight` / `expert_bias` | BF16 / F32 | fp16 passthrough |
A `jangtq_runtime.safetensors` sidecar (~22 KB) ships for Swift runtimes; it
covers `(in_features={1536, 4096}, seed=42, bits=2)` codebooks plus sign-flip
vectors.
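For intuition, here is a minimal numpy sketch of 2-bit codebook dequantization. The packing order, codebook levels, and sign handling below are hypothetical; the real JANGTQ layout (including the Hadamard rotation and scales) is defined by the sidecar and the JANG runtimes, not by this sketch.

```python
import numpy as np

# Hypothetical 2-bit codebook dequant: 4 indices per byte, each index
# selecting one of 4 Lloyd-Max levels, with per-column sign flips.
# Illustrative only -- not the actual JANGTQ on-disk layout.

def unpack_2bit(packed: np.ndarray) -> np.ndarray:
    """Unpack uint8 bytes into 2-bit indices, 4 per byte (LSB-first)."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return ((packed[:, None] >> shifts) & 0b11).reshape(-1)

codebook = np.array([-1.5, -0.5, 0.5, 1.5], dtype=np.float32)  # 4 levels
packed = np.array([0b11100100], dtype=np.uint8)                # indices 0,1,2,3
signs = np.array([1, 1, -1, 1], dtype=np.float32)              # sign-flip vector

weights = codebook[unpack_2bit(packed)] * signs
print(weights)  # [-1.5 -0.5 -0.5  1.5]
```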
## Loading (Python)
```bash
pip install jang-tools mlx-lm
```
```python
from jang_tools.load_jangtq import load_jangtq_model
model, tokenizer = load_jangtq_model("OsaurusAI/Hy3-preview-JANGTQ")
chat = tokenizer.apply_chat_template(
[{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
tokenize=False,
add_generation_prompt=True,
reasoning_effort="no_think",
)
```
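To actually generate from the rendered prompt, the standard `mlx-lm` entry point can be used. A sketch only: keyword arguments may differ across `mlx-lm` versions, and running it requires the full 79 GB bundle on an Apple Silicon machine.

```python
# Continuing from the snippet above. `generate` is the standard mlx-lm
# generation entry point; check your installed mlx-lm version for the
# exact signature.
from mlx_lm import generate

text = generate(model, tokenizer, prompt=chat, max_tokens=64)
print(text)
```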
`load_jangtq_model` auto-registers `model_type=hy_v3` via
`jang_tools.hy3` before building the MLX skeleton. The loader applies
the standard SwitchGLU fused gate+up + P15 router compile + P18 QKV
fusion patches automatically. Two Hy3-specific runtime fixes are baked
in:
1. **fp32 lm_head.** `enable_lm_head_fp32=True` in the bundle config:
   `Model.__call__` dequantizes the quantized lm_head and accumulates
   the 4096-dim contraction in fp32 (mirroring DSV4's pattern). With
   bf16 accumulation, logits drift by ~0.5 per element and top-k token
   picks flip toward high-baseline-energy junk tokens.
2. **qk_norm under JANGTQ P18 QKV fusion**. JANGTQ's QKV-fusion patch
replaces the attention `__call__`; `Hy3Attention` declares
`use_qk_norm=True` and uses `Hy3HeadRMSNorm` to auto-reshape flat
`[B, L, n_heads * head_dim]` input to per-head shape so RMSNorm
normalizes over `head_dim`, not over the entire flat dimension.
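The precision point behind fix 1 can be demonstrated in isolation. A minimal numpy sketch (float16 standing in for bf16, since numpy has no bfloat16; this is not the bundle's code) shows how a low-precision running sum drifts over a 4096-dim contraction while an fp32 accumulator stays close to the exact result:

```python
import numpy as np

# Why accumulation dtype matters on a long dot product.
# float16 stands in for bf16 here; the bundle's actual fix accumulates
# the 4096-dim lm_head contraction in fp32.

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float16)
w = rng.standard_normal(4096).astype(np.float16)

# Reference: accumulate in float64.
exact = float(np.dot(x.astype(np.float64), w.astype(np.float64)))

# Low-precision path: every partial sum rounds back to float16.
acc16 = np.float16(0.0)
for a, b in zip(x, w):
    acc16 = np.float16(acc16 + a * b)

# fp32 path: same fp16 inputs, but products and sums kept in fp32.
acc32 = np.float32(0.0)
for a, b in zip(x, w):
    acc32 = np.float32(acc32 + np.float32(a) * np.float32(b))

print(abs(float(acc16) - exact), abs(float(acc32) - exact))
```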
Decode ~15 tok/s greedy on M5 Max 128 GB at `reasoning_effort=no_think`.
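The per-head reshape in fix 2 amounts to normalizing over `head_dim` rather than the full flat projection. A minimal numpy sketch with illustrative names (not the actual `Hy3HeadRMSNorm` implementation):

```python
import numpy as np

# Per-head RMSNorm: reshape flat [B, L, n_heads * head_dim] input to
# [B, L, n_heads, head_dim] so the norm runs over head_dim only.
# Illustrative sketch, not the Hy3 runtime code.

def head_rmsnorm(x, weight, n_heads, eps=1e-6):
    B, L, flat = x.shape
    head_dim = flat // n_heads
    xh = x.reshape(B, L, n_heads, head_dim)
    rms = np.sqrt(np.mean(xh**2, axis=-1, keepdims=True) + eps)
    out = xh / rms * weight  # weight has shape [head_dim]
    return out.reshape(B, L, flat)

x = np.random.default_rng(1).standard_normal((2, 3, 8)).astype(np.float32)
w = np.ones(4, dtype=np.float32)
y = head_rmsnorm(x, w, n_heads=2)
# Each 4-dim head now has RMS ~= 1 (normalizing over the whole flat
# 8-dim vector instead would not give this property per head).
print(np.sqrt((y.reshape(2, 3, 2, 4) ** 2).mean(-1)))
```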
## Reasoning + tools
- **Reasoning parser:** `qwen3` (extracts `<think>...</think>` blocks)
- **Tool parser:** `hunyuan` (Tencent XML-like:
`<tool_calls><tool_call>name<tool_sep><arg_key>k</arg_key><arg_value>v</arg_value></tool_call></tool_calls>`)
- **Reasoning effort:** `no_think` (default) | `low` | `high` — pass via
`apply_chat_template(..., reasoning_effort="…")`
- **Default rendering:** template emits a closed `<think></think>` for
`no_think` mode; the runtime should NOT auto-open a reasoning prefix
unless `low` or `high` is explicitly requested
- **Cache:** `kv` (standard GQA cache; no MLA, no SSM, no sliding-window)
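For readers wiring their own client, a minimal happy-path parser for the tool-call format above can look like this. Illustrative only: the production `hunyuan` tool parser is more robust (streaming, malformed output, multiple calls).

```python
import re

# Parse the Tencent XML-like tool-call format:
# <tool_calls><tool_call>name<tool_sep>
#   <arg_key>k</arg_key><arg_value>v</arg_value></tool_call></tool_calls>
# Happy-path sketch, not the production hunyuan parser.

CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ARG_RE = re.compile(r"<arg_key>(.*?)</arg_key><arg_value>(.*?)</arg_value>", re.DOTALL)

def parse_tool_calls(text: str):
    calls = []
    for body in CALL_RE.findall(text):
        name, _, rest = body.partition("<tool_sep>")
        calls.append({"name": name.strip(), "args": dict(ARG_RE.findall(rest))})
    return calls

sample = ("<tool_calls><tool_call>get_weather<tool_sep>"
          "<arg_key>city</arg_key><arg_value>Paris</arg_value>"
          "</tool_call></tool_calls>")
print(parse_tool_calls(sample))
# [{'name': 'get_weather', 'args': {'city': 'Paris'}}]
```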
## Top-K runtime override
`JANGTQ_TOPK_OVERRIDE=4 python serve.py` lowers per-token expert count
from the trained 8 to 4 for ~10% decode speedup. Coherence holds on
short prompts in our smoke tests; long-form quality is not benchmarked.
The patcher refuses to set K above the trained value and logs the
attribute count it modified.
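The clamping behavior can be sketched as follows (illustrative names, not the actual jang-tools patcher):

```python
# Sketch of the JANGTQ_TOPK_OVERRIDE clamp described above: the override
# may lower the per-token expert count below the trained top-8, but the
# patcher refuses to raise it. Hypothetical helper, not jang-tools API.

TRAINED_TOP_K = 8

def effective_top_k(env: dict) -> int:
    raw = env.get("JANGTQ_TOPK_OVERRIDE")
    if raw is None:
        return TRAINED_TOP_K
    k = int(raw)
    # Never exceed the trained value; never drop below 1 expert.
    return min(max(k, 1), TRAINED_TOP_K)

print(effective_top_k({}))                              # 8 (no override)
print(effective_top_k({"JANGTQ_TOPK_OVERRIDE": "4"}))   # 4
print(effective_top_k({"JANGTQ_TOPK_OVERRIDE": "16"}))  # clamped back to 8
```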
## Credits
- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Source model:** Tencent Hy3-preview team
- **License:** [Tencent Hy Community License](LICENSE) — non-commercial, EU/UK/SK
excluded; consult the LICENSE for full terms
## Validated runtime contract
- 80 layers materialize; 79 routed-expert SwitchGLU instances hydrate via
TurboQuantLinear (2-bit MXTQ).
- Capabilities verify: `family=hy_v3`, `reasoning_parser=qwen3`,
`tool_parser=hunyuan`, `think_in_template=False`, `supports_thinking=True`,
`cache_type=kv`, `modality=text`.
- Coherence smoke (M5 Max 128 GB):
- "What is 2 + 2?" → `4<|hy_eos|>` (15.2 tok/s)
- "The capital of France is" → top-1 ` Paris` (logit 19.13)
- "def fibonacci(n):" → top-1 `\n`, top-3 includes ` return`
- Hard-prompt benchmark coverage (HumanEval, MMLU, long-context) is
pending. This bundle is shipped on smoke evidence; treat results
beyond short prompts as preview-quality until benchmarks land.
## Runtime support matrix
| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working — this README's load snippet |
| `vmlx-swift-lm` Swift | ✅ working — `Libraries/MLXLLM/Models/Hy3.swift` + JANGTQ codebook dispatch. Same family path that ships ZAYA and Bailing/Ling. |
| `vmlx_engine` Python via re-export | pending — `vmlx_engine.loaders.load_jangtq_hy3` re-export of `jang_tools.hy3.runtime.load_hy3_model` not yet wired |
| MTP speculative decode | preserved-disabled — weights present in bundle, accept/reject loop not yet implemented in any JANG runtime |