---
license: other
license_name: tencent-hy-community
license_link: LICENSE
library_name: mlx
tags:
  - mlx
  - jang
  - jangtq
  - jangtq-k
  - mixed-precision
  - hy3
  - hunyuan
  - hy_v3
  - moe
  - apple-silicon
  - 295b
  - osaurus
pipeline_tag: text-generation
base_model: tencent/Hy3-preview
base_model_relation: quantized
---

<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

# Hy3-preview-JANGTQ_K

**Tencent Hy3-preview — 102 GB on disk** (down from ~557 GB BF16 source) —
**mixed-bit JANGTQ_K** quantization on the routed experts, 8-bit affine
everywhere else. About 30 % larger than `Hy3-preview-JANGTQ` (uniform
2-bit on routed experts), in exchange for a measurable quality gain where
`down_proj` precision matters, especially long-output generation.

- **Source:** [tencent/Hy3-preview](https://huggingface.co/tencent/Hy3-preview)
  (Hy3 architecture, 295 B total / 21 B active, BF16 native, 256 K
  context, 80 transformer layers + 1 MTP, 192 routed experts top-8 + 1
  shared)
- **Quantization:** **mixed-bit JANGTQ_K** on routed experts:
  - `down_proj`: **4-bit** (4096-out, residual-stream sensitive)
  - `gate_proj`: **2-bit** (gated by SwiGLU)
  - `up_proj`:   **2-bit** (multiplied with gate)
  - attention / shared expert / dense layer-0 / embed / lm_head / MTP
    matmuls: 8-bit affine
  - RMSNorms / router gate / `expert_bias`: fp16 / fp32 passthrough
- **MTP:** layer 80 weights preserved (`mtp_mode=preserved_disabled`);
  decode is one token per forward pass until the accept/reject
  speculative loop ships.
- **Bundle size:** **102 GB on-disk** across 109 shards
- **Runs on:** M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+

## What's in the bundle

| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (192 × 3 mats × 79 sparse layers, per-expert layout) | BF16 | **JANGTQ_K**: down 4-bit, gate/up 2-bit |
| Attention q/k/v/o + q/k norms | BF16 | 8-bit affine g=64 |
| Shared expert (gate/up/down) | BF16 | 8-bit affine g=64 |
| Dense layer-0 MLP | BF16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | BF16 | 8-bit affine g=64 |
| MTP layer matmuls | BF16 | 8-bit affine g=64 (preserved_disabled) |
| RMSNorms / `router.gate.weight` / `expert_bias` | BF16 / F32 | fp16 passthrough |

A `jangtq_runtime.safetensors` sidecar (~22 KB) ships for Swift runtimes.
It carries the `(in=1536, bits=4)` and `(in=4096, bits=2)` codebooks plus
sign-flip vectors (Hy3 routed projections have asymmetric `[4096↔1536]` dims).
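
The sidecar is a standard safetensors file, so its contents can be
inspected directly. A minimal sketch, assuming nothing beyond the
safetensors format (the tensor names it prints are whatever the exporter
wrote; they are not documented here):

```python
from safetensors.numpy import load_file

# Dump every tensor in the runtime sidecar: codebooks and sign-flip vectors.
sidecar = load_file("jangtq_runtime.safetensors")
for name, tensor in sidecar.items():
    print(name, tensor.shape, tensor.dtype)
```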

## Why mixed-bit?

Hy3 uses top-8 routing, so `JANGTQ` (uniform 2-bit) already averages
codebook noise across 8 experts per token and remains coherent.
`JANGTQ_K` spends extra bits on `down_proj` — the projection whose output
enters the residual stream — so long-output generation has more headroom
before residual noise compounds. This is the same scheme that
ZAYA1-8B-JANGTQ_K uses for a strictly harder top-1 routing setup.
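
A rough back-of-envelope of what the mix costs, using the `[4096↔1536]`
routed-projection dims noted above and ignoring codebook/scale metadata:

```python
# Routed-expert weight payload only; codebooks, sign flips, and scales ignored.
hidden, inter = 4096, 1536            # per-expert projection dims (see sidecar note)
down_bits = inter * hidden * 4        # down_proj at 4-bit
gate_bits = hidden * inter * 2        # gate_proj at 2-bit
up_bits   = hidden * inter * 2        # up_proj at 2-bit

avg_bits = (down_bits + gate_bits + up_bits) / (3 * hidden * inter)
print(avg_bits)              # ~2.67 bits/weight vs 2.0 for uniform JANGTQ
print(avg_bits / 2.0 - 1.0)  # ~0.33, consistent with the ~30 % size delta quoted above
```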

## Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/Hy3-preview-JANGTQ_K")

chat = tokenizer.apply_chat_template(
    [{{"role": "user", "content": "What is 2 + 2? Answer briefly."}}],
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="no_think",
)
```
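
From here, generation works like any other MLX-LM model. The call below
assumes the pair returned by `load_jangtq_model` is compatible with
`mlx_lm.generate` (it is built on the same MLX skeleton); treat it as a
sketch rather than a guaranteed interface:

```python
from mlx_lm import generate

# One-token-per-forward decode; the MTP speculative loop is not implemented yet.
text = generate(model, tokenizer, prompt=chat, max_tokens=64, verbose=True)
print(text)
```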

`load_jangtq_model` auto-registers `model_type=hy_v3` via
`jang_tools.hy3` before building the MLX skeleton. The loader applies
the standard SwitchGLU fused gate+up + P15 router compile + P18 QKV
fusion patches automatically.
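
For readers unfamiliar with the fused gate+up pattern: both SwiGLU halves
come out of a single matmul and are split afterwards. A minimal MLX
illustration of that idea (not the actual SwitchGLU code in `jang_tools`;
the P15/P18 patches are separate concerns):

```python
import mlx.core as mx
import mlx.nn as nn

class FusedSwiGLU(nn.Module):
    """Illustrative fused gate+up MLP: one projection yields both SwiGLU halves."""

    def __init__(self, hidden: int, inter: int):
        super().__init__()
        self.gate_up = nn.Linear(hidden, 2 * inter, bias=False)  # fused gate + up
        self.down = nn.Linear(inter, hidden, bias=False)

    def __call__(self, x: mx.array) -> mx.array:
        gate, up = mx.split(self.gate_up(x), 2, axis=-1)
        return self.down(nn.silu(gate) * up)
```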

## Reasoning + tools

- **Reasoning parser:** `qwen3` (extracts `<think>...</think>` blocks)
- **Tool parser:** `hunyuan` (Tencent XML-like:
  `<tool_calls><tool_call>name<tool_sep><arg_key>k</arg_key><arg_value>v</arg_value></tool_call></tool_calls>`;
  a parsing sketch follows after this list)
- **Reasoning effort:** `no_think` (default) | `low` | `high` — pass via
  `apply_chat_template(..., reasoning_effort="…")`
- **Cache:** `kv` (standard GQA cache)
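
When driving the model without an inference server, both wire formats
above can be pulled apart with plain regexes. A minimal sketch of the
formats as documented here (a real deployment would use the named
`qwen3` / `hunyuan` parsers instead):

```python
import re

def split_reasoning(text: str):
    """Separate qwen3-style <think>...</think> blocks from the visible answer."""
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thoughts, answer

def parse_hunyuan_tool_calls(text: str):
    """Parse Tencent-style <tool_calls> blocks into (name, {arg: value}) pairs."""
    calls = []
    for body in re.findall(r"<tool_call>(.*?)</tool_call>", text, flags=re.DOTALL):
        name, _, rest = body.partition("<tool_sep>")
        keys = re.findall(r"<arg_key>(.*?)</arg_key>", rest, flags=re.DOTALL)
        vals = re.findall(r"<arg_value>(.*?)</arg_value>", rest, flags=re.DOTALL)
        calls.append((name.strip(), dict(zip(keys, vals))))
    return calls
```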

## Runtime support matrix

| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working — this README's load snippet |
| `vmlx-swift-lm` Swift | ✅ working — `Libraries/MLXLLM/Models/Hy3.swift` + JANGTQ dispatch. Same family path that ships ZAYA and Bailing/Ling. |
| `vmlx_engine` Python re-export | pending |
| MTP speculative decode | preserved-disabled — weights present in bundle, accept/reject loop not yet implemented |

## Credits

- **Quantization + MLX runtime:** Jinho Jang (eric@osaurus.ai)
- **Source model:** Tencent Hy3-preview team
- **License:** [Tencent Hy Community License](LICENSE) — non-commercial, EU/UK/SK
  excluded; consult the LICENSE for full terms