---
license: other
license_name: tencent-hy-community
license_link: LICENSE
library_name: mlx
tags:
  - mlx
  - jang
  - jangtq
  - hy3
  - hunyuan
  - hy_v3
  - moe
  - apple-silicon
  - 2bit
  - 295b
  - osaurus
pipeline_tag: text-generation
base_model: tencent/Hy3-preview
base_model_relation: quantized
---

<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

# Hy3-preview-JANGTQ

**Tencent Hy3-preview — 79 GB on disk** (down from the ~557 GB BF16 source) —
2-bit **JANGTQ** quantization on routed experts + 8-bit affine elsewhere.

- **Source:** [tencent/Hy3-preview](https://huggingface.co/tencent/Hy3-preview)
  (Hy3 architecture, 295B total / 21B active, BF16 native, 256K context,
  80 transformer layers + 1 MTP, 192 routed experts top-8 + 1 shared)
- **Quantization:** **JANGTQ** — 2-bit MXTQ codebook (Hadamard-rotated,
  Lloyd-Max optimized) on routed-expert weights + 8-bit affine on
  attention / shared expert / dense layer-0 / embed / lm_head / MTP
  matmuls + fp16 passthrough on RMSNorms / router gate / `expert_bias`
  (a toy Lloyd-Max sketch follows this list)
- **MTP:** layer 80 weights preserved (`mtp_mode=preserved_disabled`);
  decode is one-token-per-forward until the accept/reject speculative
  loop ships
- **Bundle size:** **79 GB on-disk** across 85 shards
- **Runs on:** M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+
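
The "Lloyd-Max optimized" codebook step above can be pictured with a toy 1-D fit. The sketch below is illustrative only; the real MXTQ pipeline (Hadamard rotation, sign-flip vectors, per-feature grouping) is part of jang-tools and is not reproduced here.

```python
import numpy as np

def lloyd_max_codebook(values, bits=2, iters=50):
    """Fit a 2**bits-level scalar codebook to `values` by Lloyd-Max iteration
    (equivalent to 1-D k-means under squared error)."""
    levels = 2 ** bits
    # Start codewords at evenly spaced quantiles of the data.
    codebook = np.quantile(values, np.linspace(0.5 / levels, 1 - 0.5 / levels, levels))
    for _ in range(iters):
        # Assign each value to its nearest codeword ...
        idx = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
        # ... then move each codeword to the mean of its assigned values.
        for k in range(levels):
            if np.any(idx == k):
                codebook[k] = values[idx == k].mean()
    return np.sort(codebook)

# Toy weight column with a roughly Gaussian profile.
w = np.random.default_rng(0).normal(scale=0.02, size=4096).astype(np.float32)
cb = lloyd_max_codebook(w, bits=2)                      # 4 codewords for 2-bit
w_hat = cb[np.abs(w[:, None] - cb[None, :]).argmin(axis=1)]
print(cb, float(np.mean((w - w_hat) ** 2)))
```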

## What's in the bundle

| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (192 experts × 3 matrices × 79 sparse layers, per-expert layout) | BF16 | **2-bit MXTQ** + sidecar codebook |
| Attention q/k/v/o + q/k norms | BF16 | 8-bit affine g=64 |
| Shared expert (gate/up/down) | BF16 | 8-bit affine g=64 |
| Dense layer-0 MLP | BF16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | BF16 | 8-bit affine g=64 |
| MTP layer matmuls | BF16 | 8-bit affine g=64 (preserved_disabled) |
| RMSNorms / `router.gate.weight` / `expert_bias` | BF16 / F32 | fp16 passthrough |
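
In the 8-bit rows, "affine g=64" means one scale and zero-point per group of 64 consecutive weights. A minimal numpy sketch of that scheme follows; it illustrates the arithmetic, not the exact packed layout MLX writes to disk.

```python
import numpy as np

GROUP = 64  # group size used by the 8-bit affine modules in the table above

def quantize_affine_u8(w, group=GROUP):
    """Asymmetric (affine) 8-bit quantization with one scale/zero-point per
    `group` consecutive weights along the flattened last axis."""
    g = w.reshape(-1, group)
    lo, hi = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    zero = lo
    q = np.clip(np.round((g - zero) / scale), 0, 255).astype(np.uint8)
    return q, scale.astype(np.float16), zero.astype(np.float16)

def dequantize_affine_u8(q, scale, zero):
    return q.astype(np.float32) * scale.astype(np.float32) + zero.astype(np.float32)

w = np.random.default_rng(0).normal(scale=0.02, size=(4096, 1536)).astype(np.float32)
q, s, z = quantize_affine_u8(w)
err = np.abs(dequantize_affine_u8(q, s, z).reshape(w.shape) - w).max()
print(f"max abs reconstruction error: {err:.2e}")
```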

The `jangtq_runtime.safetensors` sidecar (~22 KB) ships for Swift runtimes and
covers the `(in_features={1536, 4096}, seed=42, bits=2)` codebooks plus
sign-flip vectors.
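
To inspect what the sidecar carries without pulling the full 79 GB bundle, you can download just that file and enumerate its tensors (names and shapes are runtime-internal, so nothing below assumes a particular layout):

```python
from huggingface_hub import hf_hub_download
from safetensors import safe_open

# Fetch only the ~22 KB sidecar, then print every tensor name and its shape.
path = hf_hub_download("OsaurusAI/Hy3-preview-JANGTQ", "jangtq_runtime.safetensors")
with safe_open(path, framework="numpy") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())
```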

## Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/Hy3-preview-JANGTQ")

chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="no_think",
)
```
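
The snippet above only renders the prompt. Assuming the objects returned by `load_jangtq_model` are compatible with the stock `mlx_lm.generate` helper (mlx-lm is installed alongside jang-tools above), a minimal continuation looks like this:

```python
from mlx_lm import generate

# Continue from the snippet above: decode the rendered prompt.
text = generate(model, tokenizer, prompt=chat, max_tokens=64, verbose=True)
print(text)
```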

`load_jangtq_model` auto-registers `model_type=hy_v3` via
`jang_tools.hy3` before building the MLX skeleton. The loader applies
the standard SwitchGLU fused gate+up + P15 router compile + P18 QKV
fusion patches automatically. Two Hy3-specific runtime fixes are baked
in:

1. **fp32 lm_head**. `enable_lm_head_fp32=True` in the bundle config —
   `Model.__call__` dequantizes the 8-bit lm_head and accumulates the
   4096-dim contraction in fp32 (mirrors DSV4's pattern). bf16
   accumulation drifts logits by ~0.5 per element and flips top-k token
   picks toward high-baseline-energy junk tokens.
2. **qk_norm under JANGTQ P18 QKV fusion**. JANGTQ's QKV-fusion patch
   replaces the attention `__call__`; `Hy3Attention` declares
   `use_qk_norm=True` and uses `Hy3HeadRMSNorm` to auto-reshape flat
   `[B, L, n_heads * head_dim]` input to per-head shape so RMSNorm
   normalizes over `head_dim`, not over the entire flat dimension.
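
A minimal sketch of both fixes in MLX terms; the function names are illustrative rather than the actual `jang_tools.hy3` implementations. The point is that the q/k norm reduces over `head_dim` only, and that the logit matmul accumulates in fp32:

```python
import mlx.core as mx

def per_head_rms_norm(x, weight, n_heads, head_dim, eps=1e-6):
    # Reshape the flat [B, L, n_heads * head_dim] activation to per-head form
    # so the RMS statistic is taken over head_dim, not the whole flat axis.
    B, L, _ = x.shape
    xh = x.reshape(B, L, n_heads, head_dim).astype(mx.float32)
    xh = xh * mx.rsqrt(mx.mean(mx.square(xh), axis=-1, keepdims=True) + eps)
    return (weight * xh).astype(x.dtype).reshape(B, L, n_heads * head_dim)

def lm_head_logits_fp32(hidden, w_dequant):
    # Dequantized lm_head weight; the 4096-dim contraction runs in fp32.
    return hidden.astype(mx.float32) @ w_dequant.astype(mx.float32).T
```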

Decode ~15 tok/s greedy on M5 Max 128 GB at `reasoning_effort=no_think`.

## Reasoning + tools

- **Reasoning parser:** `qwen3` (extracts `<think>...</think>` blocks)
- **Tool parser:** `hunyuan` (Tencent XML-like:
  `<tool_calls><tool_call>name<tool_sep><arg_key>k</arg_key><arg_value>v</arg_value></tool_call></tool_calls>`);
  a toy extraction sketch follows this list
- **Reasoning effort:** `no_think` (default) | `low` | `high` — pass via
  `apply_chat_template(..., reasoning_effort="…")`
- **Default rendering:** template emits a closed `<think></think>` for
  `no_think` mode; the runtime should NOT auto-open a reasoning prefix
  unless `low` or `high` is explicitly requested
- **Cache:** `kv` (standard GQA cache; no MLA, no SSM, no sliding-window)
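
As referenced above, here is a toy extractor for that tag layout. It is illustrative only; the production `hunyuan` parser in the serving runtime handles the edge cases (multiple calls, malformed output, streaming).

```python
import re

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.S)
ARG = re.compile(r"<arg_key>(.*?)</arg_key><arg_value>(.*?)</arg_value>", re.S)

def parse_tool_calls(text):
    """Pull {name, arguments} dicts out of a hunyuan-style tool-call block."""
    calls = []
    for body in TOOL_CALL.findall(text):
        name, _, rest = body.partition("<tool_sep>")
        calls.append({"name": name.strip(),
                      "arguments": {k: v for k, v in ARG.findall(rest)}})
    return calls

sample = ("<tool_calls><tool_call>get_weather<tool_sep>"
          "<arg_key>city</arg_key><arg_value>Paris</arg_value>"
          "</tool_call></tool_calls>")
print(parse_tool_calls(sample))
# [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
```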

## Top-K runtime override

`JANGTQ_TOPK_OVERRIDE=4 python serve.py` lowers the per-token expert count
from the trained 8 to 4 for a ~10% decode speedup. Coherence holds on
short prompts in our smoke tests; long-form quality is not benchmarked.
The patcher refuses to set K above the trained value and logs how many
attributes it modified.
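
A hypothetical sketch of that override logic; the attribute and function names below are illustrative, not the actual jang-tools patcher:

```python
import os

def apply_topk_override(model, trained_top_k=8, attr="num_experts_per_tok"):
    """Clamp the requested K to the trained value and patch every module that
    exposes the (hypothetical) per-token expert-count attribute."""
    override = os.environ.get("JANGTQ_TOPK_OVERRIDE")
    if override is None:
        return 0
    k = min(int(override), trained_top_k)   # never raise K above trained value
    patched = 0
    for module in model.modules():          # mlx.nn.Module exposes .modules()
        if hasattr(module, attr):
            setattr(module, attr, k)
            patched += 1
    print(f"top-k override: {attr}={k} on {patched} modules")
    return patched
```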

## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Source model:** Tencent Hy3-preview team
- **License:** [Tencent Hy Community License](LICENSE) — non-commercial, EU/UK/SK
  excluded; consult the LICENSE for full terms

## Validated runtime contract

- 80 layers materialize; 79 routed-expert SwitchGLU instances hydrate via
  TurboQuantLinear (2-bit MXTQ).
- Capabilities verify: `family=hy_v3`, `reasoning_parser=qwen3`,
  `tool_parser=hunyuan`, `think_in_template=False`, `supports_thinking=True`,
  `cache_type=kv`, `modality=text`.
- Coherence smoke (M5 Max 128 GB):
  - "What is 2 + 2?" → `4<|hy_eos|>` (15.2 tok/s)
  - "The capital of France is" → top-1 ` Paris` (logit 19.13)
  - "def fibonacci(n):" → top-1 `\n`, top-3 includes ` return`
- Hard-prompt benchmark coverage (HumanEval, MMLU, long-context) is
  pending. This bundle is shipped on smoke evidence; treat results
  beyond short prompts as preview-quality until benchmarks land.

## Runtime support matrix

| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working — this README's load snippet |
| `vmlx-swift-lm` Swift | ✅ working — `Libraries/MLXLLM/Models/Hy3.swift` + JANGTQ codebook dispatch. Same family path that ships ZAYA and Bailing/Ling. |
| `vmlx_engine` Python via re-export | pending — `vmlx_engine.loaders.load_jangtq_hy3` re-export of `jang_tools.hy3.runtime.load_hy3_model` not yet wired |
| MTP speculative decode | preserved-disabled — weights present in bundle, accept/reject loop not yet implemented in any JANG runtime |