
# Hy3-preview-JANGTQ
Tencent Hy3-preview — 79 GB on disk (down from the ~557 GB BF16 source) — 2-bit JANGTQ quantization on routed experts + 8-bit affine elsewhere.
- Source: `tencent/Hy3-preview` (Hy3 architecture, 295B total / 21B active params, BF16 native, 256K context, 80 transformer layers + 1 MTP layer, 192 routed experts top-8 + 1 shared)
- Quantization: JANGTQ — 2-bit MXTQ codebook (Hadamard-rotated, Lloyd-Max optimized) on routed-expert weights + 8-bit affine on attention / shared expert / dense layer-0 / embed / lm_head / MTP matmuls + fp16 passthrough on RMSNorms / router gate / expert_bias
- MTP: layer 80 weights preserved (`mtp_mode=preserved_disabled`); decode is one token per forward pass until the accept/reject speculative loop ships
- Bundle size: 79 GB on disk across 85 shards
- Runs on: M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+
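As a rough sanity check, the 79 GB figure follows from numbers quoted elsewhere on this card (the expert layout in the table below and the sidecar's `in_features={1536, 4096}`). This back-of-envelope ignores codebook, group-scale, and fp16-passthrough overhead:

```python
# Rough size estimate from figures quoted on this card; expect a few GB of slack.
hidden, moe_inter = 4096, 1536             # from the sidecar's in_features={1536, 4096}
experts, mats, sparse_layers = 192, 3, 79  # routed-expert layout (table below)

expert_params = experts * mats * sparse_layers * hidden * moe_inter
print(f"routed-expert params: {expert_params / 1e9:.0f}B")     # ~286B

expert_bytes = expert_params * 2 / 8       # 2-bit MXTQ
other_bytes = 295e9 - expert_params        # remainder at ~1 byte/param (8-bit affine)
print(f"estimated size: {(expert_bytes + other_bytes) / 1e9:.0f} GB")  # ~80 GB vs 79 GB shipped
```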
## What's in the bundle
| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (192 × 3 mats × 79 sparse layers, per-expert layout) | BF16 | 2-bit MXTQ + sidecar codebook |
| Attention q/k/v/o + q/k norms | BF16 | 8-bit affine g=64 |
| Shared expert (gate/up/down) | BF16 | 8-bit affine g=64 |
| Dense layer-0 MLP | BF16 | 8-bit affine g=64 |
| embed_tokens / lm_head | BF16 | 8-bit affine g=64 |
| MTP layer matmuls | BF16 | 8-bit affine g=64 (preserved_disabled) |
| RMSNorms / router.gate.weight / expert_bias | BF16 / F32 | fp16 passthrough |
A `jangtq_runtime.safetensors` sidecar (~22 KB) ships for Swift runtimes — it covers the (in_features={1536, 4096}, seed=42, bits=2) codebooks plus sign-flip vectors.
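To see what the sidecar actually ships, a generic safetensors walk works; this just enumerates keys rather than assuming a schema:

```python
from safetensors import safe_open

# List the sidecar's tensors without assuming key names.
with safe_open("jangtq_runtime.safetensors", framework="np") as f:
    for key in f.keys():
        print(key, f.get_tensor(key).shape, f.get_tensor(key).dtype)
```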
## Loading (Python)
```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("JANGQ-AI/Hy3-preview-JANGTQ")
chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="no_think",
)
```
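From there, generation can run through the standard mlx-lm entry point. A minimal sketch, assuming the returned pair is mlx-lm compatible (the loader builds an MLX skeleton, so this should hold):

```python
from mlx_lm import generate

# `chat` is the templated prompt built above.
text = generate(model, tokenizer, prompt=chat, max_tokens=64, verbose=True)
print(text)
```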
`load_jangtq_model` auto-registers `model_type=hy_v3` via `jang_tools.hy3` before building the MLX skeleton. The loader applies the standard SwitchGLU fused gate+up, P15 router compile, and P18 QKV fusion patches automatically. Two Hy3-specific runtime fixes are baked in (both sketched after this list):

- fp32 lm_head. `enable_lm_head_fp32=True` in the bundle config — `Model.__call__` dequantizes the quantized lm_head and accumulates the 4096-dim contraction in fp32 (mirroring DSV4's pattern). bf16 accumulation drifts logits by ~0.5/elem and flips top-k token picks toward high-baseline-energy junk tokens.
- qk_norm under JANGTQ P18 QKV fusion. JANGTQ's QKV-fusion patch replaces the attention `__call__`; `Hy3Attention` declares `use_qk_norm=True` and uses `Hy3HeadRMSNorm` to auto-reshape flat `[B, L, n_heads * head_dim]` input to per-head shape, so RMSNorm normalizes over `head_dim`, not over the entire flat dimension.
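A minimal sketch of both fixes under the shapes described above; the class and function names here are illustrative, not the shipped `Hy3HeadRMSNorm` or `Model.__call__`:

```python
import mlx.core as mx
import mlx.nn as nn

class HeadRMSNorm(nn.Module):
    """Per-head RMSNorm: accepts flat [B, L, n_heads * head_dim] input,
    normalizes over head_dim only, and returns the same flat shape."""

    def __init__(self, head_dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = mx.ones((head_dim,))
        self.head_dim = head_dim
        self.eps = eps

    def __call__(self, x: mx.array) -> mx.array:
        B, L, D = x.shape
        xh = x.reshape(B, L, D // self.head_dim, self.head_dim)  # per-head view
        xh = mx.fast.rms_norm(xh, self.weight, self.eps)         # norm over head_dim
        return xh.reshape(B, L, D)

def lm_head_fp32(hidden: mx.array, w_dequant: mx.array) -> mx.array:
    """Run the final vocab projection with fp32 accumulation to keep logits stable."""
    return mx.matmul(hidden.astype(mx.float32), w_dequant.astype(mx.float32).T)
```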
Decode runs at ~15 tok/s greedy on an M5 Max 128 GB at `reasoning_effort="no_think"`.
## Reasoning + tools
- Reasoning parser: `qwen3` (extracts `<think>...</think>` blocks)
- Tool parser: `hunyuan` (Tencent XML-like: `<tool_calls><tool_call>name<tool_sep><arg_key>k</arg_key><arg_value>v</arg_value></tool_call></tool_calls>`; a parsing sketch follows this list)
- Reasoning effort: `no_think` (default) | `low` | `high` — pass via `apply_chat_template(..., reasoning_effort="…")`
- Default rendering: the template emits a closed `<think></think>` for `no_think` mode; the runtime should NOT auto-open a reasoning prefix unless `low` or `high` is explicitly requested
- Cache: `kv` (standard GQA cache; no MLA, no SSM, no sliding window)
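For illustration, here is a minimal regex pass over that tag layout; the real `hunyuan` parser in jang-tools may handle escaping, nesting, and streaming differently:

```python
import re

# Matches the XML-like layout shown above; illustrative only.
TOOL_CALL = re.compile(r"<tool_call>(.*?)<tool_sep>(.*?)</tool_call>", re.S)
ARG_PAIR = re.compile(r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.S)

def parse_tool_calls(text: str) -> list[dict]:
    calls = []
    for name, blob in TOOL_CALL.findall(text):
        calls.append({
            "name": name.strip(),
            "arguments": {k.strip(): v for k, v in ARG_PAIR.findall(blob)},
        })
    return calls

example = ("<tool_calls><tool_call>get_weather<tool_sep>"
           "<arg_key>city</arg_key><arg_value>Paris</arg_value>"
           "</tool_call></tool_calls>")
print(parse_tool_calls(example))
# [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
```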
## Top-K runtime override
Running `JANGTQ_TOPK_OVERRIDE=4 python serve.py` lowers the per-token expert count from the trained 8 to 4 for a ~10% decode speedup. Coherence holds on short prompts in our smoke tests; long-form quality is not benchmarked. The patcher refuses to set K above the trained value and logs the number of attributes it modified.
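A sketch of that mechanism, assuming MoE gates expose a `num_experts_per_tok` attribute (an assumption; the shipped patcher may target different internals):

```python
import os

def apply_topk_override(model, trained_k: int = 8) -> None:
    """Clamp the per-token expert count on MoE gates; never raise K above trained."""
    k = int(os.environ.get("JANGTQ_TOPK_OVERRIDE", trained_k))
    if k > trained_k:
        raise ValueError(f"JANGTQ_TOPK_OVERRIDE={k} exceeds trained top-{trained_k}")
    patched = 0
    for module in model.modules():           # walk the MLX module tree
        if hasattr(module, "num_experts_per_tok"):
            module.num_experts_per_tok = k   # assumed attribute name; illustrative
            patched += 1
    print(f"top-k override: K={k} applied to {patched} modules")
```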
## Credits
- Quantization + MLX runtime: Jinho Jang (eric@jangq.ai)
- Source model: Tencent Hy3-preview team
- License: Tencent Hy Community License — non-commercial, EU/UK/SK excluded; consult the LICENSE for full terms
## Validated runtime contract
- 80 layers materialize; 79 routed-expert SwitchGLU instances hydrate via TurboQuantLinear (2-bit MXTQ).
- Capabilities verify: `family=hy_v3`, `reasoning_parser=qwen3`, `tool_parser=hunyuan`, `think_in_template=False`, `supports_thinking=True`, `cache_type=kv`, `modality=text`.
- Coherence smoke (M5 Max 128 GB):
  - "What is 2 + 2?" → `4<|hy_eos|>` (15.2 tok/s)
  - "The capital of France is" → top-1 `Paris` (logit 19.13)
  - "def fibonacci(n):" → top-1 `\n`, top-3 includes `return`
- Hard-prompt benchmark coverage (HumanEval, MMLU, long-context) is pending. This bundle is shipped on smoke evidence; treat results beyond short prompts as preview-quality until benchmarks land.
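For re-running a smoke probe locally, a minimal top-k check could look like the following. It assumes calling the model on token ids returns `[B, L, vocab]` logits (the usual mlx-lm convention); it is not the project's actual harness.

```python
import mlx.core as mx

# Greedy top-3 probe on a short prompt, mirroring the smoke checks above.
ids = tokenizer.encode("The capital of France is")
logits = model(mx.array([ids]))[0, -1]          # last-position logits
top3 = mx.argsort(logits)[-3:].tolist()[::-1]   # highest-logit token ids first
for tok in top3:
    print(repr(tokenizer.decode([tok])), float(logits[tok]))
```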
## Runtime support matrix
| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working — this README's load snippet |
| `vmlx-swift-lm` Swift | ✅ working — `Libraries/MLXLLM/Models/Hy3.swift` + JANGTQ codebook dispatch. Same family path that ships ZAYA and Bailing/Ling. |
| `vmlx_engine` Python via re-export | pending — `vmlx_engine.loaders.load_jangtq_hy3` re-export of `jang_tools.hy3.runtime.load_hy3_model` not yet wired |
| MTP speculative decode | preserved-disabled — weights present in bundle, accept/reject loop not yet implemented in any JANG runtime |