
# Hy3-preview-JANGTQ_K
Tencent Hy3-preview at 102 GB on disk (down from ~557 GB for the BF16 source): mixed-bit JANGTQ_K quantization on the routed experts plus 8-bit affine everywhere else. It is ~30 % larger than Hy3-preview-JANGTQ (uniform 2-bit on routed experts) in exchange for a measurable quality bump on the sensitive `down_proj` path, especially for long-output generation.
- Source: tencent/Hy3-preview (Hy3 architecture, 295 B total / 21 B active, BF16 native, 256 K context, 80 transformer layers + 1 MTP, 192 routed experts top-8 + 1 shared)
- Quantization: mixed-bit JANGTQ_K on routed experts:
  - `down_proj`: 4-bit (4096-out, residual-stream sensitive)
  - `gate_proj`: 2-bit (gated by SwiGLU)
  - `up_proj`: 2-bit (multiplied with gate)
- Attention / shared expert / dense layer-0 / embed / lm_head / MTP matmuls: 8-bit affine
- RMSNorms / router gate / expert_bias: fp16 / fp32 passthrough
- MTP: layer 80 weights preserved (`mtp_mode=preserved_disabled`); decode is one token per forward until the accept/reject speculative loop ships.
- Bundle size: 102 GB on disk across 109 shards (a rough size check follows this list)
- Runs on: M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+
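A quick back-of-envelope check on those sizes, using only the figures above; this is illustrative and ignores per-group scales, codebook overhead, and GB/GiB rounding:

```python
# Average bits per parameter implied by the quoted sizes (illustrative).
total_params = 295e9                 # 295 B total parameters
jangtq_k_bytes = 102e9               # this bundle: 102 GB on disk
print(jangtq_k_bytes * 8 / total_params)   # ~2.8 bits/param for JANGTQ_K

# The uniform 2-bit JANGTQ bundle is quoted as ~30 % smaller.
jangtq_bytes = jangtq_k_bytes / 1.3
print(jangtq_bytes * 8 / total_params)     # ~2.1 bits/param for JANGTQ
# The ~0.7-bit gap is the extra 2 bits spent on the routed-expert down_proj
# matrices (roughly a third of the routed-expert parameters).
```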
## What's in the bundle
| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (192 × 3 mats × 79 sparse layers, per-expert layout) | BF16 | JANGTQ_K: down 4-bit, gate/up 2-bit |
| Attention q/k/v/o + q/k norms | BF16 | 8-bit affine g=64 |
| Shared expert (gate/up/down) | BF16 | 8-bit affine g=64 |
| Dense layer-0 MLP | BF16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | BF16 | 8-bit affine g=64 |
| MTP layer matmuls | BF16 | 8-bit affine g=64 (preserved_disabled) |
| RMSNorms / `router.gate.weight` / `expert_bias` | BF16 / F32 | fp16 passthrough |
A `jangtq_runtime.safetensors` sidecar (~22 KB) ships for Swift runtimes. It carries the (in=1536, bits=4) and (in=4096, bits=2) codebooks plus sign-flip vectors, since the Hy3 routed projections have asymmetric 4096↔1536 dims.
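To inspect what the sidecar actually contains, a minimal peek with the `safetensors` library is enough; the tensor names are not documented here, so the snippet simply enumerates whatever is present:

```python
from safetensors import safe_open

# List every tensor in the runtime sidecar (codebooks + sign-flip vectors).
with safe_open("jangtq_runtime.safetensors", framework="numpy") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        print(name, t.shape, t.dtype)
```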
## Why mixed-bit?
Hy3 is top-8 routing, so JANGTQ (uniform 2-bit) already averages
codebook noise across 8 experts per token and ships coherent. JANGTQ_K
spends extra bits on down_proj — the projection whose output enters
the residual stream — to give long-output generation more headroom
before residual noise compounds. It is the same scheme that ZAYA1-8B-JANGTQ_K
ships for a strictly harder top-1 routing setup.
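A toy illustration (not a measurement, and not the JANGTQ_K error model): error added to the residual stream once per layer keeps accumulating across all 80 layers, which is why the projection that writes into the residual gets the extra bits:

```python
import numpy as np

# Toy model of residual-stream noise accumulation (illustrative only).
rng = np.random.default_rng(0)
layers, dim, noise_scale = 80, 4096, 1e-3

residual = rng.standard_normal(dim)
clean = residual.copy()
for _ in range(layers):
    # Stand-in for the codebook error carried by each layer's down_proj output.
    residual = residual + noise_scale * rng.standard_normal(dim)

# Relative drift grows roughly with sqrt(layers) for independent per-layer error.
print(np.linalg.norm(residual - clean) / np.linalg.norm(clean))
```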
## Loading (Python)
```bash
pip install jang-tools mlx-lm
```
```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("JANGQ-AI/Hy3-preview-JANGTQ_K")

chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="no_think",
)
```
`load_jangtq_model` auto-registers `model_type=hy_v3` via `jang_tools.hy3` before building the MLX skeleton. The loader also applies the standard patches automatically: SwitchGLU fused gate+up, the P15 router compile, and the P18 QKV fusion.
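A minimal end-to-end generation sketch, assuming the returned model and tokenizer are mlx-lm compatible (the install line above pulls in mlx-lm, but confirm this against jang-tools' own docs):

```python
from mlx_lm import generate

# Decode the chat prompt built above; default sampling is greedy.
text = generate(model, tokenizer, prompt=chat, max_tokens=64, verbose=True)
print(text)
```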
## Reasoning + tools
- Reasoning parser: `qwen3` (extracts `<think>...</think>` blocks)
- Tool parser: `hunyuan` (Tencent XML-like: `<tool_calls><tool_call>name<tool_sep><arg_key>k</arg_key><arg_value>v</arg_value></tool_call></tool_calls>`; see the parsing sketch after this list)
- Reasoning effort: `no_think` (default) | `low` | `high`; pass via `apply_chat_template(..., reasoning_effort="…")`
- Cache: `kv` (standard GQA cache)
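For illustration only, here is a minimal way to pull calls out of that hunyuan-style block; the actual runtimes ship their own `hunyuan` tool parser, which also handles multiple calls and escaping:

```python
import re

def parse_tool_calls(text: str) -> list[dict]:
    """Extract {'name', 'arguments'} dicts from a <tool_calls>...</tool_calls> block."""
    calls = []
    for call in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.S):
        name, _, rest = call.partition("<tool_sep>")
        keys = re.findall(r"<arg_key>(.*?)</arg_key>", rest, re.S)
        vals = re.findall(r"<arg_value>(.*?)</arg_value>", rest, re.S)
        calls.append({"name": name.strip(), "arguments": dict(zip(keys, vals))})
    return calls

example = ("<tool_calls><tool_call>get_weather<tool_sep>"
           "<arg_key>city</arg_key><arg_value>Seoul</arg_value>"
           "</tool_call></tool_calls>")
print(parse_tool_calls(example))
# [{'name': 'get_weather', 'arguments': {'city': 'Seoul'}}]
```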
## Runtime support matrix
| Surface | Status |
|---|---|
| jang-tools Python (`load_jangtq_model`) | ✅ working (this README's load snippet) |
| vmlx-swift-lm Swift | ✅ working: `Libraries/MLXLLM/Models/Hy3.swift` + JANGTQ dispatch. Same family path that ships ZAYA and Bailing/Ling. |
| vmlx_engine Python re-export | pending |
| MTP speculative decode | preserved-disabled: weights present in bundle, accept/reject loop not yet implemented (sketched below) |
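For context on that last row, here is a hypothetical sketch of a greedy single-draft accept/reject loop. `mtp_draft` and `main_forward` are toy placeholders for the MTP head and the main 80-layer stack; nothing like this ships in jang-tools yet, and today the runtime decodes one token per forward.

```python
# Hypothetical sketch of a greedy single-draft MTP accept/reject loop.

def mtp_draft(tokens):
    # Placeholder draft model: guess "last token + 1".
    return tokens[-1] + 1

def main_forward(tokens):
    # Placeholder main model: returns, for every position, its predicted next
    # token id. A real model returns logits; token ids keep this runnable.
    return [t + 1 for t in tokens]

def speculative_decode(tokens, max_new_tokens):
    new = 0
    while new < max_new_tokens:
        draft = mtp_draft(tokens)                # cheap proposal from the MTP head
        preds = main_forward(tokens + [draft])   # one verify pass over committed + draft
        verified = preds[-2]                     # main model's prediction at the draft slot
        tokens.append(verified)
        new += 1
        if verified == draft and new < max_new_tokens:
            # Accepted: the same pass already scored the position after the
            # draft, so a second token comes out without another forward.
            tokens.append(preds[-1])
            new += 1
    return tokens

print(speculative_decode([1, 2, 3], 5))   # -> [1, 2, 3, 4, 5, 6, 7, 8]
```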
## Credits
- Quantization + MLX runtime: Jinho Jang (eric@jangq.ai)
- Source model: Tencent Hy3-preview team
- License: Tencent Hy Community License — non-commercial, EU/UK/SK excluded; consult the LICENSE for full terms