Hy3-preview-JANGTQ_K

Tencent Hy3-preview — 102 GB on disk (down from ~557 GB BF16 source) — mixed-bit JANGTQ_K quantization on routed experts + 8-bit affine everywhere else. ~30% larger than Hy3-preview-JANGTQ (uniform 2-bit on routed experts), in exchange for a measurable quality bump from keeping the sensitive down_proj at 4-bit, especially on long-output generation.

  • Source: tencent/Hy3-preview (Hy3 architecture, 295 B total / 21 B active, BF16 native, 256 K context, 80 transformer layers + 1 MTP, 192 routed experts top-8 + 1 shared)
  • Quantization: mixed-bit JANGTQ_K on routed experts (bit-assignment sketch follows this list):
    • down_proj: 4-bit (4096-out, residual-stream sensitive)
    • gate_proj: 2-bit (gated by SwiGLU)
    • up_proj: 2-bit (multiplied with gate)
    • attention / shared expert / dense layer-0 / embed / lm_head / MTP matmuls: 8-bit affine
    • RMSNorms / router gate / expert_bias: fp16 / fp32 passthrough
  • MTP: layer 80 weights preserved (mtp_mode=preserved_disabled); decode is one-token-per-forward until the accept/reject speculative loop ships.
  • Bundle size: 102 GB on-disk across 109 shards
  • Runs on: M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+
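
A minimal sketch of that bit assignment as a path-to-encoding rule, assuming illustrative tensor-path substrings (the real shard keys may differ):

def bundle_dtype(path: str) -> str:
    # Hypothetical helper mirroring the list above; the substrings are
    # illustrative, not the exact tensor names used in the shards.
    if "norm" in path or "router.gate" in path or "expert_bias" in path:
        return "fp16/fp32 passthrough"
    if ".experts." in path:
        # Routed experts: 4-bit on down_proj, 2-bit on gate/up.
        return "JANGTQ_K 4-bit" if "down_proj" in path else "JANGTQ_K 2-bit"
    # Attention, shared expert, dense layer-0, embed/lm_head, MTP matmuls.
    return "8-bit affine g=64"

print(bundle_dtype("model.layers.10.mlp.experts.5.down_proj.weight"))  # JANGTQ_K 4-bit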

What's in the bundle

| Module | Source dtype | Bundle dtype |
| --- | --- | --- |
| Routed experts (192 × 3 mats × 79 sparse layers, per-expert layout) | BF16 | JANGTQ_K: down 4-bit, gate/up 2-bit |
| Attention q/k/v/o + q/k norms | BF16 | 8-bit affine g=64 |
| Shared expert (gate/up/down) | BF16 | 8-bit affine g=64 |
| Dense layer-0 MLP | BF16 | 8-bit affine g=64 |
| embed_tokens / lm_head | BF16 | 8-bit affine g=64 |
| MTP layer matmuls | BF16 | 8-bit affine g=64 (preserved_disabled) |
| RMSNorms / router.gate.weight / expert_bias | BF16 / F32 | fp16 passthrough |

A jangtq_runtime.safetensors sidecar (~22 KB) ships for Swift runtimes. It covers the (in=1536, bits=4) and (in=4096, bits=2) codebooks plus sign-flip vectors, since Hy3's routed projections have asymmetric [4096↔1536] dims.
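
To sanity-check the sidecar locally, a quick inspection sketch using the safetensors Python API (the key names printed are whatever the file actually contains; none are documented here):

from safetensors import safe_open

# List every tensor in the sidecar with its shape and dtype.
with safe_open("jangtq_runtime.safetensors", framework="numpy") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        print(name, t.shape, t.dtype)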

Why mixed-bit?

Hy3 uses top-8 routing, so JANGTQ (uniform 2-bit) already averages codebook noise across 8 experts per token and ships coherent output. JANGTQ_K spends extra bits on down_proj — the projection whose output enters the residual stream — to give long-output generation more headroom before residual noise compounds. It is the same scheme that ZAYA1-8B-JANGTQ_K ships for a strictly harder top-1 routing setup.
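
A toy numpy illustration of that averaging effect, with i.i.d. per-expert noise standing in for codebook error and uniform top-8 combine weights assumed (not the actual quantizer):

import numpy as np

rng = np.random.default_rng(0)
trials, d = 10_000, 4096
# Model each expert's quantization error as independent noise on its output.
noise = rng.normal(0.0, 1.0, size=(trials, 8, d))
top1 = noise[:, 0, :]        # top-1 routing: one expert's error hits the residual
top8 = noise.mean(axis=1)    # top-8 combine: eight independent errors average out
print(top1.std(), top8.std())  # ~1.0 vs ~1/sqrt(8) = 0.354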

Loading (Python)

pip install jang-tools mlx-lm

from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("JANGQ-AI/Hy3-preview-JANGTQ_K")

chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="no_think",
)

load_jangtq_model auto-registers model_type=hy_v3 via jang_tools.hy3 before building the MLX skeleton. The loader applies the standard SwitchGLU fused gate+up + P15 router compile + P18 QKV fusion patches automatically.
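
From there, generation runs through the stock mlx-lm path (a sketch, assuming the JANGTQ layers dispatch as ordinary MLX modules; max_tokens is illustrative):

from mlx_lm import generate

# Feed the templated prompt from the snippet above through the model.
text = generate(model, tokenizer, prompt=chat, max_tokens=64, verbose=True)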

Reasoning + tools

  • Reasoning parser: qwen3 (extracts <think>...</think> blocks)
  • Tool parser: hunyuan (Tencent XML-like: <tool_calls><tool_call>name<tool_sep><arg_key>k</arg_key><arg_value>v</arg_value></tool_call></tool_calls>; a minimal parsing sketch follows this list)
  • Reasoning effort: no_think (default) | low | high — pass via apply_chat_template(..., reasoning_effort="…")
  • Cache: kv (standard GQA cache)
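
A minimal parser sketch for the tool-call format above (hand-rolled regex, not the actual hunyuan parser shipped anywhere; get_weather is a made-up example):

import re

def parse_tool_calls(text: str) -> list[dict]:
    # Pull each <tool_call> body, split the name off at <tool_sep>,
    # then zip the <arg_key>/<arg_value> pairs into a dict.
    calls = []
    for body in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.S):
        name, _, rest = body.partition("<tool_sep>")
        keys = re.findall(r"<arg_key>(.*?)</arg_key>", rest, re.S)
        vals = re.findall(r"<arg_value>(.*?)</arg_value>", rest, re.S)
        calls.append({"name": name.strip(), "arguments": dict(zip(keys, vals))})
    return calls

sample = ("<tool_calls><tool_call>get_weather<tool_sep>"
          "<arg_key>city</arg_key><arg_value>Seoul</arg_value>"
          "</tool_call></tool_calls>")
print(parse_tool_calls(sample))
# [{'name': 'get_weather', 'arguments': {'city': 'Seoul'}}]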

Runtime support matrix

| Surface | Status |
| --- | --- |
| jang-tools Python (load_jangtq_model) | ✅ working — this README's load snippet |
| vmlx-swift-lm Swift | ✅ working — Libraries/MLXLLM/Models/Hy3.swift + JANGTQ dispatch; same family path that ships ZAYA and Bailing/Ling |
| vmlx_engine Python | re-export pending |
| MTP speculative decode | preserved-disabled — weights present in bundle, accept/reject loop not yet implemented (schematic sketch below) |
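
The accept/reject loop the last row refers to is not in this bundle's runtime yet. For orientation only, a schematic greedy variant of the standard speculative step, with a hypothetical verify_logits_fn interface (nothing here is shipped code):

def accept_reject_step(verify_logits_fn, tokens, draft_tokens):
    # One base-model forward over prompt + drafts; preds[j] is the argmax
    # prediction for position j+1 (hypothetical interface, greedy only).
    preds = verify_logits_fn(tokens + draft_tokens)
    n, accepted = len(tokens), []
    for i, d in enumerate(draft_tokens):
        if preds[n + i - 1] != d:             # base model disagrees with draft
            accepted.append(preds[n + i - 1]) # keep its correction, stop here
            return accepted
        accepted.append(d)                    # draft verified, keep going
    accepted.append(preds[n + len(draft_tokens) - 1])  # bonus token on full accept
    return accepted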

Credits

  • Quantization + MLX runtime: Jinho Jang (eric@jangq.ai)
  • Source model: Tencent Hy3-preview team
  • License: Tencent Hy Community License — non-commercial, EU/UK/SK excluded; consult the LICENSE for full terms