Hy3-preview-JANGTQ

Tencent Hy3-preview — 79 GB on disk (down from the ~557 GB BF16 source) — 2-bit JANGTQ quantization on routed experts + 8-bit affine elsewhere.

  • Source: tencent/Hy3-preview (Hy3 architecture, 295B total / 21B active, BF16 native, 256K context, 80 transformer layers + 1 MTP, 192 routed experts top-8 + 1 shared)
  • Quantization: JANGTQ — 2-bit MXTQ codebook (Hadamard-rotated, Lloyd-Max optimized) on routed-expert weights + 8-bit affine on attention / shared expert / dense layer-0 / embed / lm_head / MTP matmuls + fp16 passthrough on RMSNorms / router gate / expert_bias
  • MTP: layer 80 weights preserved (mtp_mode=preserved_disabled); decode is one token per forward pass until the accept/reject speculative loop ships
  • Bundle size: 79 GB on-disk across 85 shards
  • Runs on: M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+
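
As a rough consistency check on the numbers above (decimal GB, ignoring the codebook sidecar and metadata), the bundle size works out to roughly 2 bits per weight on average, which is what 2-bit routed experts dominating an otherwise 8-bit model should give:

```python
# Back-of-the-envelope: average bits per parameter implied by the bundle size.
print(79e9 * 8 / 295e9)   # ~2.1 bits/param
```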

What's in the bundle

| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (192 × 3 mats × 79 sparse layers, per-expert layout) | BF16 | 2-bit MXTQ + sidecar codebook |
| Attention q/k/v/o + q/k norms | BF16 | 8-bit affine g=64 |
| Shared expert (gate/up/down) | BF16 | 8-bit affine g=64 |
| Dense layer-0 MLP | BF16 | 8-bit affine g=64 |
| embed_tokens / lm_head | BF16 | 8-bit affine g=64 |
| MTP layer matmuls | BF16 | 8-bit affine g=64 (preserved_disabled) |
| RMSNorms / router.gate.weight / expert_bias | BF16 / F32 | fp16 passthrough |

A jangtq_runtime.safetensors sidecar (~22 KB) ships for Swift runtimes — it covers the (in_features={1536, 4096}, seed=42, bits=2) codebooks and sign-flip vectors.
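
If you want to poke at the sidecar from Python (the Swift runtimes consume it directly), a plain safetensors read works; this sketch just lists what the file carries and assumes nothing about the tensor names:

```python
from safetensors import safe_open

# List the codebook / sign-flip tensors shipped in the runtime sidecar.
with safe_open("jangtq_runtime.safetensors", framework="numpy") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        print(f"{name}: shape={t.shape} dtype={t.dtype}")
```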

Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("JANGQ-AI/Hy3-preview-JANGTQ")

chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="no_think",
)
```
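
The snippet above stops at prompt construction. Assuming the returned model and tokenizer are mlx-lm compatible (the mlx-lm dependency suggests so, but treat this as an assumption rather than a documented contract), generation can go through mlx_lm.generate:

```python
from mlx_lm import generate

# Decode the chat prompt built above; max_tokens is an arbitrary choice.
text = generate(model, tokenizer, prompt=chat, max_tokens=64, verbose=True)
print(text)
```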

load_jangtq_model auto-registers model_type=hy_v3 via jang_tools.hy3 before building the MLX skeleton. The loader automatically applies the standard patches: SwitchGLU fused gate+up, P15 router compile, and P18 QKV fusion. Two Hy3-specific runtime fixes are baked in:

  1. fp32 lm_head. enable_lm_head_fp32=True in the bundle config — Model.__call__ dequantizes the quantized lm_head and accumulates the 4096-dim contraction in fp32 (mirroring DSV4's pattern). bf16 accumulation drifts logits by ~0.5 per element and flips top-k token picks toward high-baseline-energy junk tokens. A sketch of this path follows the list.
  2. qk_norm under JANGTQ P18 QKV fusion. The QKV-fusion patch replaces the attention __call__; Hy3Attention declares use_qk_norm=True and uses Hy3HeadRMSNorm to auto-reshape the flat [B, L, n_heads * head_dim] input to per-head shape, so RMSNorm normalizes over head_dim rather than over the entire flat dimension (see the reshape sketch after this list).
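
The fp32 logit path in item 1 boils down to a promoted matmul. A minimal sketch in MLX; the function and argument names are illustrative, not the bundle's actual API:

```python
import mlx.core as mx

def lm_head_logits_fp32(hidden: mx.array, lm_head_weight: mx.array) -> mx.array:
    """Illustrative fp32 logit path: promote the final hidden states and the
    dequantized lm_head weight to fp32 before the contraction, so the 4096-dim
    accumulation doesn't lose low-order bits the way bf16 does."""
    h = hidden.astype(mx.float32)            # [B, L, 4096] activations
    w = lm_head_weight.astype(mx.float32)    # [vocab, 4096] dequantized weight
    return h @ w.T                           # logits accumulated in fp32
```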

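Item 2's reshape logic, sketched with MLX's built-in RMSNorm (class and shape handling are illustrative, not the actual Hy3HeadRMSNorm implementation):

```python
import mlx.core as mx
import mlx.nn as nn

class HeadRMSNorm(nn.Module):
    """Per-head RMSNorm sketch: normalize over head_dim, not the flat hidden dim."""

    def __init__(self, head_dim: int, eps: float = 1e-6):
        super().__init__()
        self.head_dim = head_dim
        self.norm = nn.RMSNorm(head_dim, eps=eps)

    def __call__(self, x: mx.array) -> mx.array:
        # x arrives flat from the fused QKV projection: [B, L, n_heads * head_dim]
        B, L, flat = x.shape
        x = x.reshape(B, L, flat // self.head_dim, self.head_dim)  # split per head
        x = self.norm(x)                                            # RMSNorm over head_dim
        return x.reshape(B, L, flat)                                # restore flat layout
```
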
Decode runs at ~15 tok/s greedy on an M5 Max 128 GB at reasoning_effort=no_think.

Reasoning + tools

  • Reasoning parser: qwen3 (extracts <think>...</think> blocks)
  • Tool parser: hunyuan (Tencent XML-like: <tool_calls><tool_call>name<tool_sep><arg_key>k</arg_key><arg_value>v</arg_value></tool_call></tool_calls>); a sketch parser follows this list
  • Reasoning effort: no_think (default) | low | high — pass via apply_chat_template(..., reasoning_effort="…")
  • Default rendering: template emits a closed <think></think> for no_think mode; the runtime should NOT auto-open a reasoning prefix unless low or high is explicitly requested
  • Cache: kv (standard GQA cache; no MLA, no SSM, no sliding-window)
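
The hunyuan tool-call format above is simple enough to unpack with a few regular expressions. A sketch (illustrative only, not the runtime's actual hunyuan parser):

```python
import re

def parse_hunyuan_tool_calls(text: str) -> list[dict]:
    """Extract {'name', 'arguments'} dicts from <tool_calls>...</tool_calls> output."""
    calls = []
    for call in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        name, _, args_blob = call.partition("<tool_sep>")
        keys = re.findall(r"<arg_key>(.*?)</arg_key>", args_blob, re.DOTALL)
        values = re.findall(r"<arg_value>(.*?)</arg_value>", args_blob, re.DOTALL)
        calls.append({"name": name.strip(), "arguments": dict(zip(keys, values))})
    return calls
```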

Top-K runtime override

JANGTQ_TOPK_OVERRIDE=4 python serve.py lowers the per-token expert count from the trained 8 to 4 for a ~10% decode speedup. Coherence holds on short prompts in our smoke tests; long-form quality is not benchmarked. The patcher refuses to set K above the trained value and logs the number of attributes it modified.
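
A sketch of the guard described above; the env-var handling matches the description, but the router iteration and attribute name are illustrative, not the actual patcher:

```python
import os
import logging

TRAINED_TOP_K = 8  # Hy3's trained per-token expert count

def apply_topk_override(routers) -> None:
    """Lower per-token expert count from JANGTQ_TOPK_OVERRIDE, never raise it,
    and log how many router modules were modified."""
    raw = os.environ.get("JANGTQ_TOPK_OVERRIDE")
    if raw is None:
        return
    k = int(raw)
    if k > TRAINED_TOP_K:
        logging.warning("refusing JANGTQ_TOPK_OVERRIDE=%d above trained top-k %d", k, TRAINED_TOP_K)
        return
    patched = 0
    for router in routers:        # e.g. one router per sparse layer (illustrative)
        router.top_k = k          # attribute name is illustrative
        patched += 1
    logging.info("top-k lowered to %d on %d router(s)", k, patched)
```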

Credits

  • Quantization + MLX runtime: Jinho Jang (eric@jangq.ai)
  • Source model: Tencent Hy3-preview team
  • License: Tencent Hy Community License — non-commercial, EU/UK/SK excluded; consult the LICENSE for full terms

Validated runtime contract

  • 80 layers materialize; 79 routed-expert SwitchGLU instances hydrate via TurboQuantLinear (2-bit MXTQ).
  • Capabilities verify: family=hy_v3, reasoning_parser=qwen3, tool_parser=hunyuan, think_in_template=False, supports_thinking=True, cache_type=kv, modality=text.
  • Coherence smoke (M5 Max 128 GB; a reproduction sketch follows this list):
    • "What is 2 + 2?" → 4<|hy_eos|> (15.2 tok/s)
    • "The capital of France is" → top-1 Paris (logit 19.13)
    • "def fibonacci(n):" → top-1 \n, top-3 includes return
  • Hard-prompt benchmark coverage (HumanEval, MMLU, long-context) is pending. This bundle is shipped on smoke evidence; treat results beyond short prompts as preview-quality until benchmarks land.
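
The capital-of-France check above needs only a single forward pass and an argmax over the last-position logits. A reproduction sketch, assuming the loaded model returns [B, L, vocab] logits when called on token ids (standard for MLX-LM causal LMs, but an assumption here):

```python
import mlx.core as mx
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("JANGQ-AI/Hy3-preview-JANGTQ")

ids = mx.array([tokenizer.encode("The capital of France is")])
logits = model(ids)[:, -1, :]                     # next-token logits
top1 = int(mx.argmax(logits, axis=-1).item())     # greedy pick
print(repr(tokenizer.decode([top1])))             # expected: ' Paris'
```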

Runtime support matrix

| Surface | Status |
|---|---|
| jang-tools Python (load_jangtq_model) | ✅ working — this README's load snippet |
| vmlx-swift-lm Swift | ✅ working — Libraries/MLXLLM/Models/Hy3.swift + JANGTQ codebook dispatch. Same family path that ships ZAYA and Bailing/Ling. |
| vmlx_engine Python via re-export | pending — vmlx_engine.loaders.load_jangtq_hy3 re-export of jang_tools.hy3.runtime.load_hy3_model not yet wired |
| MTP speculative decode | preserved-disabled — weights present in bundle, accept/reject loop not yet implemented in any JANG runtime |