---
license: other
license_name: tencent-hy-community
license_link: LICENSE
library_name: mlx
tags:
- mlx
- jang
- jangtq
- jangtq-k
- mixed-precision
- hy3
- hunyuan
- hy_v3
- moe
- apple-silicon
- 295b
- osaurus
pipeline_tag: text-generation
base_model: tencent/Hy3-preview
base_model_relation: quantized
---

<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

# Hy3-preview-JANGTQ_K

**Tencent Hy3-preview — 102 GB on disk** (down from ~557 GB BF16 source) —
**mixed-bit JANGTQ_K** quantization on routed experts + 8-bit affine
elsewhere. ~30 % larger than `Hy3-preview-JANGTQ` (uniform 2-bit on routed
experts) in exchange for a measurable quality bump where `down_proj`
precision matters, especially long-output generation.

- **Source:** [tencent/Hy3-preview](https://huggingface.co/tencent/Hy3-preview)
  (Hy3 architecture, 295 B total / 21 B active, BF16 native, 256 K
  context, 80 transformer layers + 1 MTP, 192 routed experts top-8 + 1
  shared)
- **Quantization:** **mixed-bit JANGTQ_K** on routed experts (see the
  sketch after this list):
  - `down_proj`: **4-bit** (4096-out, residual-stream sensitive)
  - `gate_proj`: **2-bit** (SwiGLU gate branch)
  - `up_proj`: **2-bit** (multiplied with the gate)
  - attention / shared expert / dense layer-0 / embed / lm_head / MTP
    matmuls: 8-bit affine
  - RMSNorms / router gate / `expert_bias`: fp16 / fp32 passthrough
- **MTP:** layer 80 weights preserved (`mtp_mode=preserved_disabled`);
  decode is one token per forward until the accept/reject speculative
  loop ships.
- **Bundle size:** **102 GB on-disk** across 109 shards
- **Runs on:** M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+

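
As a rough illustration of how that split maps onto weight names, here is
a toy bit-width selector. The function and the key substrings it matches
(`experts`, `down_proj`, `router.gate`, ...) are illustrative assumptions,
not the converter's actual API; the table in the next section is the
authoritative mapping.

```python
def pick_bits(name):
    """Toy selector mirroring the JANGTQ_K split above (illustrative only)."""
    # Norms, the router gate, and expert_bias stay as fp16/fp32 passthrough.
    if any(key in name for key in ("norm", "router.gate", "expert_bias")):
        return None
    # Routed experts: 4-bit for down_proj, 2-bit for gate_proj / up_proj.
    if "experts" in name:
        return 4 if "down_proj" in name else 2
    # Everything else (attention, shared expert, dense layer-0,
    # embed_tokens, lm_head, MTP matmuls): 8-bit affine, group size 64.
    return 8
```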

## What's in the bundle

| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (192 × 3 mats × 79 sparse layers, per-expert layout) | BF16 | **JANGTQ_K**: down 4-bit, gate/up 2-bit |
| Attention q/k/v/o + q/k norms | BF16 | 8-bit affine g=64 |
| Shared expert (gate/up/down) | BF16 | 8-bit affine g=64 |
| Dense layer-0 MLP | BF16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | BF16 | 8-bit affine g=64 |
| MTP layer matmuls | BF16 | 8-bit affine g=64 (preserved_disabled) |
| RMSNorms / `router.gate.weight` / `expert_bias` | BF16 / F32 | fp16 passthrough |

A `jangtq_runtime.safetensors` sidecar (~22 KB) ships for Swift runtimes —
it carries the `(in=1536, bits=4)` and `(in=4096, bits=2)` codebooks plus
sign-flip vectors (Hy3 routed projections have asymmetric `[4096↔1536]` dims).
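
To sanity-check what the sidecar holds on your machine, `mlx.core.load`
reads `.safetensors` files directly; the snippet below just lists whatever
tensors the bundle ships and assumes nothing about their names.

```python
import mlx.core as mx

# List the codebook / sign-flip tensors in the runtime sidecar.
sidecar = mx.load("jangtq_runtime.safetensors")
for name, tensor in sidecar.items():
    print(f"{name}: shape={tensor.shape} dtype={tensor.dtype}")
```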

## Why mixed-bit?

Hy3 uses top-8 routing, so `JANGTQ` (uniform 2-bit) already averages
codebook noise across 8 experts per token and produces coherent output.
`JANGTQ_K` spends extra bits on `down_proj` — the projection whose output
enters the residual stream — to give long-output generation more headroom
before residual noise compounds. It is the same scheme ZAYA1-8B-JANGTQ_K
ships for a strictly harder top-1 routing setup.
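
To make the noise-averaging claim concrete, here is a synthetic sketch
(not measured on the model): if each active expert contributes independent
zero-mean codebook noise and the router mixes the top-k outputs with
roughly equal weights, the mixed noise shrinks by about 1/sqrt(k), so
top-8 starts from a much quieter baseline than top-1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 1024, 2000  # toy hidden size and sample count

def mean_noise_norm(k):
    # k active experts, each adding independent zero-mean "codebook" noise;
    # mixing them with ~1/k router weights averages the noise down.
    noise = rng.normal(scale=0.1, size=(trials, k, d)).mean(axis=1)
    return np.linalg.norm(noise, axis=1).mean()

print(mean_noise_norm(1) / mean_noise_norm(8))  # ~ sqrt(8) ≈ 2.83
```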

## Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/Hy3-preview-JANGTQ_K")

chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="no_think",
)
```
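
From there, generation should work with the stock `mlx_lm` helper
installed above. Treat this as a sketch: it assumes the JANGTQ-patched
model behaves like any other MLX-LM model under `generate`.

```python
from mlx_lm import generate  # installed via the pip line above

# Quick smoke test: feed the templated prompt back in and decode a few tokens.
# Assumes the JANGTQ_K model is drop-in compatible with mlx_lm's generate.
text = generate(model, tokenizer, prompt=chat, max_tokens=64, verbose=True)
print(text)
```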

`load_jangtq_model` auto-registers `model_type=hy_v3` via
`jang_tools.hy3` before building the MLX skeleton. The loader applies
the standard SwitchGLU fused gate+up, P15 router-compile, and P18
QKV-fusion patches automatically.

## Reasoning + tools

- **Reasoning parser:** `qwen3` (extracts `<think>...</think>` blocks)
- **Tool parser:** `hunyuan` (Tencent XML-like:
  `<tool_calls><tool_call>name<tool_sep><arg_key>k</arg_key><arg_value>v</arg_value></tool_call></tool_calls>`;
  a filled-in example follows this list)
- **Reasoning effort:** `no_think` (default) | `low` | `high` — pass via
  `apply_chat_template(..., reasoning_effort="…")`
- **Cache:** `kv` (standard GQA cache)
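
For orientation, here is a filled-in instance of that tool-call format.
The `get_weather` tool and its argument are hypothetical; only the tag
layout comes from the parser spec above.

```python
# Hypothetical tool call in the `hunyuan` tool-parser format described above.
# Only the tag structure follows the spec; the tool name and argument are made up.
example_tool_call = (
    "<tool_calls><tool_call>get_weather<tool_sep>"
    "<arg_key>city</arg_key><arg_value>Seoul</arg_value>"
    "</tool_call></tool_calls>"
)
print(example_tool_call)
```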

## Runtime support matrix

| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working — this README's load snippet |
| `vmlx-swift-lm` Swift | ✅ working — `Libraries/MLXLLM/Models/Hy3.swift` + JANGTQ dispatch. Same family path that ships ZAYA and Bailing/Ling. |
| `vmlx_engine` Python re-export | pending |
| MTP speculative decode | preserved-disabled — weights present in bundle, accept/reject loop not yet implemented |

## Credits

- **Quantization + MLX runtime:** Jinho Jang (eric@osaurus.ai)
- **Source model:** Tencent Hy3-preview team
- **License:** [Tencent Hy Community License](LICENSE) — non-commercial,
  EU/UK/South Korea excluded; consult the LICENSE for full terms