Osaurus-AI committed on
Commit f8c5df9 · verified · 1 Parent(s): c2d6f85

Add OsaurusAI README + banner

Files changed (1)
  1. README.md +145 -0
README.md ADDED
@@ -0,0 +1,145 @@
+ ---
+ license: other
+ license_name: tencent-hy-community
+ license_link: LICENSE
+ library_name: mlx
+ tags:
+ - mlx
+ - jang
+ - jangtq
+ - hy3
+ - hunyuan
+ - hy_v3
+ - moe
+ - apple-silicon
+ - 2bit
+ - 295b
+ - osaurus
+ pipeline_tag: text-generation
+ base_model: tencent/Hy3-preview
+ base_model_relation: quantized
+ ---
+
+ <p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>
+
+ # Hy3-preview-JANGTQ
+
+ **Tencent Hy3-preview — 79 GB on disk** (down from the ~557 GB BF16 source) —
+ 2-bit **JANGTQ** quantization on routed experts + 8-bit affine elsewhere.
+
+ - **Source:** [tencent/Hy3-preview](https://huggingface.co/tencent/Hy3-preview)
+   (Hy3 architecture, 295B total / 21B active, BF16 native, 256K context,
+   80 transformer layers + 1 MTP, 192 routed experts top-8 + 1 shared)
+ - **Quantization:** **JANGTQ** — 2-bit MXTQ codebook (Hadamard-rotated,
+   Lloyd-Max optimized; a toy sketch follows this list) on routed-expert
+   weights + 8-bit affine on attention / shared expert / dense layer-0 /
+   embed / lm_head / MTP matmuls + fp16 passthrough on RMSNorms /
+   router gate / `expert_bias`
+ - **MTP:** layer 80 weights preserved (`mtp_mode=preserved_disabled`);
+   decode is one token per forward pass until the accept/reject
+   speculative loop ships
+ - **Bundle size:** **79 GB on disk** across 85 shards
+ - **Runs on:** M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+
+
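+ The Lloyd-Max side of that codebook fit reduces to a 1-D k-means over the
+ (rotated) weights. A toy illustration only; the real MXTQ pipeline adds the
+ Hadamard rotation, sign-flip vectors, and per-matrix codebooks, all omitted
+ here, and `lloyd_max_2bit` is a name invented for this sketch:
+
+ ```python
+ import numpy as np
+
+ def lloyd_max_2bit(w: np.ndarray, iters: int = 25):
+     """Fit a 4-level (2-bit) Lloyd-Max codebook to a flat weight vector."""
+     levels = np.quantile(w, [0.125, 0.375, 0.625, 0.875])  # spread initialization
+     for _ in range(iters):
+         idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)  # nearest level
+         for j in range(4):  # move each level to the centroid of its cell
+             if np.any(idx == j):
+                 levels[j] = w[idx == j].mean()
+     return levels, idx
+
+ w = np.random.randn(4096).astype(np.float32)
+ codebook, codes = lloyd_max_2bit(w)
+ print(codebook)   # 4 optimized reconstruction levels
+ print(codes[:8])  # one 2-bit index per weight
+ ```
+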
+ ## What's in the bundle
+
+ | Module | Source dtype | Bundle dtype |
+ |---|---|---|
+ | Routed experts (192 × 3 mats × 79 sparse layers, per-expert layout) | BF16 | **2-bit MXTQ** + sidecar codebook |
+ | Attention q/k/v/o + q/k norms | BF16 | 8-bit affine g=64 |
+ | Shared expert (gate/up/down) | BF16 | 8-bit affine g=64 |
+ | Dense layer-0 MLP | BF16 | 8-bit affine g=64 |
+ | `embed_tokens` / `lm_head` | BF16 | 8-bit affine g=64 |
+ | MTP layer matmuls | BF16 | 8-bit affine g=64 (preserved_disabled) |
+ | RMSNorms / `router.gate.weight` / `expert_bias` | BF16 / F32 | fp16 passthrough |
+
+ A `jangtq_runtime.safetensors` sidecar (~22 KB) ships for Swift runtimes; it
+ covers the `(in_features={1536, 4096}, seed=42, bits=2)` codebooks plus the
+ sign-flip vectors.
+
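+ To peek at the sidecar's contents, a quick generic inspection with the
+ `safetensors` library (the tensor names inside the file are not documented
+ here, so this just lists whatever is present):
+
+ ```python
+ from safetensors import safe_open
+
+ # List every tensor in the sidecar with its shape and dtype.
+ with safe_open("jangtq_runtime.safetensors", framework="numpy") as f:
+     for name in f.keys():
+         t = f.get_tensor(name)
+         print(name, t.shape, t.dtype)
+ ```
+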
+ ## Loading (Python)
+
+ ```bash
+ pip install jang-tools mlx-lm
+ ```
+
+ ```python
+ from jang_tools.load_jangtq import load_jangtq_model
+
+ model, tokenizer = load_jangtq_model("OsaurusAI/Hy3-preview-JANGTQ")
+
+ chat = tokenizer.apply_chat_template(
+     [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
+     tokenize=False,
+     add_generation_prompt=True,
+     reasoning_effort="no_think",
+ )
+ ```
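+
+ From here, assuming the returned `model` and `tokenizer` are mlx-lm
+ compatible objects (an assumption; `jang-tools` may wrap them), generation
+ can go through `mlx_lm.generate`:
+
+ ```python
+ # A minimal sketch, assuming mlx-lm-compatible model/tokenizer objects.
+ from mlx_lm import generate
+
+ # `chat` is the templated prompt string built above.
+ text = generate(model, tokenizer, prompt=chat, max_tokens=64, verbose=True)
+ print(text)
+ ```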
+
+ `load_jangtq_model` auto-registers `model_type=hy_v3` via
+ `jang_tools.hy3` before building the MLX skeleton. The loader applies
+ the standard SwitchGLU fused gate+up, P15 router compile, and P18 QKV
+ fusion patches automatically. Two Hy3-specific runtime fixes are baked
+ in (sketched after this list):
+
+ 1. **fp32 lm_head.** `enable_lm_head_fp32=True` in the bundle config:
+    `Model.__call__` dequantizes the quantized lm_head and accumulates
+    the 4096-dim contraction in fp32 (mirroring DSV4's pattern). With
+    bf16 accumulation, logits drift by ~0.5 per element and top-k token
+    picks flip toward high-baseline-energy junk tokens.
+ 2. **qk_norm under JANGTQ P18 QKV fusion.** JANGTQ's QKV-fusion patch
+    replaces the attention `__call__`; `Hy3Attention` declares
+    `use_qk_norm=True` and uses `Hy3HeadRMSNorm` to auto-reshape flat
+    `[B, L, n_heads * head_dim]` input to per-head shape, so RMSNorm
+    normalizes over `head_dim` rather than the entire flat dimension.
+
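+ A minimal sketch of both fixes in plain MLX terms (illustrative only; the
+ real `Model.__call__` and `Hy3HeadRMSNorm` differ in detail, and the names
+ below are stand-ins, not the shipped implementation):
+
+ ```python
+ import mlx.core as mx
+ import mlx.nn as nn
+
+ # Fix 1 (sketch): accumulate the final vocab projection in fp32.
+ def lm_head_fp32(hidden: mx.array, w_dequant: mx.array) -> mx.array:
+     # hidden: [B, L, 4096]; w_dequant: dequantized [vocab, 4096] lm_head weights
+     return hidden.astype(mx.float32) @ w_dequant.T.astype(mx.float32)
+
+ # Fix 2 (sketch): RMSNorm over head_dim on a flat QKV-fused activation.
+ class HeadRMSNorm(nn.Module):
+     def __init__(self, head_dim: int, eps: float = 1e-6):
+         super().__init__()
+         self.head_dim, self.eps = head_dim, eps
+         self.weight = mx.ones((head_dim,))
+
+     def __call__(self, x: mx.array) -> mx.array:
+         B, L, flat = x.shape  # flat == n_heads * head_dim
+         xh = x.reshape(B, L, flat // self.head_dim, self.head_dim)
+         xh = mx.fast.rms_norm(xh, self.weight, self.eps)  # normalize per head
+         return xh.reshape(B, L, flat)
+ ```
+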
+ Decode runs at ~15 tok/s greedy on M5 Max 128 GB at `reasoning_effort=no_think`.
+
+ ## Reasoning + tools
+
+ - **Reasoning parser:** `qwen3` (extracts `<think>...</think>` blocks)
+ - **Tool parser:** `hunyuan` (Tencent XML-like:
+   `<tool_calls><tool_call>name<tool_sep><arg_key>k</arg_key><arg_value>v</arg_value></tool_call></tool_calls>`;
+   see the parse sketch after this list)
+ - **Reasoning effort:** `no_think` (default) | `low` | `high` — pass via
+   `apply_chat_template(..., reasoning_effort="…")`
+ - **Default rendering:** the template emits a closed `<think></think>` for
+   `no_think` mode; the runtime should NOT auto-open a reasoning prefix
+   unless `low` or `high` is explicitly requested
+ - **Cache:** `kv` (standard GQA cache; no MLA, no SSM, no sliding window)
+
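+ As a toy illustration of that wire format (the runtime uses the registered
+ `hunyuan` tool parser; this regex sketch, with a made-up `get_weather`
+ payload, is only for reading the format):
+
+ ```python
+ import re
+
+ # Example payload in the hunyuan tool-call wire format shown above.
+ raw = ("<tool_calls><tool_call>get_weather<tool_sep>"
+        "<arg_key>city</arg_key><arg_value>Paris</arg_value>"
+        "</tool_call></tool_calls>")
+
+ for call in re.findall(r"<tool_call>(.*?)</tool_call>", raw, re.S):
+     name, _, args = call.partition("<tool_sep>")
+     keys = re.findall(r"<arg_key>(.*?)</arg_key>", args, re.S)
+     vals = re.findall(r"<arg_value>(.*?)</arg_value>", args, re.S)
+     print(name, dict(zip(keys, vals)))  # get_weather {'city': 'Paris'}
+ ```
+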
+ ## Top-K runtime override
+
+ `JANGTQ_TOPK_OVERRIDE=4 python serve.py` lowers the per-token expert count
+ from the trained 8 to 4 for a ~10% decode speedup. Coherence holds on
+ short prompts in our smoke tests; long-form quality is not benchmarked.
+ The patcher refuses to set K above the trained value and logs the number
+ of attributes it modified (sketched below).
+
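+ A minimal sketch of that clamp-and-count behavior (illustrative; the module
+ traversal and the `num_experts_per_tok` attribute name are assumptions, not
+ the real patcher):
+
+ ```python
+ import os
+
+ def apply_topk_override(model, trained_k: int = 8) -> int:
+     """Hypothetical sketch: clamp K at the trained value and count
+     how many router attributes were modified."""
+     k = min(int(os.environ.get("JANGTQ_TOPK_OVERRIDE", trained_k)), trained_k)
+     modified = 0
+     for module in model.modules():  # assumes an mlx.nn-style module tree
+         if hasattr(module, "num_experts_per_tok"):  # assumed attribute name
+             module.num_experts_per_tok = k
+             modified += 1
+     print(f"[jangtq] top-k override: K={k}, modified {modified} attribute(s)")
+     return modified
+ ```
+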
+ ## Credits
+
+ - **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
+ - **Source model:** Tencent Hy3-preview team
+ - **License:** [Tencent Hy Community License](LICENSE) — non-commercial, EU/UK/SK
+   excluded; consult the LICENSE for full terms
+
+ ## Validated runtime contract
+
+ - 80 layers materialize; 79 routed-expert SwitchGLU instances hydrate via
+   TurboQuantLinear (2-bit MXTQ).
+ - Capabilities verify: `family=hy_v3`, `reasoning_parser=qwen3`,
+   `tool_parser=hunyuan`, `think_in_template=False`, `supports_thinking=True`,
+   `cache_type=kv`, `modality=text`.
+ - Coherence smoke (M5 Max 128 GB):
+   - "What is 2 + 2?" → `4<|hy_eos|>` (15.2 tok/s)
+   - "The capital of France is" → top-1 ` Paris` (logit 19.13)
+   - "def fibonacci(n):" → top-1 `\n`, top-3 includes ` return`
+ - Hard-prompt benchmark coverage (HumanEval, MMLU, long-context) is
+   pending. This bundle is shipped on smoke evidence; treat results
+   beyond short prompts as preview-quality until benchmarks land.
+
+ ## Runtime support matrix
+
+ | Surface | Status |
+ |---|---|
+ | `jang-tools` Python (`load_jangtq_model`) | ✅ working — this README's load snippet |
+ | `vmlx_engine` Python via re-export | pending — `vmlx_engine.loaders.load_jangtq_hy3` should re-export `jang_tools.hy3.runtime.load_hy3_model` |
+ | `vmlx-swift-lm` Swift | ❌ pending — `LLMModelFactory.dispatchHy3Unsupported` currently throws; needs new `Hy3.swift` model class + JANGTQ Swift dispatch |
+ | MTP speculative decode | preserved-disabled — weights present in bundle, accept/reject loop not yet implemented in any JANG runtime |