---
license: other
license_name: tencent-hy-community
license_link: LICENSE
library_name: mlx
tags:
- mlx
- jang
- jangtq
- hy3
- hunyuan
- hy_v3
- moe
- apple-silicon
- 2bit
- 295b
- osaurus
pipeline_tag: text-generation
base_model: tencent/Hy3-preview
base_model_relation: quantized
---

<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

# Hy3-preview-JANGTQ

**Tencent Hy3-preview — 79 GB on disk** (down from the ~557 GB BF16 source) —
2-bit **JANGTQ** quantization on routed experts + 8-bit affine elsewhere.

- **Source:** [tencent/Hy3-preview](https://huggingface.co/tencent/Hy3-preview)
  (Hy3 architecture, 295B total / 21B active params, BF16 native, 256K context,
  80 transformer layers + 1 MTP layer, 192 routed experts top-8 + 1 shared)
- **Quantization:** **JANGTQ** — 2-bit MXTQ codebook (Hadamard-rotated,
  Lloyd-Max optimized) on routed-expert weights + 8-bit affine on
  attention / shared-expert / dense layer-0 / embed / lm_head / MTP
  matmuls + fp16 passthrough on RMSNorms / router gate / `expert_bias`
- **MTP:** layer-80 weights preserved (`mtp_mode=preserved_disabled`);
  decode runs one token per forward pass until the accept/reject
  speculative loop ships
- **Bundle size:** **79 GB on disk** across 85 shards
- **Runs on:** M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+

## What's in the bundle

| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (192 × 3 mats × 79 sparse layers, per-expert layout) | BF16 | **2-bit MXTQ** + sidecar codebook |
| Attention q/k/v/o + q/k norms | BF16 | 8-bit affine g=64 |
| Shared expert (gate/up/down) | BF16 | 8-bit affine g=64 |
| Dense layer-0 MLP | BF16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | BF16 | 8-bit affine g=64 |
| MTP layer matmuls | BF16 | 8-bit affine g=64 (preserved_disabled) |
| RMSNorms / `router.gate.weight` / `expert_bias` | BF16 / F32 | fp16 passthrough |

A `jangtq_runtime.safetensors` sidecar (~22 KB) ships for Swift runtimes —
it covers the `(in_features={1536, 4096}, seed=42, bits=2)` codebooks plus
sign-flip vectors.

## Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/Hy3-preview-JANGTQ")

chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="no_think",
)
```

`load_jangtq_model` auto-registers `model_type=hy_v3` via
`jang_tools.hy3` before building the MLX skeleton. The loader applies
the standard SwitchGLU fused gate+up, P15 router-compile, and P18 QKV
fusion patches automatically. Two Hy3-specific runtime fixes are baked
in:

1. **fp32 lm_head.** `enable_lm_head_fp32=True` in the bundle config —
   `Model.__call__` dequantizes the quantized lm_head and accumulates
   the 4096-dim contraction in fp32 (mirroring DSV4's pattern). With
   bf16 accumulation, logits drift by ~0.5 per element and top-k token
   picks flip toward high-baseline-energy junk tokens.
2. **qk_norm under JANGTQ P18 QKV fusion.** JANGTQ's QKV-fusion patch
   replaces the attention `__call__`; `Hy3Attention` declares
   `use_qk_norm=True` and uses `Hy3HeadRMSNorm` to auto-reshape the flat
   `[B, L, n_heads * head_dim]` input to per-head shape, so RMSNorm
   normalizes over `head_dim` rather than the entire flat dimension.

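The per-head reshape in fix 2 can be sketched in plain NumPy — an illustrative stand-alone version, not the actual `Hy3HeadRMSNorm` implementation:

```python
import numpy as np

def head_rms_norm(x_flat, weight, n_heads, head_dim, eps=1e-6):
    """RMS-normalize over head_dim only. Accepts the flat
    [B, L, n_heads * head_dim] activations the fused QKV path produces."""
    b, l, _ = x_flat.shape
    x = x_flat.reshape(b, l, n_heads, head_dim).astype(np.float32)
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    out = (x / rms) * weight  # weight: [head_dim], broadcast across heads
    return out.reshape(b, l, n_heads * head_dim)

x = np.random.default_rng(0).standard_normal((1, 2, 8)).astype(np.float32)
y = head_rms_norm(x, np.ones(4, dtype=np.float32), n_heads=2, head_dim=4)
```

Normalizing over the full flat dimension instead would mix statistics across heads — exactly the failure mode the fix avoids.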
Decode ~15 tok/s greedy on M5 Max 128 GB at `reasoning_effort=no_think`.

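The numeric effect behind the fp32 lm_head fix can be demonstrated in isolation. Here float16 stands in for bf16 (NumPy has no bf16 dtype) and a 4096-dim contraction mirrors the lm_head hidden size; this is an illustration of low-precision accumulation drift, not the runtime's code:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(4096).astype(np.float16)  # hidden state (rounded inputs)
w = rng.standard_normal(4096).astype(np.float16)  # one lm_head row

# Reference logit: same rounded inputs, accumulated in float64.
ref = np.dot(h.astype(np.float64), w.astype(np.float64))

# Low-precision path: every product and partial sum rounds to float16.
lo = np.add.reduce(h * w, dtype=np.float16)

# Upcast-then-accumulate, as the fp32 lm_head path does.
hi = np.dot(h.astype(np.float32), w.astype(np.float32))

err_lo = abs(float(lo) - ref)
err_hi = abs(float(hi) - ref)
```

The low-precision accumulation error is orders of magnitude larger, and at the logit scale that is enough to reorder top-k candidates.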
## Reasoning + tools

- **Reasoning parser:** `qwen3` (extracts `<think>...</think>` blocks)
- **Tool parser:** `hunyuan` (Tencent XML-like:
  `<tool_calls><tool_call>name<tool_sep><arg_key>k</arg_key><arg_value>v</arg_value></tool_call></tool_calls>`)
- **Reasoning effort:** `no_think` (default) | `low` | `high` — pass via
  `apply_chat_template(..., reasoning_effort="…")`
- **Default rendering:** the template emits a closed `<think></think>` for
  `no_think` mode; the runtime should NOT auto-open a reasoning prefix
  unless `low` or `high` is explicitly requested
- **Cache:** `kv` (standard GQA cache; no MLA, no SSM, no sliding window)

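The tool-call wire format above can be pulled apart with a few lines of regex. This is a minimal sketch of the format only, not the runtime's actual `hunyuan` parser:

```python
import re

def parse_hunyuan_tool_calls(text):
    """Extract tool calls from the XML-like format shown above."""
    calls = []
    for call in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.S):
        # The function name precedes <tool_sep>; args follow as key/value pairs.
        name, _, rest = call.partition("<tool_sep>")
        keys = re.findall(r"<arg_key>(.*?)</arg_key>", rest, re.S)
        vals = re.findall(r"<arg_value>(.*?)</arg_value>", rest, re.S)
        calls.append({"name": name.strip(), "arguments": dict(zip(keys, vals))})
    return calls

sample = ("<tool_calls><tool_call>get_weather<tool_sep>"
          "<arg_key>city</arg_key><arg_value>Paris</arg_value>"
          "</tool_call></tool_calls>")
```

`parse_hunyuan_tool_calls(sample)` yields one call named `get_weather` with `{"city": "Paris"}`; text without a `<tool_call>` block yields an empty list.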
## Top-K runtime override

`JANGTQ_TOPK_OVERRIDE=4 python serve.py` lowers the per-token expert
count from the trained 8 to 4 for a ~10% decode speedup. Coherence holds
on short prompts in our smoke tests; long-form quality is not
benchmarked. The patcher refuses to set K above the trained value and
logs the number of attributes it modified.

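The clamping behavior described above can be sketched as follows — a hypothetical helper assuming the trained top-k of 8, not the real patcher in the JANGTQ runtime:

```python
import os

TRAINED_TOP_K = 8  # Hy3's trained experts-per-token

def resolve_top_k(env=os.environ):
    """Honor JANGTQ_TOPK_OVERRIDE, but never exceed the trained value."""
    raw = env.get("JANGTQ_TOPK_OVERRIDE")
    if raw is None:
        return TRAINED_TOP_K
    k = int(raw)
    if not 1 <= k <= TRAINED_TOP_K:
        # Raising K above the trained value would route tokens to experts
        # the gate was never trained to combine, so refuse outright.
        raise ValueError(f"top-k override {k} outside [1, {TRAINED_TOP_K}]")
    return k
```

Refusing (rather than silently clamping) keeps a typo like `JANGTQ_TOPK_OVERRIDE=16` from masquerading as a valid run.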
## Credits

- **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
- **Source model:** Tencent Hy3-preview team
- **License:** [Tencent Hy Community License](LICENSE) — non-commercial, EU/UK/SK
  excluded; consult the LICENSE for full terms

## Validated runtime contract

- 80 layers materialize; 79 routed-expert SwitchGLU instances hydrate via
  TurboQuantLinear (2-bit MXTQ).
- Capabilities verify: `family=hy_v3`, `reasoning_parser=qwen3`,
  `tool_parser=hunyuan`, `think_in_template=False`, `supports_thinking=True`,
  `cache_type=kv`, `modality=text`.
- Coherence smoke (M5 Max 128 GB):
  - "What is 2 + 2?" → `4<|hy_eos|>` (15.2 tok/s)
  - "The capital of France is" → top-1 ` Paris` (logit 19.13)
  - "def fibonacci(n):" → top-1 `\n`, top-3 includes ` return`
- Hard-prompt benchmark coverage (HumanEval, MMLU, long-context) is
  pending. This bundle ships on smoke evidence; treat results beyond
  short prompts as preview-quality until benchmarks land.

## Runtime support matrix

| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working — this README's load snippet |
| `vmlx_engine` Python via re-export | pending — `vmlx_engine.loaders.load_jangtq_hy3` should re-export `jang_tools.hy3.runtime.load_hy3_model` |
| `vmlx-swift-lm` Swift | ❌ pending — `LLMModelFactory.dispatchHy3Unsupported` currently throws; needs a new `Hy3.swift` model class + JANGTQ Swift dispatch |
| MTP speculative decode | preserved-disabled — weights present in the bundle; accept/reject loop not yet implemented in any JANG runtime |