Osaurus-AI committed on
Commit f8c5df9 · verified · 1 Parent(s): c2d6f85

Add OsaurusAI README + banner

Files changed (1)
  1. README.md +145 -0
README.md ADDED
@@ -0,0 +1,145 @@
+ ---
+ license: other
+ license_name: tencent-hy-community
+ license_link: LICENSE
+ library_name: mlx
+ tags:
+ - mlx
+ - jang
+ - jangtq
+ - hy3
+ - hunyuan
+ - hy_v3
+ - moe
+ - apple-silicon
+ - 2bit
+ - 295b
+ - osaurus
+ pipeline_tag: text-generation
+ base_model: tencent/Hy3-preview
+ base_model_relation: quantized
+ ---
+
+ <p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>
+
+ # Hy3-preview-JANGTQ
+
+ **Tencent Hy3-preview — 79 GB on disk** (down from the ~557 GB BF16 source) —
+ 2-bit **JANGTQ** quantization on routed experts + 8-bit affine elsewhere.
+
+ - **Source:** [tencent/Hy3-preview](https://huggingface.co/tencent/Hy3-preview)
+   (Hy3 architecture, 295B total / 21B active, BF16 native, 256K context,
+   80 transformer layers + 1 MTP, 192 routed experts top-8 + 1 shared)
+ - **Quantization:** **JANGTQ** — 2-bit MXTQ codebook (Hadamard-rotated,
+   Lloyd-Max optimized; a toy sketch follows this list) on routed-expert
+   weights + 8-bit affine on attention / shared expert / dense layer-0 /
+   embed / lm_head / MTP matmuls + fp16 passthrough on RMSNorms /
+   router gate / `expert_bias`
+ - **MTP:** layer 80 weights preserved (`mtp_mode=preserved_disabled`);
+   decode is one token per forward pass until the accept/reject
+   speculative loop ships
+ - **Bundle size:** **79 GB on disk** across 85 shards
+ - **Runs on:** M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+
+
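+ The Lloyd-Max side of that codebook fit reduces to a 1-D k-means over the
+ (rotated) weights. A toy illustration only; the real MXTQ pipeline adds the
+ Hadamard rotation, sign-flip vectors, and per-matrix codebooks, all omitted
+ here, and `lloyd_max_2bit` is a name invented for this sketch:
+
+ ```python
+ import numpy as np
+
+ def lloyd_max_2bit(w: np.ndarray, iters: int = 25):
+     """Fit a 4-level (2-bit) Lloyd-Max codebook to a flat weight vector."""
+     levels = np.quantile(w, [0.125, 0.375, 0.625, 0.875])  # spread initialization
+     for _ in range(iters):
+         idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)  # nearest level
+         for j in range(4):  # move each level to the centroid of its cell
+             if np.any(idx == j):
+                 levels[j] = w[idx == j].mean()
+     return levels, idx
+
+ w = np.random.randn(4096).astype(np.float32)
+ codebook, codes = lloyd_max_2bit(w)
+ print(codebook)   # 4 optimized reconstruction levels
+ print(codes[:8])  # one 2-bit index per weight
+ ```
+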
+ ## What's in the bundle
+
+ | Module | Source dtype | Bundle dtype |
+ |---|---|---|
+ | Routed experts (192 × 3 mats × 79 sparse layers, per-expert layout) | BF16 | **2-bit MXTQ** + sidecar codebook |
+ | Attention q/k/v/o + q/k norms | BF16 | 8-bit affine g=64 |
+ | Shared expert (gate/up/down) | BF16 | 8-bit affine g=64 |
+ | Dense layer-0 MLP | BF16 | 8-bit affine g=64 |
+ | `embed_tokens` / `lm_head` | BF16 | 8-bit affine g=64 |
+ | MTP layer matmuls | BF16 | 8-bit affine g=64 (preserved_disabled) |
+ | RMSNorms / `router.gate.weight` / `expert_bias` | BF16 / F32 | fp16 passthrough |
+
+ A `jangtq_runtime.safetensors` sidecar (~22 KB) ships for Swift runtimes; it
+ covers the `(in_features={1536, 4096}, seed=42, bits=2)` codebooks plus the
+ sign-flip vectors.
+
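+ To peek at the sidecar's contents, a quick generic inspection with the
+ `safetensors` library (the tensor names inside the file are not documented
+ here, so this just lists whatever is present):
+
+ ```python
+ from safetensors import safe_open
+
+ # List every tensor in the sidecar with its shape and dtype.
+ with safe_open("jangtq_runtime.safetensors", framework="numpy") as f:
+     for name in f.keys():
+         t = f.get_tensor(name)
+         print(name, t.shape, t.dtype)
+ ```
+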
+ ## Loading (Python)
+
+ ```bash
+ pip install jang-tools mlx-lm
+ ```
+
+ ```python
+ from jang_tools.load_jangtq import load_jangtq_model
+
+ model, tokenizer = load_jangtq_model("OsaurusAI/Hy3-preview-JANGTQ")
+
+ chat = tokenizer.apply_chat_template(
+     [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
+     tokenize=False,
+     add_generation_prompt=True,
+     reasoning_effort="no_think",
+ )
+ ```
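+
+ From here, assuming the returned `model` and `tokenizer` are mlx-lm
+ compatible objects (an assumption; `jang-tools` may wrap them), generation
+ can go through `mlx_lm.generate`:
+
+ ```python
+ # A minimal sketch, assuming mlx-lm-compatible model/tokenizer objects.
+ from mlx_lm import generate
+
+ # `chat` is the templated prompt string built above.
+ text = generate(model, tokenizer, prompt=chat, max_tokens=64, verbose=True)
+ print(text)
+ ```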
+
+ `load_jangtq_model` auto-registers `model_type=hy_v3` via
+ `jang_tools.hy3` before building the MLX skeleton. The loader applies
+ the standard SwitchGLU fused gate+up, P15 router compile, and P18 QKV
+ fusion patches automatically. Two Hy3-specific runtime fixes are baked
+ in (sketched after this list):
+
+ 1. **fp32 lm_head.** `enable_lm_head_fp32=True` in the bundle config:
+    `Model.__call__` dequantizes the quantized lm_head and accumulates
+    the 4096-dim contraction in fp32 (mirroring DSV4's pattern). With
+    bf16 accumulation, logits drift by ~0.5 per element and top-k token
+    picks flip toward high-baseline-energy junk tokens.
+ 2. **qk_norm under JANGTQ P18 QKV fusion.** JANGTQ's QKV-fusion patch
+    replaces the attention `__call__`; `Hy3Attention` declares
+    `use_qk_norm=True` and uses `Hy3HeadRMSNorm` to auto-reshape flat
+    `[B, L, n_heads * head_dim]` input to per-head shape, so RMSNorm
+    normalizes over `head_dim` rather than the entire flat dimension.
+
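+ A minimal sketch of both fixes in plain MLX terms (illustrative only; the
+ real `Model.__call__` and `Hy3HeadRMSNorm` differ in detail, and the names
+ below are stand-ins, not the shipped implementation):
+
+ ```python
+ import mlx.core as mx
+ import mlx.nn as nn
+
+ # Fix 1 (sketch): accumulate the final vocab projection in fp32.
+ def lm_head_fp32(hidden: mx.array, w_dequant: mx.array) -> mx.array:
+     # hidden: [B, L, 4096]; w_dequant: dequantized [vocab, 4096] lm_head weights
+     return hidden.astype(mx.float32) @ w_dequant.T.astype(mx.float32)
+
+ # Fix 2 (sketch): RMSNorm over head_dim on a flat QKV-fused activation.
+ class HeadRMSNorm(nn.Module):
+     def __init__(self, head_dim: int, eps: float = 1e-6):
+         super().__init__()
+         self.head_dim, self.eps = head_dim, eps
+         self.weight = mx.ones((head_dim,))
+
+     def __call__(self, x: mx.array) -> mx.array:
+         B, L, flat = x.shape  # flat == n_heads * head_dim
+         xh = x.reshape(B, L, flat // self.head_dim, self.head_dim)
+         xh = mx.fast.rms_norm(xh, self.weight, self.eps)  # normalize per head
+         return xh.reshape(B, L, flat)
+ ```
+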
+ Decode runs at ~15 tok/s greedy on M5 Max 128 GB at `reasoning_effort=no_think`.
+
+ ## Reasoning + tools
+
+ - **Reasoning parser:** `qwen3` (extracts `<think>...</think>` blocks)
+ - **Tool parser:** `hunyuan` (Tencent XML-like:
+   `<tool_calls><tool_call>name<tool_sep><arg_key>k</arg_key><arg_value>v</arg_value></tool_call></tool_calls>`;
+   see the parse sketch after this list)
+ - **Reasoning effort:** `no_think` (default) | `low` | `high` — pass via
+   `apply_chat_template(..., reasoning_effort="…")`
+ - **Default rendering:** the template emits a closed `<think></think>` for
+   `no_think` mode; the runtime should NOT auto-open a reasoning prefix
+   unless `low` or `high` is explicitly requested
+ - **Cache:** `kv` (standard GQA cache; no MLA, no SSM, no sliding window)
+
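+ As a toy illustration of that wire format (the runtime uses the registered
+ `hunyuan` tool parser; this regex sketch, with a made-up `get_weather`
+ payload, is only for reading the format):
+
+ ```python
+ import re
+
+ # Example payload in the hunyuan tool-call wire format shown above.
+ raw = ("<tool_calls><tool_call>get_weather<tool_sep>"
+        "<arg_key>city</arg_key><arg_value>Paris</arg_value>"
+        "</tool_call></tool_calls>")
+
+ for call in re.findall(r"<tool_call>(.*?)</tool_call>", raw, re.S):
+     name, _, args = call.partition("<tool_sep>")
+     keys = re.findall(r"<arg_key>(.*?)</arg_key>", args, re.S)
+     vals = re.findall(r"<arg_value>(.*?)</arg_value>", args, re.S)
+     print(name, dict(zip(keys, vals)))  # get_weather {'city': 'Paris'}
+ ```
+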
+ ## Top-K runtime override
+
+ `JANGTQ_TOPK_OVERRIDE=4 python serve.py` lowers the per-token expert count
+ from the trained 8 to 4 for a ~10% decode speedup. Coherence holds on
+ short prompts in our smoke tests; long-form quality is not benchmarked.
+ The patcher refuses to set K above the trained value and logs the number
+ of attributes it modified (sketched below).
+
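+ A minimal sketch of that clamp-and-count behavior (illustrative; the module
+ traversal and the `num_experts_per_tok` attribute name are assumptions, not
+ the real patcher):
+
+ ```python
+ import os
+
+ def apply_topk_override(model, trained_k: int = 8) -> int:
+     """Hypothetical sketch: clamp K at the trained value and count
+     how many router attributes were modified."""
+     k = min(int(os.environ.get("JANGTQ_TOPK_OVERRIDE", trained_k)), trained_k)
+     modified = 0
+     for module in model.modules():  # assumes an mlx.nn-style module tree
+         if hasattr(module, "num_experts_per_tok"):  # assumed attribute name
+             module.num_experts_per_tok = k
+             modified += 1
+     print(f"[jangtq] top-k override: K={k}, modified {modified} attribute(s)")
+     return modified
+ ```
+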
+ ## Credits
+
+ - **Quantization + MLX runtime:** Jinho Jang ([eric@osaurus.ai](mailto:eric@osaurus.ai))
+ - **Source model:** Tencent Hy3-preview team
+ - **License:** [Tencent Hy Community License](LICENSE) — non-commercial, EU/UK/SK
+   excluded; consult the LICENSE for full terms
+
+ ## Validated runtime contract
+
+ - 80 layers materialize; 79 routed-expert SwitchGLU instances hydrate via
+   TurboQuantLinear (2-bit MXTQ).
+ - Capabilities verify: `family=hy_v3`, `reasoning_parser=qwen3`,
+   `tool_parser=hunyuan`, `think_in_template=False`, `supports_thinking=True`,
+   `cache_type=kv`, `modality=text`.
+ - Coherence smoke (M5 Max 128 GB):
+   - "What is 2 + 2?" → `4<|hy_eos|>` (15.2 tok/s)
+   - "The capital of France is" → top-1 ` Paris` (logit 19.13)
+   - "def fibonacci(n):" → top-1 `\n`, top-3 includes ` return`
+ - Hard-prompt benchmark coverage (HumanEval, MMLU, long-context) is
+   pending. This bundle is shipped on smoke evidence; treat results
+   beyond short prompts as preview-quality until benchmarks land.
+
+ ## Runtime support matrix
+
+ | Surface | Status |
+ |---|---|
+ | `jang-tools` Python (`load_jangtq_model`) | ✅ working — this README's load snippet |
+ | `vmlx_engine` Python via re-export | pending — `vmlx_engine.loaders.load_jangtq_hy3` should re-export `jang_tools.hy3.runtime.load_hy3_model` |
+ | `vmlx-swift-lm` Swift | ❌ pending — `LLMModelFactory.dispatchHy3Unsupported` currently throws; needs new `Hy3.swift` model class + JANGTQ Swift dispatch |
+ | MTP speculative decode | preserved-disabled — weights present in bundle, accept/reject loop not yet implemented in any JANG runtime |