---
license: apache-2.0
library_name: mlx
base_model: Zyphra/ZAYA1-8B
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- zaya
- mixture-of-experts
- hybrid-attention
- cca-attention
- mlx
- apple-silicon
- reasoning
- tool-use
- quantized
- jang
- jangtq
- jangtq-k
- mixed-precision
- mxtq
- jangtq-prestack
- osaurus
---
<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

# ZAYA1-8B-JANGTQ_K
**Zyphra/ZAYA1-8B in 3.4 GB on disk**, using **mixed-bit JANGTQ_K** quantization
that recovers ZAYA's quality past the 2-3k cumulative-token coherence
ceiling where the prior `JANGTQ2` tier collapsed.
- **Source:** [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B)
(80 layers alternating CCA attention + top-1 MoE, 16 routed experts +
MOD skip route, 8.4 B total / 760 M active, hybrid cache)
- **Quantization:** **mixed-bit MXTQ** on routed experts:
  - `down_proj`: **4-bit** (output enters the residual stream; most sensitive)
  - `gate_proj`: **2-bit** (gated through SwiGLU)
  - `up_proj`: **2-bit** (multiplied with gate)
  - attention / embed / lm_head: 8-bit affine
  - norms / router / conv_qk / biases: fp16 / fp32 passthrough
- **Routed-expert layout:** **pre-stacked along axis 0** under
`zaya_block.experts.switch_mlp.{gate_proj, up_proj, down_proj}` per
the JANGTQ-PRESTACK standard. Sidecar `jangtq_runtime.safetensors`
(~8 KB) ships both `(in=2048, bits=2)` and `(in=2048, bits=4)`
codebooks + sign-flip vector for Swift runtimes.
- **Bundle size:** **~3.4 GB on-disk** (~2.67 bits avg routed weight)
- **Runs on:** M3 Max 32 GB+ / M4 / M5 / Mac Studio
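
The ~2.67-bit average quoted above follows from the per-projection widths. The check below is a minimal sketch that assumes the three routed-expert projections hold the same number of weights (as in a standard SwiGLU MLP) and ignores codebook and metadata overhead:

```python
# Per-projection bit widths for the routed experts, as listed above.
bits = {"down_proj": 4, "gate_proj": 2, "up_proj": 2}

# Assumption: all three projections have equal weight counts,
# so the plain mean is the average bits per routed weight.
avg_bits = sum(bits.values()) / len(bits)
print(f"{avg_bits:.2f} bits per routed weight")  # 2.67 bits per routed weight
```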
## Why mixed-bit?
ZAYA1-8B is **top-1 MoE with MOD passthrough** — every routed token rides
ONE expert's quantization error, with no top-k averaging to smooth out
the noise. At plain 2-bit (`JANGTQ2`) the residual stream accumulates
codebook noise and collapses into short-phrase loops past ~2-3 k
cumulative output tokens (documented at
`~/osaurus-staging/docs/JANGTQ2_QUALITY_LIMITS.md`).
`JANGTQ_K` spends 4 bits on `down_proj` (the projection whose output
feeds the residual stream) and keeps 2 bits on `gate_proj` / `up_proj`
(gated through SwiGLU's multiplicative path, much less sensitive). The
routed-expert average stays at ~2.67 bits per weight, while the matmul
whose noise actually matters gets 4-bit precision.
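
To make the roles of the three projections concrete, here is a toy pure-Python sketch of one SwiGLU expert. The dimensions and weights are made up for illustration and are not ZAYA1-8B's; the helper names (`silu`, `matvec`, `swiglu_expert`) are hypothetical and not part of `jang-tools`. It shows that `gate_proj` and `up_proj` only meet inside an elementwise product, while the `down_proj` output is what gets added to the residual stream:

```python
import math

def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def swiglu_expert(x, w_gate, w_up, w_down):
    """One SwiGLU expert: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]   # gate_proj (2-bit in JANGTQ_K)
    up = matvec(w_up, x)                          # up_proj (2-bit in JANGTQ_K)
    hidden = [g * u for g, u in zip(gate, up)]    # multiplicative path
    return matvec(w_down, hidden)                 # down_proj (4-bit in JANGTQ_K)

# Toy sizes: model width 2, expert width 3.
x = [1.0, -2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w_down = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]

# The expert output is added directly to the residual stream.
residual = [xi + oi for xi, oi in zip(x, swiglu_expert(x, w_gate, w_up, w_down))]
print(residual)
```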
## Loading (Python)
```bash
pip install jang-tools mlx-lm
```
```python
from jang_tools.load_jangtq import load_jangtq_model

# Load the quantized model and its tokenizer.
model, tokenizer = load_jangtq_model("OsaurusAI/ZAYA1-8B-JANGTQ_K")

# Render a single-turn chat as a prompt string.
chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    tokenize=False,
    add_generation_prompt=True,
)
```
`load_jangtq_model` auto-registers `model_type=zaya` via
`jang_tools.zaya` before building the MLX skeleton.
## Validated runtime contract
- 80 layers materialize; 40 sparse-MoE layers hydrate routed experts via
TurboQuantLinear with per-projection bit widths (gate=2 / up=2 / down=4).
- Capabilities: `family=zaya`, `reasoning_parser=qwen3`,
`tool_parser=zaya_xml`, `supports_thinking=True`,
`think_in_template=False`, `cache_type=hybrid`.
- Single-prompt smoke: "2+2=4", "Paris", recursive `fibonacci(n)`.
  Answers are short, on-topic, and fast.
- **Multi-turn smoke**: 3-turn code+tests+README run → 6,177 chars
cumulative, well past the 2-3 k JANGTQ2 ceiling, **no loops / no
repetition / no off-topic collapse**.
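
The source does not describe how loops were detected in the multi-turn run. As one possible way to screen a transcript for the short-phrase loops described earlier, the sketch below measures how many word n-grams repeat an earlier one. The function name `repeated_ngram_ratio` and any threshold you apply to its result are assumptions, not part of the validated contract:

```python
def repeated_ngram_ratio(text, n=4):
    """Fraction of word n-grams that duplicate another n-gram (0.0 = no repeats)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

healthy = "def fibonacci(n): return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)"
looping = "the answer is " * 20

print(repeated_ngram_ratio(healthy))
print(repeated_ngram_ratio(looping))
```

A ratio near 0 suggests varied output, while a ratio near 1 suggests the model is cycling through the same phrase.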
## Runtime support matrix
| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working — this README's load snippet |
| `vmlx-swift-lm` Swift | ✅ working — `Libraries/MLXLLM/Models/Zaya.swift` + JANGTQ codebook dispatch |
## Reasoning + tools
- **Reasoning parser:** `qwen3` (extracts `<think>...</think>` blocks)
- **Tool parser:** `zaya_xml` (Zyphra wrapper around standard XML tool
calls — see `Tool/Parsers/ZayaXMLToolCallParser.swift`)
- **Cache:** `hybrid` (CCA + standard KV; convolution state preserved
per CCA layer + previous-hidden-state side-channel)
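
For readers post-processing raw output themselves, the following is a minimal sketch of separating `<think>...</think>` blocks from the final answer. It is a hypothetical stand-in written with the standard `re` module, not the actual `qwen3` parser shipped by any runtime, and it assumes the think tags are well-formed and closed:

```python
import re

# Non-greedy match so multiple think blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Return (reasoning, answer): think-block contents and the remaining text."""
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>2 + 2 is 4.</think>The answer is 4.")
print(reasoning)  # 2 + 2 is 4.
print(answer)     # The answer is 4.
```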
## Credits
- **Quantization + MLX runtime:** Jinho Jang (eric@osaurus.ai)
- **Source model:** Zyphra ZAYA1 team
- **License:** Apache-2.0, inherited from upstream