---
license: apache-2.0
library_name: mlx
base_model: Zyphra/ZAYA1-8B
base_model_relation: quantized
pipeline_tag: text-generation
tags:
  - zaya
  - mixture-of-experts
  - hybrid-attention
  - cca-attention
  - mlx
  - apple-silicon
  - reasoning
  - tool-use
  - quantized
  - jang
  - jangtq
  - jangtq-k
  - mixed-precision
  - mxtq
  - jangtq-prestack
  - osaurus
---

<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

# ZAYA1-8B-JANGTQ_K

**Zyphra/ZAYA1-8B, 3.4 GB on disk.** A **mixed-bit JANGTQ_K** quantization
that restores coherent output at the ~2-3 k cumulative-token mark where
the prior `JANGTQ2` tier collapsed.

- **Source:** [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B)
  (80 layers alternating CCA attention + top-1 MoE, 16 routed experts +
  MOD skip route, 8.4 B total / 760 M active, hybrid cache)
- **Quantization:** **mixed-bit MXTQ** on routed experts:
  - `down_proj`: **4-bit** (output enters residual stream — most sensitive)
  - `gate_proj`: **2-bit** (gated through SwiGLU)
  - `up_proj`:   **2-bit** (multiplied with gate)
  - attention / embed / lm_head: 8-bit affine
  - norms / router / conv_qk / biases: fp16 / fp32 passthrough
- **Routed-expert layout:** **pre-stacked along axis 0** under
  `zaya_block.experts.switch_mlp.{gate_proj, up_proj, down_proj}` per
  the JANGTQ-PRESTACK standard. Sidecar `jangtq_runtime.safetensors`
  (~8 KB) ships both `(in=2048, bits=2)` and `(in=2048, bits=4)`
  codebooks + sign-flip vector for Swift runtimes.
- **Bundle size:** **~3.4 GB on-disk** (~2.67 bits avg routed weight)
- **Runs on:** M3 Max 32 GB+ / M4 / M5 / Mac Studio
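
To make the pre-stacked layout concrete, here is a minimal sketch in plain Python. It only illustrates the idea of stacking per-expert matrices along axis 0 and selecting one by router index under top-1 routing. The matrix sizes, the helper names `prestack` and `expert_matvec`, and the use of plain floats are all assumptions for illustration; the real bundle stores packed quantized codes, not float lists.

```python
NUM_EXPERTS = 16  # routed experts per MoE layer, as stated in this card


def prestack(per_expert):
    """Stack per-expert matrices into one [num_experts, out, in] nested list.

    Hypothetical helper: axis 0 of the result is the expert index.
    """
    return [per_expert[e] for e in range(len(per_expert))]


def expert_matvec(stacked, expert_idx, x):
    """Top-1 routing: pick a single expert's matrix and apply it to x."""
    weight = stacked[expert_idx]
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Toy 2x3 matrices (illustrative sizes only); each entry equals the expert index.
per_expert = {e: [[float(e)] * 3 for _ in range(2)] for e in range(NUM_EXPERTS)}
stacked = prestack(per_expert)

# The router picked expert 5 for this token.
y = expert_matvec(stacked, 5, [1.0, 1.0, 1.0])
print(len(stacked), y)
```

Keeping all experts in one stacked tensor per projection lets a runtime index by expert id instead of looking up 16 separately named tensors per layer.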

## Why mixed-bit?

ZAYA1-8B is **top-1 MoE with MOD passthrough** — every routed token rides
ONE expert's quantization error, with no top-k averaging to smooth out
the noise. At plain 2-bit (`JANGTQ2`) the residual stream accumulates
codebook noise and collapses into short-phrase loops past ~2-3 k
cumulative output tokens (documented at
`~/osaurus-staging/docs/JANGTQ2_QUALITY_LIMITS.md`).
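
The "no top-k averaging" point can be illustrated with a small idealized calculation. The sketch below assumes each selected expert adds an independent, zero-mean quantization error of equal variance, which is a simplification and not a measurement of this model. Under that assumption, mixing experts with gate weights `g_i` scales the error's standard deviation by `sqrt(sum(g_i^2))`. The function name `residual_noise_scale` is hypothetical.

```python
import math


def residual_noise_scale(gate_weights):
    """Error std of the mixed expert output, relative to one expert's error std.

    Idealized assumption: independent, zero-mean, equal-variance errors
    per selected expert.
    """
    return math.sqrt(sum(w * w for w in gate_weights))


top1 = residual_noise_scale([1.0])               # a single expert, as in ZAYA
top4_uniform = residual_noise_scale([0.25] * 4)  # hypothetical top-4 router

print(f"top-1 noise scale: {top1:.2f}")
print(f"top-4 noise scale: {top4_uniform:.2f}")
```

With a single routed expert the scale stays at 1.0, so the full quantization error of that expert reaches the residual stream; a uniform top-4 mix under the same assumptions would halve it.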

`JANGTQ_K` spends 4 bits on `down_proj` (the projection whose output
feeds the residual stream) and keeps 2 bits on `gate_proj` / `up_proj`
(gated through SwiGLU's multiplicative path, and much less sensitive).
Because the three projections hold equal parameter counts, the routed
experts average (2 + 2 + 4) / 3 ≈ 2.67 bits per weight, while the one
matmul whose noise matters most runs at 4-bit precision.
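
The ~2.67-bit figure can be reproduced with a short calculation. This is a sketch that assumes the three projections hold equal parameter counts (true for a standard SwiGLU MLP) and ignores any per-group scale or codebook overhead; `average_bits` is a hypothetical helper, not part of `jang-tools`.

```python
# Bit widths per routed-expert projection, as listed in this card.
BITS = {"gate_proj": 2, "up_proj": 2, "down_proj": 4}


def average_bits(bits, param_counts=None):
    """Parameter-weighted average bit width across projections.

    If no counts are given, every projection is assumed to be the same size.
    """
    if param_counts is None:
        param_counts = {name: 1 for name in bits}
    total = sum(param_counts[name] for name in bits)
    return sum(bits[name] * param_counts[name] for name in bits) / total


avg = average_bits(BITS)
print(f"{avg:.2f} bits per routed weight")
```

For comparison, a uniform 2-bit tier averages 2.0 bits, so the mixed layout costs roughly a third more routed-expert storage in exchange for a 4-bit `down_proj`.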

## Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/ZAYA1-8B-JANGTQ_K")

chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Assumption: the loaded model and tokenizer work with mlx-lm's standard
# generate() helper (mlx-lm is installed above).
from mlx_lm import generate

print(generate(model, tokenizer, prompt=chat, max_tokens=256))
```

`load_jangtq_model` auto-registers `model_type=zaya` via
`jang_tools.zaya` before building the MLX skeleton.

## Validated runtime contract

- 80 layers materialize; 40 sparse-MoE layers hydrate routed experts via
  TurboQuantLinear with per-projection bit widths (gate=2 / up=2 / down=4).
- Capabilities: `family=zaya`, `reasoning_parser=qwen3`,
  `tool_parser=zaya_xml`, `supports_thinking=True`,
  `think_in_template=False`, `cache_type=hybrid`.
- Single-prompt smoke: "2+2=4", "Paris", recursive `fibonacci(n)` —
  short, on-topic, fast.
- **Multi-turn smoke**: 3-turn code + tests + README run totaling 6,177
  characters of cumulative output (the ~2-3 k JANGTQ2 ceiling is quoted
  in tokens), with **no loops, no repetition, and no off-topic
  collapse**.

## Runtime support matrix

| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working — this README's load snippet |
| `vmlx-swift-lm` Swift | ✅ working — `Libraries/MLXLLM/Models/Zaya.swift` + JANGTQ codebook dispatch |

## Reasoning + tools

- **Reasoning parser:** `qwen3` (extracts `<think>...</think>` blocks)
- **Tool parser:** `zaya_xml` (Zyphra wrapper around standard XML tool
  calls — see `Tool/Parsers/ZayaXMLToolCallParser.swift`)
- **Cache:** `hybrid` (CCA + standard KV; convolution state preserved
  per CCA layer + previous-hidden-state side-channel)
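
As an illustration of what extracting `<think>...</think>` blocks involves, here is a minimal regex-based sketch. It is not the actual `qwen3` parser used by either runtime: it assumes complete, well-formed tags in a finished string and does not handle streaming or unclosed blocks. The name `split_reasoning` is hypothetical.

```python
import re

# Non-greedy match so multiple think blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from a complete raw model output string."""
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>2 + 2 is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Since the card lists `think_in_template=False`, the sketch expects the model output itself to contain the opening `<think>` tag rather than relying on the chat template to supply it.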

## Credits

- **Quantization + MLX runtime:** Jinho Jang (eric@osaurus.ai)
- **Source model:** Zyphra ZAYA1 team
- **License:** Apache-2.0, inherited from upstream