--- license: apache-2.0 library_name: mlx base_model: Zyphra/ZAYA1-8B base_model_relation: quantized pipeline_tag: text-generation tags: - zaya - mixture-of-experts - hybrid-attention - cca-attention - mlx - apple-silicon - reasoning - tool-use - quantized - jang - jangtq - jangtq-k - mixed-precision - mxtq - jangtq-prestack - osaurus ---

OsaurusAI

# ZAYA1-8B-JANGTQ_K **Zyphra/ZAYA1-8B — 3.4 GB on disk** — **mixed-bit JANGTQ_K** quantization that recovers ZAYA's quality at the 2-3k cumulative-token coherence ceiling where the prior `JANGTQ2` tier collapsed. - **Source:** [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B) (80 layers alternating CCA attention + top-1 MoE, 16 routed experts + MOD skip route, 8.4 B total / 760 M active, hybrid cache) - **Quantization:** **mixed-bit MXTQ** on routed experts: - `down_proj`: **4-bit** (output enters residual stream — most sensitive) - `gate_proj`: **2-bit** (gated through SwiGLU) - `up_proj`: **2-bit** (multiplied with gate) - attention / embed / lm_head: 8-bit affine - norms / router / conv_qk / biases: fp16 / fp32 passthrough - **Routed-expert layout:** **pre-stacked along axis 0** under `zaya_block.experts.switch_mlp.{gate_proj, up_proj, down_proj}` per the JANGTQ-PRESTACK standard. Sidecar `jangtq_runtime.safetensors` (~8 KB) ships both `(in=2048, bits=2)` and `(in=2048, bits=4)` codebooks + sign-flip vector for Swift runtimes. - **Bundle size:** **~3.4 GB on-disk** (~2.67 bits avg routed weight) - **Runs on:** M3 Max 32 GB+ / M4 / M5 / Mac Studio ## Why mixed-bit? ZAYA1-8B is **top-1 MoE with MOD passthrough** — every routed token rides ONE expert's quantization error, with no top-k averaging to smooth out the noise. At plain 2-bit (`JANGTQ2`) the residual stream accumulates codebook noise and collapses into short-phrase loops past ~2-3 k cumulative output tokens (documented at `~/osaurus-staging/docs/JANGTQ2_QUALITY_LIMITS.md`). `JANGTQ_K` spends 4 bits on `down_proj` (the projection whose output feeds the residual stream) and keeps 2 bits on `gate_proj` / `up_proj` (gated through SwiGLU's multiplicative path, much less sensitive). Same total budget as ~2.67-bit but quality close to 4-bit on the matmul whose noise actually matters. ## Loading (Python) ```bash pip install jang-tools mlx-lm ``` ```python from jang_tools.load_jangtq import load_jangtq_model model, tokenizer = load_jangtq_model("OsaurusAI/ZAYA1-8B-JANGTQ_K") chat = tokenizer.apply_chat_template( [{{"role": "user", "content": "What is 2 + 2?"}}], tokenize=False, add_generation_prompt=True, ) ``` `load_jangtq_model` auto-registers `model_type=zaya` via `jang_tools.zaya` before building the MLX skeleton. ## Validated runtime contract - 80 layers materialize; 40 sparse-MoE layers hydrate routed experts via TurboQuantLinear with per-projection bit widths (gate=2 / up=2 / down=4). - Capabilities: `family=zaya`, `reasoning_parser=qwen3`, `tool_parser=zaya_xml`, `supports_thinking=True`, `think_in_template=False`, `cache_type=hybrid`. - Single-prompt smoke: "2+2=4", "Paris", recursive `fibonacci(n)` — short, on-topic, fast. - **Multi-turn smoke**: 3-turn code+tests+README run → 6,177 chars cumulative, well past the 2-3 k JANGTQ2 ceiling, **no loops / no repetition / no off-topic collapse**. ## Runtime support matrix | Surface | Status | |---|---| | `jang-tools` Python (`load_jangtq_model`) | ✅ working — this README's load snippet | | `vmlx-swift-lm` Swift | ✅ working — `Libraries/MLXLLM/Models/Zaya.swift` + JANGTQ codebook dispatch | ## Reasoning + tools - **Reasoning parser:** `qwen3` (extracts `...` blocks) - **Tool parser:** `zaya_xml` (Zyphra wrapper around standard XML tool calls — see `Tool/Parsers/ZayaXMLToolCallParser.swift`) - **Cache:** `hybrid` (CCA + standard KV; convolution state preserved per CCA layer + previous-hidden-state side-channel) ## Credits - **Quantization + MLX runtime:** Jinho Jang (eric@osaurus.ai) - **Source model:** Zyphra ZAYA1 team - **License:** Apache-2.0, inherited from upstream