Instructions for using OsaurusAI/ZAYA1-8B-JANGTQ_K with libraries and local apps.

How to use OsaurusAI/ZAYA1-8B-JANGTQ_K with MLX:

```python
# Make sure mlx-lm is installed:
#   pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/ZAYA1-8B-JANGTQ_K")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
How to use OsaurusAI/ZAYA1-8B-JANGTQ_K with Pi:

Start the MLX server

```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/ZAYA1-8B-JANGTQ_K"
```

Configure the model in Pi

```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```

Add to `~/.pi/agent/models.json`:

```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "OsaurusAI/ZAYA1-8B-JANGTQ_K" }
      ]
    }
  }
}
```

Run Pi

```bash
# Start Pi in your project directory:
pi
```
How to use OsaurusAI/ZAYA1-8B-JANGTQ_K with Hermes Agent:

Start the MLX server

```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/ZAYA1-8B-JANGTQ_K"
```

Configure Hermes

```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default OsaurusAI/ZAYA1-8B-JANGTQ_K
```

Run Hermes

```bash
hermes
```
How to use OsaurusAI/ZAYA1-8B-JANGTQ_K with MLX LM:

Generate or start a chat session

```bash
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "OsaurusAI/ZAYA1-8B-JANGTQ_K"
```

Run an OpenAI-compatible server

```bash
# Install MLX LM
uv tool install mlx-lm

# Start the server
mlx_lm.server --model "OsaurusAI/ZAYA1-8B-JANGTQ_K"

# Call the OpenAI-compatible server with curl
# (mlx_lm.server listens on port 8080 unless --port is given)
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "OsaurusAI/ZAYA1-8B-JANGTQ_K",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
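
The same chat-completions call can be made from Python without extra dependencies. The following is a minimal sketch, assuming the server was started with `mlx_lm.server` on its default port 8080 and returns the usual OpenAI-style response shape (`choices[0].message.content`); the helper names `build_chat_request` and `send_chat` are illustrative and not part of any library.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is running locally on port 8080 (adjust if you pass --port).
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "OsaurusAI/ZAYA1-8B-JANGTQ_K"


def build_chat_request(user_text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(user_text: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_text)) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, `send_chat("Hello")` would return the model's reply as a string. Building the request separately from sending it makes the payload easy to inspect before any network call happens.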
---
license: apache-2.0
library_name: mlx
base_model: Zyphra/ZAYA1-8B
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- zaya
- mixture-of-experts
- hybrid-attention
- cca-attention
- mlx
- apple-silicon
- reasoning
- tool-use
- quantized
- jang
- jangtq
- jangtq-k
- mixed-precision
- mxtq
- jangtq-prestack
- osaurus
---
<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

# ZAYA1-8B-JANGTQ_K

**Zyphra/ZAYA1-8B, 3.4 GB on disk**, with **mixed-bit JANGTQ_K** quantization
that restores ZAYA's output quality beyond the 2-3k cumulative-token
coherence ceiling at which the prior `JANGTQ2` tier collapsed.

- **Source:** [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B)
  (80 layers alternating CCA attention + top-1 MoE, 16 routed experts +
  MOD skip route, 8.4 B total / 760 M active, hybrid cache)
- **Quantization:** **mixed-bit MXTQ** on routed experts:
  - `down_proj`: **4-bit** (output enters the residual stream; most sensitive)
  - `gate_proj`: **2-bit** (gated through SwiGLU)
  - `up_proj`: **2-bit** (multiplied with gate)
  - attention / embed / lm_head: 8-bit affine
  - norms / router / conv_qk / biases: fp16 / fp32 passthrough
- **Routed-expert layout:** **pre-stacked along axis 0** under
  `zaya_block.experts.switch_mlp.{gate_proj, up_proj, down_proj}` per
  the JANGTQ-PRESTACK standard. Sidecar `jangtq_runtime.safetensors`
  (~8 KB) ships both `(in=2048, bits=2)` and `(in=2048, bits=4)`
  codebooks + sign-flip vector for Swift runtimes.
- **Bundle size:** **~3.4 GB on-disk** (~2.67 bits avg routed weight)
- **Runs on:** M3 Max 32 GB+ / M4 / M5 / Mac Studio
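
The ~2.67-bit average follows from the per-projection bit widths listed above. As a rough check, assuming the three routed-expert projections hold the same number of weights and ignoring codebook or scale overhead (both are assumptions, not stated in this card), the arithmetic can be sketched as follows; `average_bits` is an illustrative helper, not part of `jang-tools`.

```python
# Bits per weight for each routed-expert projection, as listed above.
BITS = {"gate_proj": 2, "up_proj": 2, "down_proj": 4}


def average_bits(bits_by_proj, params_by_proj=None):
    """Parameter-weighted mean bit width across projections."""
    if params_by_proj is None:
        # Assumption: all projections hold the same number of weights.
        params_by_proj = {name: 1 for name in bits_by_proj}
    total_params = sum(params_by_proj.values())
    total_bits = sum(bits_by_proj[n] * params_by_proj[n] for n in bits_by_proj)
    return total_bits / total_params


print(f"{average_bits(BITS):.2f} bits per routed weight")  # prints "2.67 bits per routed weight"
```

Passing real parameter counts through `params_by_proj` would give the exact figure if the projections differ in size.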
## Why mixed-bit?

ZAYA1-8B is a **top-1 MoE with MOD passthrough**: every routed token carries
ONE expert's quantization error, with no top-k averaging to smooth out
the noise. At plain 2-bit (`JANGTQ2`) the residual stream accumulates
codebook noise and collapses into short-phrase loops past ~2-3k
cumulative output tokens (documented at
`~/osaurus-staging/docs/JANGTQ2_QUALITY_LIMITS.md`).

`JANGTQ_K` spends 4 bits on `down_proj` (the projection whose output
feeds the residual stream) and keeps 2 bits on `gate_proj` / `up_proj`
(gated through SwiGLU's multiplicative path, much less sensitive). The
routed-expert budget averages ~2.67 bits per weight while delivering
close to 4-bit quality on the matmul whose noise actually matters.
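
To illustrate why the lack of top-k averaging matters, here is a toy simulation, not a measurement of this model. It assumes each expert contributes independent zero-mean Gaussian noise of equal size and that the k selected experts are averaged with equal weights (real routers use learned weights); under those assumptions the combined noise shrinks roughly as 1/sqrt(k).

```python
import random
import statistics


def combined_noise_std(k: int, sigma: float = 1.0, trials: int = 20000, seed: int = 0) -> float:
    """Empirical std of the mean of k independent Gaussian noise terms."""
    rng = random.Random(seed)
    samples = [
        sum(rng.gauss(0.0, sigma) for _ in range(k)) / k
        for _ in range(trials)
    ]
    return statistics.pstdev(samples)


# Toy comparison: one expert per token versus an equal-weight average of four.
top1 = combined_noise_std(k=1)
top4 = combined_noise_std(k=4)
print(f"top-1 noise std ~ {top1:.2f}, top-4 noise std ~ {top4:.2f}")
```

In this toy setting the top-4 figure comes out at about half the top-1 figure, which is the smoothing a top-1 router does not get.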
## Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/ZAYA1-8B-JANGTQ_K")

# Render the chat template to a prompt string (no tokenization yet).
chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    tokenize=False,
    add_generation_prompt=True,
)
```

`load_jangtq_model` auto-registers `model_type=zaya` via
`jang_tools.zaya` before building the MLX skeleton.
## Validated runtime contract

- 80 layers materialize; 40 sparse-MoE layers hydrate routed experts via
  TurboQuantLinear with per-projection bit widths (gate=2 / up=2 / down=4).
- Capabilities: `family=zaya`, `reasoning_parser=qwen3`,
  `tool_parser=zaya_xml`, `supports_thinking=True`,
  `think_in_template=False`, `cache_type=hybrid`.
- Single-prompt smoke: "2+2=4", "Paris", recursive `fibonacci(n)`.
  Responses are short, on-topic, and fast.
- **Multi-turn smoke**: 3-turn code+tests+README run → 6,177 chars
  cumulative, well past the 2-3k JANGTQ2 ceiling, **no loops / no
  repetition / no off-topic collapse**.
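
A loop check of the kind implied by the multi-turn smoke test can be approximated with a simple n-gram counter. This is a minimal sketch, not the harness used for the validation above; the function names, the whitespace tokenization, and the `n=4` / `threshold=5` defaults are all illustrative assumptions.

```python
from collections import Counter


def max_ngram_repeats(text: str, n: int = 4) -> int:
    """Return the highest count of any word n-gram in the text."""
    words = text.split()
    if len(words) < n:
        return 0
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(grams.values())


def looks_looped(text: str, n: int = 4, threshold: int = 5) -> bool:
    """Flag text where one n-gram recurs at least `threshold` times."""
    return max_ngram_repeats(text, n) >= threshold


print(looks_looped("the cat sat on " * 10))                        # True
print(looks_looped("a quick brown fox jumps over the lazy dog"))  # False
```

Running this over the concatenated output of each turn gives a cheap signal for the short-phrase loops described earlier; legitimate repetition (for example boilerplate in generated code) can trigger it, so the threshold needs tuning per workload.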
## Runtime support matrix

| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working (this README's load snippet) |
| `vmlx-swift-lm` Swift | ✅ working (`Libraries/MLXLLM/Models/Zaya.swift` + JANGTQ codebook dispatch) |
## Reasoning + tools

- **Reasoning parser:** `qwen3` (extracts `<think>...</think>` blocks)
- **Tool parser:** `zaya_xml` (Zyphra wrapper around standard XML tool
  calls; see `Tool/Parsers/ZayaXMLToolCallParser.swift`)
- **Cache:** `hybrid` (CCA + standard KV; convolution state preserved
  per CCA layer + previous-hidden-state side-channel)
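
For clients that receive raw text rather than parsed output, the `<think>...</think>` extraction can be approximated with a regular expression. This is a simplified sketch and not the actual `qwen3` parser: `split_reasoning` is a hypothetical helper, and it does not handle unterminated `<think>` tags or streaming output.

```python
import re

# Non-greedy match so multiple think blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>2 + 2 is basic arithmetic.</think>The answer is 4."
print(split_reasoning(raw))  # ('2 + 2 is basic arithmetic.', 'The answer is 4.')
```

Text without any think block passes through unchanged as the answer, with an empty reasoning string.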
## Credits

- **Quantization + MLX runtime:** Jinho Jang (eric@osaurus.ai)
- **Source model:** Zyphra ZAYA1 team
- **License:** Apache-2.0, inherited from upstream