Instructions for using OsaurusAI/ZAYA1-8B-JANGTQ_K with libraries and local apps.

Libraries

How to use OsaurusAI/ZAYA1-8B-JANGTQ_K with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/ZAYA1-8B-JANGTQ_K")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use OsaurusAI/ZAYA1-8B-JANGTQ_K with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/ZAYA1-8B-JANGTQ_K"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "OsaurusAI/ZAYA1-8B-JANGTQ_K"
        }
      ]
    }
  }
}
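
If `~/.pi/agent/models.json` already lists other providers, pasting the block above over the file would drop them. The following is a minimal sketch of merging the entry instead. It assumes only the schema shown above; `mlx_provider_entry` and `add_provider` are hypothetical helper names, not part of Pi.

```python
import json
from pathlib import Path

MODEL_ID = "OsaurusAI/ZAYA1-8B-JANGTQ_K"


def mlx_provider_entry(model_id: str = MODEL_ID, port: int = 8080) -> dict:
    """Build the provider block shown above for a local mlx_lm.server."""
    return {
        "baseUrl": f"http://localhost:{port}/v1",
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }


def add_provider(config_path: Path, name: str = "mlx-lm") -> dict:
    """Merge the provider into config_path, keeping any existing providers."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config.setdefault("providers", {})[name] = mlx_provider_entry()
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example call (hypothetical helper, path taken from the comment above):
# add_provider(Path.home() / ".pi" / "agent" / "models.json")
```

The example call is left commented out so that running the snippet does not touch your home directory until you opt in.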

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use OsaurusAI/ZAYA1-8B-JANGTQ_K with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/ZAYA1-8B-JANGTQ_K"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default OsaurusAI/ZAYA1-8B-JANGTQ_K

Run Hermes

hermes

MLX LM

How to use OsaurusAI/ZAYA1-8B-JANGTQ_K with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "OsaurusAI/ZAYA1-8B-JANGTQ_K"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "OsaurusAI/ZAYA1-8B-JANGTQ_K"
# Call the OpenAI-compatible server with curl from a second terminal
# (mlx_lm.server listens on port 8080 unless --port is given)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "OsaurusAI/ZAYA1-8B-JANGTQ_K",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
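
The same request can be sent from Python with only the standard library. This is a sketch, assuming the server runs on its default port 8080 (the address used in the Pi and Hermes configurations above) and returns the usual OpenAI chat-completion response shape. `build_chat_request` and `chat` are illustrative names, not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "OsaurusAI/ZAYA1-8B-JANGTQ_K"


def build_chat_request(user_text: str,
                       base_url: str = "http://localhost:8080/v1",
                       model: str = MODEL_ID) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text: str) -> str:
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(user_text), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` prints the model's reply. Building the request separately from sending it makes the payload easy to inspect without a live server.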

ZAYA1-8B-JANGTQ_K / README.md

Osaurus-AI

Initial JANGTQ_K release: mixed-bit (down=4, gate=2, up=2) routed experts

d0bf184 verified 16 days ago


	---
	license: apache-2.0
	library_name: mlx
	base_model: Zyphra/ZAYA1-8B
	base_model_relation: quantized
	pipeline_tag: text-generation
	tags:
	- zaya
	- mixture-of-experts
	- hybrid-attention
	- cca-attention
	- mlx
	- apple-silicon
	- reasoning
	- tool-use
	- quantized
	- jang
	- jangtq
	- jangtq-k
	- mixed-precision
	- mxtq
	- jangtq-prestack
	- osaurus
	---

	<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

	# ZAYA1-8B-JANGTQ_K

	A 3.4 GB mixed-bit JANGTQ_K quantization of Zyphra/ZAYA1-8B. It stays
	coherent past the 2-3k cumulative-token ceiling at which the prior
	`JANGTQ2` tier collapsed.

	- Source: [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B)
	  (80 layers alternating CCA attention + top-1 MoE, 16 routed experts +
	  MOD skip route, 8.4 B total / 760 M active, hybrid cache)
	- Quantization: mixed-bit MXTQ on routed experts:
	  - `down_proj`: 4-bit (output enters residual stream; most sensitive)
	  - `gate_proj`: 2-bit (gated through SwiGLU)
	  - `up_proj`: 2-bit (multiplied with gate)
	  - attention / embed / lm_head: 8-bit affine
	  - norms / router / conv_qk / biases: fp16 / fp32 passthrough
	- Routed-expert layout: pre-stacked along axis 0 under
	  `zaya_block.experts.switch_mlp.{gate_proj, up_proj, down_proj}` per
	  the JANGTQ-PRESTACK standard. Sidecar `jangtq_runtime.safetensors`
	  (~8 KB) ships both `(in=2048, bits=2)` and `(in=2048, bits=4)`
	  codebooks + sign-flip vector for Swift runtimes.
	- Bundle size: ~3.4 GB on-disk (~2.67 bits avg routed weight)
	- Runs on: M3 Max 32 GB+ / M4 / M5 / Mac Studio

	## Why mixed-bit?

	ZAYA1-8B is a top-1 MoE with MOD passthrough, so every routed token
	carries ONE expert's quantization error, with no top-k averaging to
	smooth out the noise. At plain 2-bit (`JANGTQ2`) the residual stream
	accumulates codebook noise and collapses into short-phrase loops past
	~2-3k cumulative output tokens (documented at
	`~/osaurus-staging/docs/JANGTQ2_QUALITY_LIMITS.md`).
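
The averaging argument can be made concrete with a toy model. It assumes each expert's quantization error is independent with the same standard deviation, and that a top-k layer mixes expert outputs with router weights that sum to 1. Under those assumptions the mixed error scales by the square root of the sum of squared weights. This is an illustration only, not a measurement of ZAYA1.

```python
import math


def noise_scale(router_weights):
    """Error std of the mixed output, relative to one expert's error std.

    Toy assumption: independent, equal-variance errors per expert.
    """
    total = sum(router_weights)
    normalized = [w / total for w in router_weights]
    return math.sqrt(sum(w * w for w in normalized))


top1 = noise_scale([1.0])        # one expert, nothing to average
top2 = noise_scale([0.5, 0.5])   # two equally weighted experts
top4 = noise_scale([0.25] * 4)   # four equally weighted experts
print(f"top-1: {top1:.3f}  top-2: {top2:.3f}  top-4: {top4:.3f}")
# prints: top-1: 1.000  top-2: 0.707  top-4: 0.500
```

With top-1 routing the scale is 1.0, so the full error of the selected expert reaches the residual stream.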

	`JANGTQ_K` spends 4 bits on `down_proj` (the projection whose output
	feeds the residual stream) and keeps 2 bits on `gate_proj` / `up_proj`
	(gated through SwiGLU's multiplicative path, much less sensitive). The
	total budget matches a uniform ~2.67-bit quantization, but the matmul
	whose noise actually matters gets close to 4-bit quality.
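
The ~2.67-bit figure follows from simple arithmetic. The sketch below assumes the three projections of a routed expert hold the same number of weights (as in a standard SwiGLU MLP, where each is hidden size times intermediate size) and ignores codebook and scale overhead. `average_bits` is an illustrative helper, not part of `jang-tools`.

```python
# Bit widths per routed-expert projection, as listed in this README.
BITS = {"gate_proj": 2, "up_proj": 2, "down_proj": 4}


def average_bits(bits, weights=None):
    """Weight-count-weighted mean bit width across projections."""
    if weights is None:
        # Assumption: every projection holds the same number of weights.
        weights = {name: 1 for name in bits}
    total = sum(weights[name] for name in bits)
    return sum(bits[name] * weights[name] for name in bits) / total


avg = average_bits(BITS)
print(f"average routed-expert bits: {avg:.2f}")  # (2 + 2 + 4) / 3 -> 2.67
```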

	## Loading (Python)

	```bash
	pip install jang-tools mlx-lm
	```

	```python
	from jang_tools.load_jangtq import load_jangtq_model

	model, tokenizer = load_jangtq_model("OsaurusAI/ZAYA1-8B-JANGTQ_K")

	# Render the conversation to a prompt string
	# (tokenize=False returns text instead of token ids).
	chat = tokenizer.apply_chat_template(
	    [{"role": "user", "content": "What is 2 + 2?"}],
	    tokenize=False,
	    add_generation_prompt=True,
	)
	```

	`load_jangtq_model` auto-registers `model_type=zaya` via
	`jang_tools.zaya` before building the MLX skeleton.

	## Validated runtime contract

	- 80 layers materialize; 40 sparse-MoE layers hydrate routed experts via
	TurboQuantLinear with per-projection bit widths (gate=2 / up=2 / down=4).
	- Capabilities: `family=zaya`, `reasoning_parser=qwen3`,
	`tool_parser=zaya_xml`, `supports_thinking=True`,
	`think_in_template=False`, `cache_type=hybrid`.
	- Single-prompt smoke: "2+2=4", "Paris", recursive `fibonacci(n)` —
	short, on-topic, fast.
	- Multi-turn smoke: 3-turn code+tests+README run → 6,177 chars
	cumulative, well past the 2-3 k JANGTQ2 ceiling, **no loops / no
	repetition / no off-topic collapse**.

	## Runtime support matrix

	| Surface | Status |
	|---|---|
	| `jang-tools` Python (`load_jangtq_model`) | ✅ working: this README's load snippet |
	| `vmlx-swift-lm` Swift | ✅ working: `Libraries/MLXLLM/Models/Zaya.swift` + JANGTQ codebook dispatch |

	## Reasoning + tools

	- Reasoning parser: `qwen3` (extracts `<think>...</think>` blocks)
	- Tool parser: `zaya_xml` (Zyphra wrapper around standard XML tool
	calls — see `Tool/Parsers/ZayaXMLToolCallParser.swift`)
	- Cache: `hybrid` (CCA + standard KV; convolution state preserved
	per CCA layer + previous-hidden-state side-channel)
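
To show what extracting `<think>...</think>` blocks means in practice, here is a minimal regex-based sketch that separates reasoning from the final answer in raw model output. It is not the `qwen3` parser shipped with any runtime. `split_reasoning` is a hypothetical helper, and unterminated `<think>` tags are deliberately left unhandled.

```python
import re

# Non-greedy match so several think blocks are captured separately;
# DOTALL lets the reasoning span multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from raw model output."""
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>2 + 2 is basic addition.</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
print(reasoning)  # 2 + 2 is basic addition.
print(answer)     # The answer is 4.
```

Output without any think block passes through unchanged as the answer, with an empty reasoning string.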

	## Credits

	- Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
	- Source model: Zyphra ZAYA1 team
	- License: Apache-2.0, inherited from upstream