OsaurusAI/ZAYA1-VL-8B-JANGTQ_K

Mixed-bit JANGTQ_K quantization of Zyphra/ZAYA1-VL-8B, ~5.0 GB on disk. Same ZAYA1 LM trunk as the text bundle, plus the Qwen2.5-VL vision tower and image-token LoRA gating.

  • Source: Zyphra/ZAYA1-VL-8B (80 LM layers alternating CCA + top-1 MoE; 32-layer ViT vision tower; per-expert LoRA fc1 / fc2 gated by image tokens)
  • Quantization: mixed-bit JANGTQ_K on routed-expert MLP projections:
    • down_proj: 4-bit / gate_proj: 2-bit / up_proj: 2-bit
    • attention / shared expert / dense / embed / lm_head: 8-bit affine
    • vision tower: 8-bit affine (preserved Qwen2.5-VL Conv3d patch embed axis layout)
    • image-token LoRA blocks: fp16 passthrough
    • norms / router / conv_qk / biases: fp16 / fp32 passthrough
  • Routed-expert layout: pre-stacked under zaya_block.experts.switch_mlp.{gate, up, down}_proj (JANGTQ-PRESTACK). Sidecar jangtq_runtime.safetensors ships both 2-bit + 4-bit codebooks (see the inspection sketch after this list).
  • Bundle size: ~5.0 GB on-disk
  • Runs on: M3 Max 32 GB+ / M4 / M5 / Mac Studio
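
The pre-stacked layout and the sidecar can be sanity-checked once the bundle is downloaded locally. A minimal sketch; the file name comes from the layout note above, and the keys printed are simply whatever the sidecar actually ships:

import mlx.core as mx

# Inspect the JANGTQ sidecar (2-bit + 4-bit codebooks). The file name comes from
# this card's layout note; key names and shapes are whatever the bundle ships.
runtime = mx.load("jangtq_runtime.safetensors")
for name, tensor in sorted(runtime.items()):
    print(f"{name:60s} {str(tensor.dtype):10s} {tuple(tensor.shape)}")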

Why mixed-bit?

Same rationale as the text bundle: ZAYA1 is a top-1 MoE with MoD passthrough, so plain 2-bit quantization collapses past roughly 2-3k cumulative output tokens. JANGTQ_K recovers quality by spending 4 bits on down_proj (the projection back into the residual stream) and 2 bits on the SwiGLU-gated gate / up projections.
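
As a rough illustration of where the bit budget goes, here is a minimal sketch using plain MLX affine quantization (not the codebook kernels that JANGTQ_K / TurboQuantLinear actually use):

import mlx.core as mx

# Illustration only: affine-quantize one SwiGLU expert MLP with the JANGTQ_K bit split.
def quantize_expert(gate_w, up_w, down_w, group_size=64):
    gate_q = mx.quantize(gate_w, group_size=group_size, bits=2)  # SwiGLU gate: 2-bit
    up_q   = mx.quantize(up_w,   group_size=group_size, bits=2)  # SwiGLU up: 2-bit
    down_q = mx.quantize(down_w, group_size=group_size, bits=4)  # residual-stream output: 4-bit
    return gate_q, up_q, down_q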

Loading (Python)

pip install jang-tools mlx-vlm

from jang_tools.load_jangtq_vlm import load_jangtq_vlm_model
from mlx_vlm import generate

model, processor = load_jangtq_vlm_model("OsaurusAI/ZAYA1-VL-8B-JANGTQ_K")

prompt = processor.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    tokenize=False, add_generation_prompt=True,
)
print(generate(model, processor, prompt, image=None,
               max_tokens=64, temperature=0.0).text)

load_jangtq_vlm_model auto-builds the MLX-VLM skeleton (vision tower + language model + image-token gating) and replaces the routed-expert projections with TurboQuantLinear in the same step.
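
For the vision path, the same objects can be reused. A minimal sketch, assuming the bundled processor accepts the usual Qwen2.5-VL-style structured chat content and that generate takes a local image path via the same image= keyword used above ("photo.jpg" is a placeholder):

# Vision-path sketch (assumptions: Qwen2.5-VL-style chat template on the bundled
# processor; image= accepts a local file path; "photo.jpg" is a placeholder).
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in one sentence."},
]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)
print(generate(model, processor, prompt, image="photo.jpg",
               max_tokens=64, temperature=0.0).text)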

Validated runtime contract

  • 80 LM layers + 32-layer ViT materialize; 40 routed-expert SwitchGLU instances hydrate at JANGTQ_K bit widths.
  • Capabilities: family=zaya1_vl, modality=vision, reasoning_parser=qwen3, tool_parser=zaya_xml, supports_thinking=True, think_in_template=False, cache_type=hybrid.
  • Text-path single-prompt smoke: "2+2=4", "Paris" — 57-108 tok/s on M5 Max.
  • Multi-turn smoke: 3-turn code+tests+README → 5,446 chars cumulative, completely clean (no loops / no repetition).
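
A multi-turn smoke along those lines amounts to re-applying the chat template over a growing message list with the objects from the load snippet. A minimal sketch; the prompts are illustrative, not the exact conversation behind the numbers above:

# Multi-turn smoke sketch: accumulate history, regenerate each turn.
messages = []
for user_turn in ["Write a FizzBuzz function.",
                  "Now add unit tests for it.",
                  "Write a short README for the project."]:
    messages.append({"role": "user", "content": user_turn})
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
    )
    reply = generate(model, processor, prompt, image=None,
                     max_tokens=512, temperature=0.0).text
    messages.append({"role": "assistant", "content": reply})
print(sum(len(m["content"]) for m in messages if m["role"] == "assistant"), "chars cumulative")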

Runtime support matrix

  • jang-tools Python (load_jangtq_vlm_model): ✅ working — this README's load snippet
  • vmlx-swift-lm Swift: ✅ working — Libraries/MLXVLM/Models/Zaya1VL.swift + JANGTQ codebook dispatch

Credits

  • Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
  • Source model: Zyphra ZAYA1 team
  • License: Apache-2.0, inherited from upstream