OsaurusAI/ZAYA1-VL-8B-JANGTQ_K

Mixed-bit JANGTQ_K quantization of Zyphra/ZAYA1-VL-8B, ~5.0 GB on disk. Same ZAYA1 LM trunk as the text bundle, plus the Qwen2.5-VL vision tower and image-token LoRA gating.

  • Source: Zyphra/ZAYA1-VL-8B (80 LM layers alternating CCA + top-1 MoE; 32-layer ViT vision tower; per-expert LoRA fc1 / fc2 gated by image tokens)
  • Quantization: mixed-bit JANGTQ_K on routed-expert MLP projections:
    • down_proj: 4-bit / gate_proj: 2-bit / up_proj: 2-bit
    • attention / shared expert / dense / embed / lm_head: 8-bit affine
    • vision tower: 8-bit affine (preserved Qwen2.5-VL Conv3d patch embed axis layout)
    • image-token LoRA blocks: fp16 passthrough
    • norms / router / conv_qk / biases: fp16 / fp32 passthrough
  • Routed-expert layout: pre-stacked under zaya_block.experts.switch_mlp.{gate, up, down}_proj (JANGTQ-PRESTACK). Sidecar jangtq_runtime.safetensors ships both 2-bit + 4-bit codebooks (see the inspection sketch after this list).
  • Bundle size: ~5.0 GB on-disk
  • Runs on: M3 Max 32 GB+ / M4 / M5 / Mac Studio
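
The pre-stacked layout and the sidecar can be sanity-checked once the bundle is downloaded locally. A minimal sketch; the file name comes from the layout note above, and the keys printed are simply whatever the sidecar actually ships:

import mlx.core as mx

# Inspect the JANGTQ sidecar (2-bit + 4-bit codebooks). The file name comes from
# this card's layout note; key names and shapes are whatever the bundle ships.
runtime = mx.load("jangtq_runtime.safetensors")
for name, tensor in sorted(runtime.items()):
    print(f"{name:60s} {str(tensor.dtype):10s} {tuple(tensor.shape)}")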

Why mixed-bit?

Same rationale as the text bundle: ZAYA1 is a top-1 MoE with MoD passthrough, so plain 2-bit quantization collapses past roughly 2-3k cumulative output tokens. JANGTQ_K recovers quality by spending 4 bits on down_proj (the projection back into the residual stream) and 2 bits on the SwiGLU-gated gate / up projections.
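
As a rough illustration of where the bit budget goes, here is a minimal sketch using plain MLX affine quantization (not the codebook kernels that JANGTQ_K / TurboQuantLinear actually use):

import mlx.core as mx

# Illustration only: affine-quantize one SwiGLU expert MLP with the JANGTQ_K bit split.
def quantize_expert(gate_w, up_w, down_w, group_size=64):
    gate_q = mx.quantize(gate_w, group_size=group_size, bits=2)  # SwiGLU gate: 2-bit
    up_q   = mx.quantize(up_w,   group_size=group_size, bits=2)  # SwiGLU up: 2-bit
    down_q = mx.quantize(down_w, group_size=group_size, bits=4)  # residual-stream output: 4-bit
    return gate_q, up_q, down_q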

Loading (Python)

pip install jang-tools mlx-vlm

from jang_tools.load_jangtq_vlm import load_jangtq_vlm_model
from mlx_vlm import generate

model, processor = load_jangtq_vlm_model("OsaurusAI/ZAYA1-VL-8B-JANGTQ_K")

prompt = processor.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    tokenize=False, add_generation_prompt=True,
)
print(generate(model, processor, prompt, image=None,
               max_tokens=64, temperature=0.0).text)

load_jangtq_vlm_model auto-builds the MLX-VLM skeleton (vision tower + language model + image-token gating) and replaces the routed-expert projections with TurboQuantLinear in the same step.
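
For the vision path, the same objects can be reused. A minimal sketch, assuming the bundled processor accepts the usual Qwen2.5-VL-style structured chat content and that generate takes a local image path via the same image= keyword used above ("photo.jpg" is a placeholder):

# Vision-path sketch (assumptions: Qwen2.5-VL-style chat template on the bundled
# processor; image= accepts a local file path; "photo.jpg" is a placeholder).
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in one sentence."},
]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)
print(generate(model, processor, prompt, image="photo.jpg",
               max_tokens=64, temperature=0.0).text)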

Validated runtime contract

  • 80 LM layers + 32-layer ViT materialize; 40 routed-expert SwitchGLU instances hydrate at JANGTQ_K bit widths.
  • Capabilities: family=zaya1_vl, modality=vision, reasoning_parser=qwen3, tool_parser=zaya_xml, supports_thinking=True, think_in_template=False, cache_type=hybrid.
  • Text-path single-prompt smoke: "2+2=4", "Paris" — 57-108 tok/s on M5 Max.
  • Multi-turn smoke: 3-turn code+tests+README → 5,446 chars cumulative, completely clean (no loops / no repetition).
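
A multi-turn smoke along those lines amounts to re-applying the chat template over a growing message list with the objects from the load snippet. A minimal sketch; the prompts are illustrative, not the exact conversation behind the numbers above:

# Multi-turn smoke sketch: accumulate history, regenerate each turn.
messages = []
for user_turn in ["Write a FizzBuzz function.",
                  "Now add unit tests for it.",
                  "Write a short README for the project."]:
    messages.append({"role": "user", "content": user_turn})
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
    )
    reply = generate(model, processor, prompt, image=None,
                     max_tokens=512, temperature=0.0).text
    messages.append({"role": "assistant", "content": reply})
print(sum(len(m["content"]) for m in messages if m["role"] == "assistant"), "chars cumulative")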

Runtime support matrix

  • jang-tools Python (load_jangtq_vlm_model): ✅ working — this README's load snippet
  • vmlx-swift-lm Swift: ✅ working — Libraries/MLXVLM/Models/Zaya1VL.swift + JANGTQ codebook dispatch

Credits

  • Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
  • Source model: Zyphra ZAYA1 team
  • License: Apache-2.0, inherited from upstream