Tags: Image-Text-to-Text · MLX · Safetensors · zaya1_vl · zaya · zaya1-vl · vision-language · mixture-of-experts · hybrid-attention · cca-attention · apple-silicon · reasoning · tool-use · quantized · jang · jangtq · jangtq-k · mixed-precision · jangtq-prestack · osaurus · conversational
How to use OsaurusAI/ZAYA1-VL-8B-JANGTQ_K with MLX:
```python
# Make sure mlx-vlm is installed:
# pip install --upgrade mlx-vlm
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model, processor = load("OsaurusAI/ZAYA1-VL-8B-JANGTQ_K")
config = load_config("OsaurusAI/ZAYA1-VL-8B-JANGTQ_K")

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

# Generate output
output = generate(model, processor, formatted_prompt, image)
print(output)
```
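mlx-vlm also exposes a command-line entry point for one-off checks. The invocation below is a convenience sketch based on current mlx-vlm releases (module path and flags are stock mlx-vlm, nothing specific to this bundle), so flag names may shift between versions:

```bash
# One-shot generation via the stock mlx-vlm CLI (flags may vary by mlx-vlm version)
python -m mlx_vlm.generate \
  --model OsaurusAI/ZAYA1-VL-8B-JANGTQ_K \
  --image http://images.cocodataset.org/val2017/000000039769.jpg \
  --prompt "Describe this image." \
  --max-tokens 100 --temperature 0.0
```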

ZAYA1-VL-8B-JANGTQ_K
Zyphra/ZAYA1-VL-8B — 5.0 GB on disk — mixed-bit JANGTQ_K quantization. Same ZAYA1 LM trunk as the text bundle + Qwen2.5-VL vision tower + image-token LoRA gating.
- Source: Zyphra/ZAYA1-VL-8B (80 LM layers alternating CCA + top-1 MoE; 32-layer ViT vision tower; per-expert LoRA fc1 / fc2 gated by image tokens)
- Quantization: mixed-bit MXTQ on routed-expert MLP projections:
  - `down_proj`: 4-bit / `gate_proj`: 2-bit / `up_proj`: 2-bit
- attention / shared expert / dense / embed / lm_head: 8-bit affine
- vision tower: 8-bit affine (preserved Qwen2.5-VL Conv3d patch embed axis layout)
- image-token LoRA blocks: fp16 passthrough
- norms / router / conv_qk / biases: fp16 / fp32 passthrough
- Routed-expert layout: pre-stacked under `zaya_block.experts.switch_mlp.{gate, up, down}_proj` (JANGTQ-PRESTACK). The sidecar `jangtq_runtime.safetensors` ships both the 2-bit and 4-bit codebooks; see the inspection sketch after this list.
- Bundle size: ~5.0 GB on disk
- Runs on: M3 Max 32 GB+ / M4 / M5 / Mac Studio
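To sanity-check the pre-stacked expert layout and the codebook sidecar on disk, here is a minimal inspection sketch. It only lists tensor names; `snapshot_download` and `safe_open` are the standard `huggingface_hub` / `safetensors` APIs, and the key filter simply follows the layout described above:

```python
# Minimal sketch: list pre-stacked routed-expert tensors and the JANGTQ sidecar keys.
from pathlib import Path

from huggingface_hub import snapshot_download
from safetensors import safe_open

repo_dir = Path(snapshot_download("OsaurusAI/ZAYA1-VL-8B-JANGTQ_K"))

for shard in sorted(repo_dir.glob("*.safetensors")):
    if shard.name == "jangtq_runtime.safetensors":
        continue  # sidecar handled separately below
    with safe_open(str(shard), framework="np") as f:
        expert_keys = [k for k in f.keys() if "experts.switch_mlp" in k]
    print(f"{shard.name}: {len(expert_keys)} pre-stacked expert tensors")

# The sidecar ships both the 2-bit and 4-bit codebooks used at runtime.
with safe_open(str(repo_dir / "jangtq_runtime.safetensors"), framework="np") as f:
    print(len(f.keys()), "runtime tensors, e.g.", sorted(f.keys())[:4])
```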
Why mixed-bit?
Same rationale as the text bundle: ZAYA1 is top-1 MoE with MOD passthrough, so plain 2-bit collapses past ~2-3k cumulative output tokens. JANGTQ_K recovers quality by spending 4 bits on `down_proj` (the residual-stream output projection) and 2 bits on the SwiGLU-gated `gate_proj` / `up_proj`.
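The actual JANGTQ_K quantizer is not reproduced here, but the bit allocation itself is easy to state in code. Below is a minimal sketch of the same per-projection policy using `mlx.nn.quantize`'s `class_predicate` hook (recent MLX versions let the predicate return a dict of per-layer quantization kwargs); the path matching and group size are illustrative assumptions, not the recipe used to build this bundle:

```python
# Sketch of the mixed-bit policy only; not the actual JANGTQ_K quantizer.
# Assumes an MLX version where class_predicate may return False, True, or a
# dict of per-layer quantization kwargs (group_size below is illustrative).
import mlx.nn as nn

def jangtq_k_policy(path: str, module: nn.Module):
    # Norms, router, conv_qk, biases, LoRA blocks: leave in fp16/fp32.
    if not isinstance(module, (nn.Linear, nn.Embedding)):
        return False
    if "lora" in path:
        return False
    # Routed-expert MLP projections get the mixed 4/2-bit treatment.
    if "switch_mlp" in path:
        if "down_proj" in path:
            return {"bits": 4, "group_size": 64}  # residual-stream output
        return {"bits": 2, "group_size": 64}      # gate_proj / up_proj
    # Attention, shared expert, dense layers, embeddings, lm_head: 8-bit affine.
    return {"bits": 8, "group_size": 64}

# nn.quantize(model, class_predicate=jangtq_k_policy)
```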
Loading (Python)
```bash
pip install jang-tools mlx-vlm
```

```python
from jang_tools.load_jangtq_vlm import load_jangtq_vlm_model
from mlx_vlm import generate

model, processor = load_jangtq_vlm_model("OsaurusAI/ZAYA1-VL-8B-JANGTQ_K")

prompt = processor.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    tokenize=False, add_generation_prompt=True,
)

print(generate(model, processor, prompt, image=None,
               max_tokens=64, temperature=0.0).text)
```
`load_jangtq_vlm_model` auto-builds the MLX-VLM skeleton (vision tower + language model + image-token gating) and replaces the routed-expert projections with `TurboQuantLinear` in the same step.
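For an image-conditioned call, the loaded model can be driven through the same mlx-vlm helpers shown in the MLX snippet at the top of the card. This is a sketch assuming those helpers compose with the jang-tools loader as-is; the image URL and generation settings are placeholders:

```python
# Sketch: vision call through the jang-tools loader plus stock mlx-vlm helpers.
from jang_tools.load_jangtq_vlm import load_jangtq_vlm_model
from mlx_vlm import generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load_jangtq_vlm_model("OsaurusAI/ZAYA1-VL-8B-JANGTQ_K")
config = load_config("OsaurusAI/ZAYA1-VL-8B-JANGTQ_K")

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
formatted = apply_chat_template(processor, config, "Describe this image.", num_images=1)

print(generate(model, processor, formatted, image,
               max_tokens=128, temperature=0.0).text)
```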
Validated runtime contract
- 80 LM layers + 32-layer ViT materialize; 40 routed-expert SwitchGLU instances hydrate at JANGTQ_K bit widths.
- Capabilities: `family=zaya1_vl`, `modality=vision`, `reasoning_parser=qwen3`, `tool_parser=zaya_xml`, `supports_thinking=True`, `think_in_template=False`, `cache_type=hybrid`.
- Text-path single-prompt smoke: "2+2=4", "Paris" — 57-108 tok/s on M5 Max.
- Multi-turn smoke: 3-turn code+tests+README → 5,446 chars cumulative, completely clean (no loops / no repetition).
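That multi-turn check amounts to accumulating the chat history and re-applying the template each turn. A minimal sketch of such a loop, reusing the loading snippet above; the turn prompts and token budget are placeholders, not the original smoke-test transcript:

```python
# Sketch of a multi-turn smoke loop: accumulate history, re-apply the chat
# template each turn. Prompts and max_tokens are illustrative placeholders.
from jang_tools.load_jangtq_vlm import load_jangtq_vlm_model
from mlx_vlm import generate

model, processor = load_jangtq_vlm_model("OsaurusAI/ZAYA1-VL-8B-JANGTQ_K")

messages, total_chars = [], 0
for turn in ["Write a function that reverses a string.",
             "Add unit tests for it.",
             "Write a short README for the module."]:
    messages.append({"role": "user", "content": turn})
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    reply = generate(model, processor, prompt, image=None,
                     max_tokens=512, temperature=0.0).text
    messages.append({"role": "assistant", "content": reply})
    total_chars += len(reply)

print(f"cumulative chars: {total_chars}")
```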
Runtime support matrix
| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_vlm_model`) | ✅ working — this README's load snippet |
| `vmlx-swift-lm` Swift | ✅ working — `Libraries/MLXVLM/Models/Zaya1VL.swift` + JANGTQ codebook dispatch |
Credits
- Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
- Source model: Zyphra ZAYA1 team
- License: Apache-2.0, inherited from upstream