ZAYA1-VL-8B-JANGTQ4
Quantized Zyphra/ZAYA1-VL-8B for Apple Silicon runtimes.
| Source | Zyphra/ZAYA1-VL-8B |
| License | Apache-2.0, inherited from upstream |
| Format | JANGTQ4 |
| Modality | image+text |
| Bundle size | 6.29 GiB |
| Tensor keys | 5315 |
| Expert layout | Pre-stacked zaya_block.experts.switch_mlp |
| Runtime status | Generation coherence: NOT INDEPENDENTLY PASSED for the quantized runtime bundle (missing coherence report); published as a format/runtime bundle pending downstream ZAYA runtime validation. |
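For reference, the bundle can be pulled locally before pointing a ZAYA-aware runtime at it. This is a minimal sketch using the standard huggingface_hub API; the destination directory is an arbitrary choice, not part of the bundle.

```python
# Minimal sketch: fetch the quantized bundle for a ZAYA-aware runtime.
# Uses only the standard huggingface_hub API; the local directory is a hypothetical choice.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="OsaurusAI/ZAYA1-VL-8B-JANGTQ4",
    local_dir="./ZAYA1-VL-8B-JANGTQ4",  # hypothetical destination
)
print(f"Bundle downloaded to {local_dir}")
```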
Important Runtime Note
This bundle requires a ZAYA-aware JANGTQ runtime that implements CCA attention state plus pre-stacked switch_mlp TurboQuant experts.
ZAYA1-VL fuses Zyphra's text-ZAYA decoder (CCA attention + top-1 MoE) with the Qwen2.5-VL vision tower. Vision-LoRA modulates the LM trunk only at vision-token positions; text positions decode unmodified. Use this bundle only with a runtime that implements the ZAYA CCA state contract and the converted pre-stacked expert layout.
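To make the vision-token gating concrete, here is a minimal NumPy sketch of a LoRA update that is applied only at image-token positions. The function name, shapes, and scaling are illustrative assumptions for exposition, not the runtime's actual module layout; the image token id and rank-8 attention LoRA come from the architecture summary below.

```python
# Illustrative sketch of vision-gated LoRA: the low-rank delta is applied only at
# positions whose token id equals image_token_id; text positions pass through unchanged.
# Names and shapes here are assumptions for illustration, not the runtime's API.
import numpy as np

def gated_lora_forward(x, token_ids, W, lora_a, lora_b, image_token_id=262147):
    """x: (seq, hidden); W: (hidden, out); lora_a: (hidden, r); lora_b: (r, out)."""
    base = x @ W                                 # ungated base projection
    delta = (x @ lora_a) @ lora_b                # low-rank LoRA update
    vision_mask = (token_ids == image_token_id)  # True only at image-token positions
    return base + vision_mask[:, None] * delta   # gate the update to vision tokens

# Toy shapes: hidden=2048, rank-8 attention LoRA as listed in the architecture summary.
seq, hidden, out, rank = 4, 2048, 2048, 8
x = np.random.randn(seq, hidden).astype(np.float32)
token_ids = np.array([101, 262147, 262147, 102])
y = gated_lora_forward(
    x, token_ids,
    np.random.randn(hidden, out).astype(np.float32) * 0.02,
    np.random.randn(hidden, rank).astype(np.float32) * 0.02,
    np.random.randn(rank, out).astype(np.float32) * 0.02,
)
print(y.shape)  # (4, 2048)
```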
Runtime Pin Required
Zyphra `model_type=zaya1_vl` is not yet implemented in vmlx-swift-lm or stock mlx_vlm. The bundle is conversion-ready and structurally verified, but image-text decoding requires either:
- Zyphra's `transformers @ git+https://github.com/Zyphra/transformers.git@zaya1-vl` fork (BF16 source-side reference), or
- A `Zaya1VLMLX` adapter (in development at `jang-runtime/Sources/JANG/Zaya1VL/`).
Until the MLX adapter ships, treat this bundle as a runtime-pending preview.
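For the BF16 source-side reference path, a hedged sketch of loading the upstream model through the Zyphra fork is below. The Auto classes, processor entry point, and trust_remote_code usage are assumptions about what the fork exposes and should be checked against the fork itself; this loads the upstream Zyphra/ZAYA1-VL-8B weights, not this quantized bundle.

```python
# Hedged sketch of the BF16 reference path via the Zyphra transformers fork.
# Install the fork first:
#   pip install "git+https://github.com/Zyphra/transformers.git@zaya1-vl"
# The Auto classes below are assumptions, not a verified recipe for this model type.
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

repo = "Zyphra/ZAYA1-VL-8B"  # upstream BF16 source, not the quantized bundle
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, trust_remote_code=True
)
```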
Architecture Summary
- 40 hybrid decoder layers: each layer has CCA attention + top-1 MoE
- Hidden size 2048, 8 query heads, 2 KV heads, head dim 128
- 16 routed experts per MoE layer, top-1 routing
- Vision tower: Qwen2.5-VL ViT (`hidden=1280`, `out=2048`, `patch=14`)
- Vision-LoRA on the LM trunk: rank-8 attn, rank-32 MLP, gated to vision tokens only
- Image tokens: `image_token_id=262147`, `start=255999`, `end=256000`
- Context length 32768, `rope_theta=1000000`, partial RoPE (0.5 of head dim)
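The numbers above can be collected into a small reference dict for quick sanity checks. This is a minimal sketch; the field names are illustrative and are not guaranteed to match the keys in config.json.

```python
# Illustrative summary of the architecture numbers listed above, as a plain dict.
# Field names are for illustration only; they may not match config.json keys.
ZAYA1_VL_SUMMARY = {
    "num_layers": 40,            # hybrid decoder layers, CCA attention + top-1 MoE each
    "hidden_size": 2048,
    "num_query_heads": 8,
    "num_kv_heads": 2,
    "head_dim": 128,
    "num_routed_experts": 16,    # top-1 routing per MoE layer
    "vision_hidden_size": 1280,  # Qwen2.5-VL ViT
    "vision_output_size": 2048,
    "patch_size": 14,
    "image_token_id": 262147,
    "vision_start_id": 255999,
    "vision_end_id": 256000,
    "context_length": 32768,
    "rope_theta": 1_000_000,
    "partial_rope_fraction": 0.5,
}

# Simple grouped-query attention check: query heads must divide evenly over KV heads.
assert ZAYA1_VL_SUMMARY["num_query_heads"] % ZAYA1_VL_SUMMARY["num_kv_heads"] == 0
print(f"GQA ratio: {ZAYA1_VL_SUMMARY['num_query_heads'] // ZAYA1_VL_SUMMARY['num_kv_heads']}:1")
```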
Quantization
4-bit MXTQ routed experts + 8-bit affine LM linears + passthrough vision tower / LoRA.
Passthrough floor for first release prep:
- `conv_qk.*`, `temp`, norms, residual scaling, router path, biases, and balancing biases are preserved as float tensors.
- Embeddings and `lm_head` use 8-bit affine in the prepared bundles.
- Vision tower (`vision_tower.*`) and all LoRA tensors (`*.lora_*.[01].weight`) are kept in float passthrough.
- `jangtq_runtime.safetensors` is included: true.
mxtq_bits:
{
"routed_expert": 4,
"attention": 8,
"router": 16,
"embed_tokens": 8,
"lm_head": 8,
"cca_conv": 16,
"norms_residual": 16
}
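A hedged sketch of how this bit-width plan could be applied by tensor-name pattern is below. The name patterns are drawn from the passthrough rules and expert layout described in this card, but the matching logic is an assumption for illustration, not the JANGTQ converter's actual implementation.

```python
# Illustrative mapping from tensor key to planned bit width, following the mxtq_bits
# table and the passthrough rules above. The patterns are assumptions about how the
# converter groups tensors; they do not reproduce the converter itself.
MXTQ_BITS = {
    "routed_expert": 4, "attention": 8, "router": 16,
    "embed_tokens": 8, "lm_head": 8, "cca_conv": 16, "norms_residual": 16,
}

def planned_bits(name: str):
    """Return (group, bits) for a tensor key, or None for float passthrough."""
    if name.startswith("vision_tower.") or ".lora_" in name:
        return None                                   # vision tower + LoRA: float passthrough
    if "experts.switch_mlp" in name:
        return "routed_expert", MXTQ_BITS["routed_expert"]
    if "conv_qk" in name or name.endswith(".temp"):
        return "cca_conv", MXTQ_BITS["cca_conv"]
    if "router" in name:
        return "router", MXTQ_BITS["router"]
    if "embed_tokens" in name:
        return "embed_tokens", MXTQ_BITS["embed_tokens"]
    if "lm_head" in name:
        return "lm_head", MXTQ_BITS["lm_head"]
    if "norm" in name or "bias" in name:
        return "norms_residual", MXTQ_BITS["norms_residual"]
    return "attention", MXTQ_BITS["attention"]

print(planned_bits("model.layers.0.zaya_block.experts.switch_mlp.up_proj.weight"))
print(planned_bits("vision_tower.blocks.0.attn.qkv.weight"))
```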
Bundle Verification
- Safetensor headers scanned.
- Source tensor coverage checked.
- Converted bundles checked for `local_experts` removal.
- Converted expert tensors checked for pre-stacked `switch_mlp` layout.
- JANGTQ sidecars checked for the Swift runtime contract.
- Vision tower + LoRA tensors verified passthrough; image_token_id, vision_start/end preserved.
- Runtime coherence status recorded above.
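Two of these structural checks can be reproduced against a local copy of the bundle. The sketch below assumes only the standard safetensors Python API and that the weight shards carry the .safetensors extension; the local path is hypothetical.

```python
# Minimal sketch of the local_experts-removal and pre-stacked switch_mlp checks.
# Assumes a locally downloaded bundle and the standard safetensors Python API.
import glob
from safetensors import safe_open

bundle_dir = "./ZAYA1-VL-8B-JANGTQ4"  # hypothetical local path
keys = []
for shard in glob.glob(f"{bundle_dir}/*.safetensors"):
    with safe_open(shard, framework="numpy") as f:
        keys.extend(f.keys())

assert not any("local_experts" in k for k in keys), "split per-expert tensors still present"
assert any("experts.switch_mlp" in k for k in keys), "pre-stacked switch_mlp experts missing"
print(f"Scanned {len(keys)} tensor keys")
```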
Runtime Smoke Tests
Before production use, run short deterministic prompts through the exact target runtime:
- "What is 2+2? Answer with only the number."
- "What is the capital of France? Answer with one word."
- One chat-template prompt with thinking disabled.
- One image+text prompt exercising vision-token interleave and the vision-LoRA gate.
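A hedged harness sketch for the two deterministic text prompts is below. The `run_prompt` callable is a hypothetical stand-in for whatever generate call the target ZAYA-aware runtime exposes; greedy, deterministic decoding is assumed.

```python
# Hedged smoke-test harness for the deterministic text prompts listed above.
# `run_prompt` is a hypothetical placeholder for the target runtime's generate call;
# wire it to the actual ZAYA-aware runtime before use.
SMOKE_PROMPTS = [
    ("What is 2+2? Answer with only the number.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]

def run_smoke_tests(run_prompt):
    """run_prompt(prompt: str) -> str, deterministic (greedy) decoding assumed."""
    for prompt, expected in SMOKE_PROMPTS:
        answer = run_prompt(prompt).strip()
        status = "PASS" if expected.lower() in answer.lower() else "FAIL"
        print(f"[{status}] {prompt!r} -> {answer!r}")

# Example with a dummy runtime stub:
if __name__ == "__main__":
    run_smoke_tests(lambda p: "4" if "2+2" in p else "Paris")
```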
The first public bundle release records bundle integrity and runtime contract checks. Full generation quality depends on a ZAYA-aware runtime implementation.
Summary
This bundle is a quantization of Zyphra/ZAYA1-VL-8B for Apple Silicon MLX/JANG runtimes. Use it only with a runtime that correctly implements ZAYA's CCA attention state and MoE routing. Image input goes through the Qwen2.5-VL ViT path, and vision-LoRA is applied only at image-token positions.
Files
- `config.json` carries `weight_format=mxtq`, `zaya_expert_layout=split_switch_mlp`, and a preserved `vision_config`.
- `jang_config.json` carries `cache_subtype=zaya_cca`.
- Tokenizer files and chat template are preserved from the upstream source snapshot.
- `preprocessor_config.json` (Qwen2VLImageProcessor) is included for image input.
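A minimal sketch that checks the fields listed above in a local copy of the bundle. The file names come from this section; the assumption is that each field sits at the top level of its JSON file, and the local path is hypothetical.

```python
# Minimal sketch: verify the config fields listed above in a local bundle copy.
# File names come from this card; the key layout inside each JSON is an assumption.
import json
from pathlib import Path

bundle = Path("./ZAYA1-VL-8B-JANGTQ4")  # hypothetical local path

config = json.loads((bundle / "config.json").read_text())
jang_config = json.loads((bundle / "jang_config.json").read_text())

assert config.get("weight_format") == "mxtq"
assert config.get("zaya_expert_layout") == "split_switch_mlp"
assert "vision_config" in config
assert jang_config.get("cache_subtype") == "zaya_cca"
assert (bundle / "preprocessor_config.json").exists()
print("Bundle config checks passed")
```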