---
license: apache-2.0
library_name: mlx
base_model: Zyphra/ZAYA1-8B
base_model_relation: quantized
pipeline_tag: text-generation
tags:
  - zaya
  - mixture-of-experts
  - hybrid-attention
  - cca-attention
  - mlx
  - apple-silicon
  - reasoning
  - tool-use
  - quantized
  - mxfp4
  - jang
  - osaurus
quantization_config:
  family: mxfp4
  profile: MXFP4
  group_size: 32
  expert_layout: split_switch_mlp
---


ZAYA1-8B-MXFP4

Quantized Zyphra/ZAYA1-8B for Apple Silicon runtimes.

| Field | Value |
| --- | --- |
| Source | Zyphra/ZAYA1-8B |
| License | Apache-2.0 (inherited from upstream) |
| Format | MXFP4 |
| Bundle size | 5.48 GiB |
| Tensor keys | 1965 |
| Expert layout | Pre-stacked zaya_block.experts.switch_mlp |
| Runtime status | Generation coherence NOT independently verified for the quantized runtime bundle (the coherence report did not pass). Published as a format/runtime bundle pending downstream ZAYA runtime validation. |

Important Runtime Note

ZAYA is not a stock mlx_lm architecture: it alternates CCA attention layers with top-1 MoE layers. Use this bundle only with a ZAYA-aware MLX/JANG runtime that implements the ZAYA CCA state contract and the converted pre-stacked expert layout.

Architecture Summary

  • 80 decoder layers: 40 CCA attention layers and 40 top-1 MoE layers
  • Hidden size 2048, 16 query heads, 2 KV heads, head dim 128
  • CCA state per attention layer: standard KV plus conv_state [B,1280,2] and prev_hs [B,2048]
  • 16 routed experts per MoE layer, top-1 routing with MOD skip route
  • Context length 131072, rope_theta=5000000
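The per-layer CCA cache described above can be sketched as a plain buffer allocation. This is an illustrative sketch only: the shapes (2 KV heads, head dim 128, conv_state [B, 1280, 2], prev_hs [B, 2048]) come from this card, but the function name, dict layout, and dtype are assumptions, not the runtime's actual API.

```python
import numpy as np

def init_cca_layer_state(batch: int, max_seq: int) -> dict:
    """Hypothetical per-attention-layer CCA state buffers.

    Shapes follow the architecture summary above; everything else
    (names, dtype, container) is an assumption for illustration."""
    return {
        # Standard KV cache for the 2 KV heads with head dim 128
        "keys": np.zeros((batch, 2, max_seq, 128), dtype=np.float16),
        "values": np.zeros((batch, 2, max_seq, 128), dtype=np.float16),
        # Extra recurrent CCA state carried between decode steps
        "conv_state": np.zeros((batch, 1280, 2), dtype=np.float16),
        "prev_hs": np.zeros((batch, 2048), dtype=np.float16),
    }
```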

Quantization

4-bit affine quantization for linear layers (group size 32), 8-bit affine for embeddings, and float passthrough for router and CCA state tensors.

Passthrough floor for the first release:

  • conv_qk.*, temp, norms, residual scaling, router path, biases, and balancing biases are preserved as float tensors.
  • Embeddings and lm_head use 8-bit affine in the prepared bundles.
  • jangtq_runtime.safetensors is not applicable to MXFP4.

mxtq_bits: null
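The 4-bit affine scheme named above can be sketched as group-wise min/max quantization with 16 levels. This is a minimal sketch under stated assumptions: the actual MXFP4 packing, scale dtype, and container format used by the converter are not modeled here.

```python
import numpy as np

def quantize_affine_4bit(w: np.ndarray, group_size: int = 32):
    """Group-wise 4-bit affine (asymmetric) quantization sketch.

    Each group of `group_size` weights gets its own scale and zero
    point; codes are integers in 0..15. Illustrative only."""
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 15.0, 1.0)  # 16 levels for 4 bits
    q = np.clip(np.round((g - lo) / scale), 0, 15).astype(np.uint8)
    deq = (q * scale + lo).reshape(w.shape)           # dequantized weights
    return q, scale, lo, deq
```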

Bundle Verification

  • Safetensor headers scanned.
  • Source tensor coverage checked.
  • Converted bundles checked for local_experts removal.
  • Converted expert tensors checked for pre-stacked switch_mlp layout.
  • JANGTQ sidecars checked for the Swift runtime contract.
  • Runtime coherence status recorded above.
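The header-level checks above can be run without loading any weights, because a .safetensors file begins with an 8-byte little-endian header length followed by a JSON header mapping tensor names to dtype/shape/offsets. A minimal reader sketch (the helper name and key-name checks are illustrative, not part of this bundle's tooling):

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Read only the JSON header of a .safetensors file.

    Layout checks such as absence of local_experts keys or presence of
    pre-stacked zaya_block.experts.switch_mlp keys need only this header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))
```

For example, the pre-stacked layout check reduces to scanning the header keys for "local_experts" (must be absent) and "switch_mlp" (must be present).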

Runtime Smoke Tests

Before production use, run short deterministic prompts through the exact target runtime:

  • What is 2+2? Answer with only the number.
  • What is the capital of France? Answer with one word.
  • One chat-template prompt with thinking disabled.
  • One chat-template prompt with thinking enabled and enough output budget for the final answer.
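A minimal harness for the deterministic prompts above might look like the following. The generate callable is a placeholder for the target runtime's prompt-to-text API, which this card does not specify; wire in your own.

```python
def run_smoke_tests(generate) -> list:
    """Run short deterministic prompts through a runtime's generate()
    callable and return (prompt, output) pairs that failed.

    Prompts are from the smoke-test list above; the harness itself is
    a hypothetical sketch, not shipped with this bundle."""
    checks = [
        ("What is 2+2? Answer with only the number.", "4"),
        ("What is the capital of France? Answer with one word.", "paris"),
    ]
    failures = []
    for prompt, expected in checks:
        out = generate(prompt)
        if expected not in out.strip().lower():
            failures.append((prompt, out))
    return failures
```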

This first public release records bundle-integrity and runtime-contract checks only. Full generation quality depends on a ZAYA-aware runtime implementation.

Summary

This bundle is Zyphra/ZAYA1-8B quantized for Apple Silicon MLX/JANG runtimes. Use it only with a runtime that correctly implements ZAYA's CCA attention state and MoE routing.

Files

  • config.json carries weight_format=mxfp4 and zaya_expert_layout=split_switch_mlp.
  • jang_config.json carries cache_subtype=zaya_cca.
  • Tokenizer files and chat_template.jinja are preserved from the upstream source snapshot.
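A hedged pre-flight check for the fields listed above, assuming standard JSON files at the bundle root (the key names and expected values are from this card; the helper itself is illustrative):

```python
import json
from pathlib import Path

def check_bundle_configs(bundle_dir: str) -> None:
    """Verify the ZAYA-specific markers in the bundle's config files.

    Raises AssertionError if any expected marker is missing."""
    bundle = Path(bundle_dir)
    cfg = json.loads((bundle / "config.json").read_text())
    jang = json.loads((bundle / "jang_config.json").read_text())
    assert cfg.get("weight_format") == "mxfp4"
    assert cfg.get("zaya_expert_layout") == "split_switch_mlp"
    assert jang.get("cache_subtype") == "zaya_cca"
```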