| --- |
| license: apache-2.0 |
| library_name: mlx |
| base_model: Zyphra/ZAYA1-8B |
| base_model_relation: quantized |
| pipeline_tag: text-generation |
| tags: |
| - zaya |
| - mixture-of-experts |
| - hybrid-attention |
| - cca-attention |
| - mlx |
| - apple-silicon |
| - reasoning |
| - tool-use |
| - quantized |
| - mxfp4 |
| - jang |
| - osaurus |
| quantization_config: |
| family: mxfp4 |
| profile: MXFP4 |
| group_size: 32 |
| expert_layout: split_switch_mlp |
| --- |
| |
| <p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p> |
|
|
| # ZAYA1-8B-MXFP4 |
|
|
| Quantized **Zyphra/ZAYA1-8B** for Apple Silicon runtimes. |
|
|
| | | | |
| |---|---| |
| | Source | [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B) | |
| | License | Apache-2.0, inherited from upstream | |
| | Format | MXFP4 | |
| | Bundle size | 5.48 GiB | |
| | Tensor keys | 1965 | |
| | Expert layout | Pre-stacked `zaya_block.experts.switch_mlp` | |
| Runtime status | Generation coherence not independently verified for the quantized bundle (the coherence report did not pass); published as a format/runtime bundle pending downstream ZAYA runtime validation. |
|
|
| ## Important Runtime Note |
|
|
This bundle requires a ZAYA-aware MLX/JANG runtime that implements the CCA
attention state contract and the converted pre-stacked expert layout.


ZAYA is not a stock `mlx_lm` architecture: it alternates CCA attention layers
with top-1 MoE layers, so a generic runtime will not load or run it correctly.
|
|
| ## Architecture Summary |
|
|
| - 80 decoder layers: 40 CCA attention layers and 40 top-1 MoE layers |
| - Hidden size 2048, 16 query heads, 2 KV heads, head dim 128 |
| - CCA state per attention layer: standard KV plus `conv_state [B,1280,2]` |
| and `prev_hs [B,2048]` |
| - 16 routed experts per MoE layer, top-1 routing with MOD skip route |
| - Context length 131072, `rope_theta=5000000` |
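The per-layer cache shapes above can be sketched as follows. This is an illustrative data layout only, assuming the state tensors listed in the summary; the names (`CCALayerState`, `k_shape`, etc.) are not part of any runtime API.

```python
from dataclasses import dataclass

B = 1  # batch size (illustrative)

@dataclass
class CCALayerState:
    """Hypothetical per-CCA-layer cache, shaped per the architecture summary."""
    # Standard KV cache: [B, kv_heads, seq, head_dim]; seq grows during decode.
    k_shape: tuple = (B, 2, 0, 128)
    v_shape: tuple = (B, 2, 0, 128)
    # CCA extras carried between decode steps.
    conv_state_shape: tuple = (B, 1280, 2)
    prev_hs_shape: tuple = (B, 2048)

# 80 decoder layers alternating CCA attention and top-1 MoE (order illustrative).
layers = ["cca" if i % 2 == 0 else "moe" for i in range(80)]
assert layers.count("cca") == 40 and layers.count("moe") == 40
```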
|
|
| ## Quantization |
|
|
Linear weights use 4-bit affine quantization (group size 32), embeddings use
8-bit affine, and router/CCA state tensors pass through unquantized.
|
|
Passthrough floor for the first release:
|
|
| - `conv_qk.*`, `temp`, norms, residual scaling, router path, biases, and |
| balancing biases are preserved as float tensors. |
| - Embeddings and `lm_head` use 8-bit affine in the prepared bundles. |
| - `jangtq_runtime.safetensors` is not applicable to MXFP4. |
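The passthrough rules above can be expressed as a simple per-key decision. This is a hedged sketch: the pattern list is assembled from the tensor names mentioned in this card, and the exact key strings in the bundle may differ.

```python
# Illustrative passthrough rule: router/CCA/norm/bias tensors stay float,
# embeddings and lm_head get 8-bit affine, remaining linears get 4-bit MXFP4.
FLOAT_PATTERNS = ("conv_qk.", ".temp", "norm", "router", ".bias")

def quant_plan(key: str) -> str:
    """Return the hypothetical quantization class for a tensor key."""
    if any(p in key for p in FLOAT_PATTERNS):
        return "float"          # passthrough, preserved as-is
    if key.startswith(("embed_tokens", "lm_head")):
        return "affine-8bit"
    return "affine-4bit"        # MXFP4, group_size=32

assert quant_plan("model.layers.0.router.weight") == "float"
assert quant_plan("lm_head.weight") == "affine-8bit"
assert quant_plan("model.layers.1.zaya_block.experts.switch_mlp.w1") == "affine-4bit"
```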
|
|
`mxtq_bits` is unset for this profile:
|
|
| ```json |
| null |
| ``` |
|
|
| ## Bundle Verification |
|
|
| - Safetensor headers scanned. |
| - Source tensor coverage checked. |
| - Converted bundles checked for `local_experts` removal. |
| - Converted expert tensors checked for pre-stacked `switch_mlp` layout. |
| - JANGTQ sidecars checked for the Swift runtime contract. |
| - Runtime coherence status recorded above. |
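The expert-layout check above can be sketched as a predicate over tensor key names: converted bundles should expose pre-stacked `switch_mlp` tensors and no per-expert `local_experts` keys. The example key strings below are illustrative, not exact keys from the bundle.

```python
def check_expert_layout(keys: list) -> bool:
    """Hypothetical check: pre-stacked switch_mlp present, local_experts absent."""
    has_stacked = any("zaya_block.experts.switch_mlp" in k for k in keys)
    has_split = any("local_experts" in k for k in keys)
    return has_stacked and not has_split

converted = ["model.layers.1.zaya_block.experts.switch_mlp.up_proj.weight"]
legacy = ["model.layers.1.zaya_block.experts.local_experts.0.up_proj.weight"]
assert check_expert_layout(converted) is True
assert check_expert_layout(legacy) is False
```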
|
|
| ## Runtime Smoke Tests |
|
|
| Before production use, run short deterministic prompts through the exact target |
| runtime: |
|
|
| - `What is 2+2? Answer with only the number.` |
| - `What is the capital of France? Answer with one word.` |
| - One chat-template prompt with thinking disabled. |
| - One chat-template prompt with thinking enabled and enough output budget for |
| the final answer. |
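The deterministic prompts above can be wired into a minimal harness like the one below. `generate` is a stand-in for the actual ZAYA-aware runtime call, which this card does not specify; the stub at the bottom exists only to make the sketch self-contained.

```python
# Hedged sketch of a smoke-test loop over the prompts listed above.
SMOKE_PROMPTS = {
    "What is 2+2? Answer with only the number.": "4",
    "What is the capital of France? Answer with one word.": "Paris",
}

def run_smoke(generate):
    """Run each prompt through `generate` and check the expected token appears."""
    results = {}
    for prompt, expected in SMOKE_PROMPTS.items():
        out = generate(prompt).strip()
        results[prompt] = expected.lower() in out.lower()
    return results

# Stub runtime for illustration only; replace with the real target runtime.
canned = {
    "What is 2+2? Answer with only the number.": "4",
    "What is the capital of France? Answer with one word.": "Paris.",
}
results = run_smoke(lambda p: canned[p])
assert all(results.values())
```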
|
|
| The first public bundle release records bundle integrity and runtime contract |
| checks. Full generation quality depends on a ZAYA-aware runtime implementation. |
|
|
## Summary


This bundle is Zyphra/ZAYA1-8B quantized for Apple Silicon MLX/JANG runtimes.
Use it only with a runtime that exactly implements ZAYA's CCA attention state
and MoE routing.
|
|
| ## Files |
|
|
| - `config.json` carries `weight_format=mxfp4` and |
| `zaya_expert_layout=split_switch_mlp`. |
| - `jang_config.json` carries `cache_subtype=zaya_cca`. |
| - Tokenizer files and `chat_template.jinja` are preserved from the upstream |
| source snapshot. |
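The config keys listed above lend themselves to a pre-load sanity check. This is a minimal sketch under the assumption that `config.json` exposes exactly the two keys named in this section; the function name and error strings are illustrative.

```python
def check_bundle_config(config: dict) -> list:
    """Return a list of problems with the bundle config (empty = OK)."""
    problems = []
    if config.get("weight_format") != "mxfp4":
        problems.append("expected weight_format=mxfp4")
    if config.get("zaya_expert_layout") != "split_switch_mlp":
        problems.append("expected zaya_expert_layout=split_switch_mlp")
    return problems

# In practice the dict would come from json.load(open("config.json")).
sample = {"weight_format": "mxfp4", "zaya_expert_layout": "split_switch_mlp"}
assert check_bundle_config(sample) == []
```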
|
|