---
license: apache-2.0
library_name: mlx
base_model: Zyphra/ZAYA1-8B
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- zaya
- mixture-of-experts
- hybrid-attention
- cca-attention
- mlx
- apple-silicon
- reasoning
- tool-use
- quantized
- mxfp4
- jang
- osaurus
quantization_config:
family: mxfp4
profile: MXFP4
group_size: 32
expert_layout: split_switch_mlp
---
<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>
# ZAYA1-8B-MXFP4
Quantized **Zyphra/ZAYA1-8B** for Apple Silicon runtimes.
| Field | Value |
|---|---|
| Source | [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B) |
| License | Apache-2.0, inherited from upstream |
| Format | MXFP4 |
| Bundle size | 5.48 GiB |
| Tensor keys | 1965 |
| Expert layout | Pre-stacked `zaya_block.experts.switch_mlp` |
| Runtime status | Generation coherence has not been independently verified for this quantized bundle (the coherence report did not pass); published as a format/runtime bundle pending downstream ZAYA runtime validation. |
## Important Runtime Note
This bundle requires a ZAYA-aware MLX/JANG runtime. ZAYA is not a stock `mlx_lm` architecture: it alternates CCA attention layers with top-1 MoE layers, so use this bundle only with a runtime that implements the ZAYA CCA state contract and the converted pre-stacked expert layout.
## Architecture Summary
- 80 decoder layers: 40 CCA attention layers and 40 top-1 MoE layers
- Hidden size 2048, 16 query heads, 2 KV heads, head dim 128
- CCA state per attention layer: standard KV plus `conv_state [B,1280,2]`
  and `prev_hs [B,2048]` (see the cache sketch after this list)
- 16 routed experts per MoE layer, top-1 routing with MOD skip route
- Context length 131072, `rope_theta=5000000`
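The extra per-layer state is the main thing a runtime must carry beyond a standard KV cache. Below is a minimal sketch of such a container; the class and field names are illustrative assumptions, not the actual ZAYA/JANG runtime API, and only the shapes come from the list above.

```python
# Illustrative only: a container for the per-attention-layer state described above.
# Class and field names are assumptions, not the actual ZAYA/JANG runtime contract.
from dataclasses import dataclass
from typing import Optional

import mlx.core as mx


@dataclass
class ZayaCCALayerState:
    batch: int = 1
    # Standard KV cache entries, appended along the sequence axis during decode.
    keys: Optional[mx.array] = None    # [B, n_kv_heads=2, T, head_dim=128]
    values: Optional[mx.array] = None  # [B, n_kv_heads=2, T, head_dim=128]
    # CCA-specific state carried between decode steps.
    conv_state: Optional[mx.array] = None  # [B, 1280, 2]
    prev_hs: Optional[mx.array] = None     # [B, 2048]

    def __post_init__(self):
        if self.conv_state is None:
            self.conv_state = mx.zeros((self.batch, 1280, 2))
        if self.prev_hs is None:
            self.prev_hs = mx.zeros((self.batch, 2048))


# One state object per CCA attention layer (40 of the 80 decoder layers).
cca_states = [ZayaCCALayerState(batch=1) for _ in range(40)]
```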
## Quantization
Linear layers use 4-bit affine quantization, embeddings use 8-bit affine quantization, and router/CCA state tensors are passed through in float.
Float passthrough set for the first release:
- `conv_qk.*`, `temp`, norms, residual scaling, router path, biases, and
balancing biases are preserved as float tensors.
- Embeddings and `lm_head` use 8-bit affine in the prepared bundles.
- `jangtq_runtime.safetensors` is not applicable to MXFP4.
`mxtq_bits`:
```json
null
```
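For intuition on the group-wise 4-bit affine scheme applied to the linear layers, here is a minimal round-trip sketch using `mlx.core.quantize` with `group_size=32` and `bits=4`. The on-disk MXFP4 packing in this bundle may differ from MLX's default affine layout, so treat this purely as an illustration of the error behavior, not the exact format.

```python
import mlx.core as mx

# A toy weight matrix standing in for one linear layer.
w = mx.random.normal((2048, 2048))

# Group-wise 4-bit affine quantization: each group of 32 values along the
# last axis gets its own scale and bias.
w_q, scales, biases = mx.quantize(w, group_size=32, bits=4)

# Dequantize and measure the reconstruction error.
w_hat = mx.dequantize(w_q, scales, biases, group_size=32, bits=4)
err = mx.abs(w - w_hat).max()
print(f"max abs reconstruction error: {err.item():.4f}")
```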
## Bundle Verification
- Safetensor headers scanned.
- Source tensor coverage checked.
- Converted bundles checked for `local_experts` removal.
- Converted expert tensors checked for pre-stacked `switch_mlp` layout.
- JANGTQ sidecars checked for the Swift runtime contract.
- Runtime coherence status recorded above.
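A rough version of the header scan and expert-layout checks can be run locally by parsing the safetensors header directly (an 8-byte little-endian length followed by a JSON blob). The glob pattern and the expected key substrings below are assumptions based on the description above; adjust them to the actual shard names in the bundle.

```python
import json
import struct
from pathlib import Path


def read_safetensors_keys(path: Path) -> list[str]:
    """Return tensor names from a .safetensors file without loading any data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return [k for k in header if k != "__metadata__"]


keys: list[str] = []
for shard in sorted(Path(".").glob("*.safetensors")):
    keys.extend(read_safetensors_keys(shard))

# The converted bundle should expose pre-stacked experts and no per-expert keys.
assert not any("local_experts" in k for k in keys), "found un-converted expert keys"
assert any("switch_mlp" in k for k in keys), "missing pre-stacked switch_mlp tensors"
print(f"{len(keys)} tensor keys checked")
```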
## Runtime Smoke Tests
Before production use, run short deterministic prompts through the exact target
runtime:
- `What is 2+2? Answer with only the number.`
- `What is the capital of France? Answer with one word.`
- One chat-template prompt with thinking disabled.
- One chat-template prompt with thinking enabled and enough output budget for
the final answer.
The first public bundle release records bundle integrity and runtime contract
checks. Full generation quality depends on a ZAYA-aware runtime implementation.
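A small harness for the first two checks is sketched below. It deliberately takes the generation call as a parameter (`generate_fn`) because this bundle only runs on a ZAYA-aware runtime, and that runtime's loading and generation API is not specified here; the chat-template prompts with thinking disabled/enabled should be exercised the same way.

```python
from typing import Callable

# Deterministic smoke prompts and the substrings a coherent model should produce.
SMOKE_CASES = [
    ("What is 2+2? Answer with only the number.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]


def run_smoke_tests(generate_fn: Callable[[str], str]) -> None:
    """generate_fn wraps whatever ZAYA-aware runtime you are validating."""
    for prompt, expected in SMOKE_CASES:
        output = generate_fn(prompt)
        status = "PASS" if expected.lower() in output.lower() else "FAIL"
        print(f"[{status}] {prompt!r} -> {output!r}")


# Example (hypothetical runtime object):
# run_smoke_tests(lambda p: my_zaya_runtime.generate(p, max_tokens=16))
```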
## Summary
This bundle is Zyphra/ZAYA1-8B quantized for Apple Silicon MLX/JANG runtimes. Use it only with a runtime that correctly implements ZAYA's CCA attention state and MoE routing.
## Files
- `config.json` carries `weight_format=mxfp4` and
`zaya_expert_layout=split_switch_mlp`.
- `jang_config.json` carries `cache_subtype=zaya_cca`.
- Tokenizer files and `chat_template.jinja` are preserved from the upstream
source snapshot.
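The advertised metadata fields can be checked directly from the bundle directory. A short sketch, assuming the file names listed above and a working directory at the bundle root:

```python
import json
from pathlib import Path

bundle = Path(".")

config = json.loads((bundle / "config.json").read_text())
assert config.get("weight_format") == "mxfp4"
assert config.get("zaya_expert_layout") == "split_switch_mlp"

jang_config = json.loads((bundle / "jang_config.json").read_text())
assert jang_config.get("cache_subtype") == "zaya_cca"

print("bundle metadata looks consistent")
```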