Initial upload: ZAYA1-8B-MXFP4 from Zyphra/ZAYA1-8B
- README.md +20 -20
- config.json +2 -2
- jang_config.json +2 -2
README.md
CHANGED

@@ -35,25 +35,28 @@ Quantized **Zyphra/ZAYA1-8B** for Apple Silicon runtimes.
 | Source | [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B) |
 | License | Apache-2.0, inherited from upstream |
 | Format | MXFP4 |
+| Modality | text |
 | Bundle size | 5.48 GiB |
 | Tensor keys | 1965 |
 | Expert layout | Pre-stacked `zaya_block.experts.switch_mlp` |
-| Runtime status | Generation coherence: NOT INDEPENDENTLY PASSED for the quantized runtime bundle (coherence report
+| Runtime status | Generation coherence: NOT INDEPENDENTLY PASSED for the quantized runtime bundle (missing coherence report); published as a format/runtime bundle pending downstream ZAYA runtime validation. |

 ## Important Runtime Note

 This bundle requires a ZAYA-aware MLX/JANG runtime that implements CCA attention state and the converted pre-stacked expert layout.

-ZAYA is not a stock `mlx_lm` architecture. It alternates CCA attention layers
-and top-1 MoE layers. Use this bundle only with a runtime that implements the
-ZAYA CCA state contract and the converted pre-stacked expert layout.
+ZAYA is not a stock `mlx_lm` architecture. It alternates CCA attention layers and top-1 MoE layers. Use this bundle only with a runtime that implements the ZAYA CCA state contract and the converted pre-stacked expert layout.
+
+## Runtime Pin Required
+
+Use a `vmlx-swift-lm` build that includes the ZAYA Swift runtime (`Libraries/MLXLLM/Models/Zaya.swift` + `MLXLMCommon/Cache/ZayaCCACache.swift` + `BatchEngine/BatchZayaCCACache.swift`). The first verified pin is commit `b9da180` or newer.
+

 ## Architecture Summary

-- 80 decoder layers:
-- Hidden size 2048, 16 query heads, 2 KV heads, head dim
-- CCA state per attention layer: standard KV plus `conv_state [B,1280,2]`
-  and `prev_hs [B,2048]`
+- 80 decoder layers: alternating CCA attention and top-1 MoE
+- Hidden size 2048, 16 query heads, 2 KV heads, head dim ?
+- CCA state per attention layer: standard KV plus `conv_state [B,1280,2]` and `prev_hs [B,2048]`
 - 16 routed experts per MoE layer, top-1 routing with MOD skip route
 - Context length 131072, `rope_theta=5000000`

@@ -63,9 +66,9 @@ ZAYA CCA state contract and the converted pre-stacked expert layout.

 Passthrough floor for first release prep:

-- `conv_qk.*`, `temp`, norms, residual scaling, router path, biases, and
-  balancing biases are preserved as float tensors.
+- `conv_qk.*`, `temp`, norms, residual scaling, router path, biases, and balancing biases are preserved as float tensors.
 - Embeddings and `lm_head` use 8-bit affine in the prepared bundles.
+- Text-only ZAYA1-8B has no vision_tower or LoRA tensors.
 - `jangtq_runtime.safetensors` is not applicable to MXFP4.

 `mxtq_bits`:
@@ -81,21 +84,19 @@ null
 - Converted bundles checked for `local_experts` removal.
 - Converted expert tensors checked for pre-stacked `switch_mlp` layout.
 - JANGTQ sidecars checked for the Swift runtime contract.
+- Capabilities verified: family=zaya, supports_thinking=False, tool_parser=zaya_xml.
 - Runtime coherence status recorded above.

 ## Runtime Smoke Tests

-Before production use, run short deterministic prompts through the exact target
-runtime:
+Before production use, run short deterministic prompts through the exact target runtime:

 - `What is 2+2? Answer with only the number.`
 - `What is the capital of France? Answer with one word.`
 - One chat-template prompt with thinking disabled.
-- One chat-template prompt with thinking enabled and enough output budget for
-  the final answer.
+- One chat-template prompt with thinking enabled and enough output budget for the final answer.

-The first public bundle release records bundle integrity and runtime contract
-checks. Full generation quality depends on a ZAYA-aware runtime implementation.
+The first public bundle release records bundle integrity and runtime contract checks. Full generation quality depends on a ZAYA-aware runtime implementation.

 ## Korean Summary

@@ -103,8 +104,7 @@ checks. Full generation quality depends on a ZAYA-aware runtime implementation.

 ## Files

-- `config.json` carries `weight_format=mxfp4`
-  `zaya_expert_layout=split_switch_mlp`.
+- `config.json` carries `weight_format=mxfp4`, `zaya_expert_layout=split_switch_mlp`.
 - `jang_config.json` carries `cache_subtype=zaya_cca`.
-- Tokenizer files and
-  chat template are preserved from the upstream source snapshot.
+- Tokenizer files and chat template are preserved from the upstream source snapshot.
+
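The Architecture Summary in the diff above pins a concrete per-layer cache contract: standard KV entries plus `conv_state [B,1280,2]` and `prev_hs [B,2048]`. As a reading aid, here is a minimal Swift sketch of that state under a plain nested-array representation; the type and property names are illustrative stand-ins, not the actual `ZayaCCACache` API from `vmlx-swift-lm`.

```swift
/// Illustrative per-attention-layer CCA state. Only the shapes
/// (`conv_state [B,1280,2]`, `prev_hs [B,2048]`) come from the README;
/// the names and the nested-array layout are hypothetical.
struct ZayaCCALayerState {
    /// Short-convolution carry, shape [B, 1280, 2], zero-initialized.
    var convState: [[[Float]]]
    /// Previous hidden state for the CCA recurrence, shape [B, 2048].
    var prevHiddenState: [[Float]]

    init(batchSize: Int) {
        convState = Array(
            repeating: Array(repeating: [Float](repeating: 0, count: 2), count: 1280),
            count: batchSize)
        prevHiddenState = Array(
            repeating: [Float](repeating: 0, count: 2048),
            count: batchSize)
    }
}
```

A real runtime would hold this state next to the KV cache for each CCA attention layer and update it on every decode step; the sketch only fixes the shape contract the bundle assumes.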
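The Runtime Smoke Tests section above is straightforward to script. Below is a sketch of the two deterministic checks; the prompts and expected answers come from the README, while `generate` is a placeholder closure for whatever entry point the pinned ZAYA-aware runtime exposes, not a real `vmlx-swift-lm` API.

```swift
import Foundation

/// Runs the two deterministic smoke prompts from the README against a
/// caller-supplied `generate` function (hypothetical; plug in the real
/// runtime's text-generation entry point).
func runSmokeTests(generate: (String) -> String) -> Bool {
    let cases: [(prompt: String, expected: String)] = [
        ("What is 2+2? Answer with only the number.", "4"),
        ("What is the capital of France? Answer with one word.", "Paris"),
    ]
    for testCase in cases {
        let output = generate(testCase.prompt)
            .trimmingCharacters(in: .whitespacesAndNewlines)
        guard output == testCase.expected else { return false }
    }
    return true
}
```

The two chat-template prompts (thinking disabled and enabled) still call for manual inspection, since their outputs are free-form.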
config.json
CHANGED

@@ -58,9 +58,9 @@
     "tool_parser": "zaya_xml",
     "think_in_template": true,
     "supports_tools": true,
-    "supports_thinking":
+    "supports_thinking": false,
     "family": "zaya",
     "modality": "text",
     "cache_type": "hybrid"
   }
-}
+}
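The `supports_thinking` fix is more than a semantic change: the old line, as shown in the diff, appears to carry no value at all, which would make the file invalid JSON for any strict parser. As a sketch of how a consumer might read the capability block, assuming a plain `Codable` decode (the struct name is invented; the keys and values match the diff above):

```swift
import Foundation

/// Hypothetical decoder for the capability block shown in the diff.
struct ZayaCapabilities: Codable {
    let toolParser: String      // "zaya_xml"
    let thinkInTemplate: Bool   // true
    let supportsTools: Bool     // true
    let supportsThinking: Bool  // false, the value this commit fills in
    let family: String          // "zaya"
    let modality: String        // "text"
    let cacheType: String       // "hybrid"

    enum CodingKeys: String, CodingKey {
        case toolParser = "tool_parser"
        case thinkInTemplate = "think_in_template"
        case supportsTools = "supports_tools"
        case supportsThinking = "supports_thinking"
        case family, modality
        case cacheType = "cache_type"
    }
}

// Against the pre-fix file, `JSONDecoder().decode(...)` would throw,
// because `"supports_thinking":` with no value is a JSON syntax error.
```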
jang_config.json
CHANGED

@@ -20,9 +20,9 @@
     "tool_parser": "zaya_xml",
     "think_in_template": true,
     "supports_tools": true,
-    "supports_thinking":
+    "supports_thinking": false,
     "family": "zaya",
     "modality": "text",
     "cache_type": "hybrid"
   }
-}
+}
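`jang_config.json` gets the identical fix, and per the README's Files section it also carries `cache_subtype=zaya_cca`. A minimal sketch of the load-time guard such a key enables, assuming the config is read into a generic dictionary; the function and error type are illustrative, not part of the actual Swift runtime:

```swift
enum ZayaBundleError: Error {
    case wrongCacheSubtype(found: String?)
}

/// Hypothetical load-time check: refuse the bundle unless the config
/// advertises the `zaya_cca` cache contract this bundle requires.
func validateCacheContract(jangConfig: [String: Any]) throws {
    let subtype = jangConfig["cache_subtype"] as? String
    guard subtype == "zaya_cca" else {
        throw ZayaBundleError.wrongCacheSubtype(found: subtype)
    }
}
```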