Osaurus-AI committed
Commit ce0df96 · verified · 1 Parent(s): 93d426e

Initial upload: ZAYA1-8B-MXFP4 from Zyphra/ZAYA1-8B

Files changed (3)
  1. README.md +20 -20
  2. config.json +2 -2
  3. jang_config.json +2 -2
README.md CHANGED
@@ -35,25 +35,28 @@ Quantized **Zyphra/ZAYA1-8B** for Apple Silicon runtimes.
  | Source | [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B) |
  | License | Apache-2.0, inherited from upstream |
  | Format | MXFP4 |
+ | Modality | text |
  | Bundle size | 5.48 GiB |
  | Tensor keys | 1965 |
  | Expert layout | Pre-stacked `zaya_block.experts.switch_mlp` |
- | Runtime status | Generation coherence: NOT INDEPENDENTLY PASSED for the quantized runtime bundle (coherence report did not pass); published as a format/runtime bundle pending downstream ZAYA runtime validation. |
+ | Runtime status | Generation coherence: NOT INDEPENDENTLY PASSED for the quantized runtime bundle (missing coherence report); published as a format/runtime bundle pending downstream ZAYA runtime validation. |

  ## Important Runtime Note

  This bundle requires a ZAYA-aware MLX/JANG runtime that implements CCA attention state and the converted pre-stacked expert layout.

- ZAYA is not a stock `mlx_lm` architecture. It alternates CCA attention layers
- and top-1 MoE layers. Use this bundle only with a runtime that implements the
- ZAYA CCA state contract and the converted pre-stacked expert layout.
+ ZAYA is not a stock `mlx_lm` architecture. It alternates CCA attention layers and top-1 MoE layers. Use this bundle only with a runtime that implements the ZAYA CCA state contract and the converted pre-stacked expert layout.
+
+ ## Runtime Pin Required
+
+ Use a `vmlx-swift-lm` build that includes the ZAYA Swift runtime (`Libraries/MLXLLM/Models/Zaya.swift` + `MLXLMCommon/Cache/ZayaCCACache.swift` + `BatchEngine/BatchZayaCCACache.swift`). The first verified pin is commit `b9da180` or newer.
+

  ## Architecture Summary

- - 80 decoder layers: 40 CCA attention layers and 40 top-1 MoE layers
- - Hidden size 2048, 16 query heads, 2 KV heads, head dim 128
- - CCA state per attention layer: standard KV plus `conv_state [B,1280,2]`
- and `prev_hs [B,2048]`
+ - 80 decoder layers: alternating CCA attention and top-1 MoE
+ - Hidden size 2048, 16 query heads, 2 KV heads, head dim 128
+ - CCA state per attention layer: standard KV plus `conv_state [B,1280,2]` and `prev_hs [B,2048]`
  - 16 routed experts per MoE layer, top-1 routing with MOD skip route
  - Context length 131072, `rope_theta=5000000`

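Reviewer note on the Architecture Summary hunk: the CCA state contract is the part a downstream runtime is most likely to get wrong, so here is a minimal Python sketch of the per-layer state implied by the shapes above. The class name, the numpy backing, and the assumption that attention layers sit at even indices are all illustrative; the authoritative contract is `MLXLMCommon/Cache/ZayaCCACache.swift`.

```python
# Illustrative only: shapes taken from the Architecture Summary above.
# The real contract lives in MLXLMCommon/Cache/ZayaCCACache.swift.
from dataclasses import dataclass, field
import numpy as np

B = 1          # batch size
N_KV = 2       # KV heads
HEAD_DIM = 128
HIDDEN = 2048

@dataclass
class CCALayerState:
    """Per-attention-layer state: standard KV cache plus two extra tensors."""
    k: np.ndarray = field(default_factory=lambda: np.zeros((B, N_KV, 0, HEAD_DIM), np.float16))
    v: np.ndarray = field(default_factory=lambda: np.zeros((B, N_KV, 0, HEAD_DIM), np.float16))
    conv_state: np.ndarray = field(default_factory=lambda: np.zeros((B, 1280, 2), np.float16))
    prev_hs: np.ndarray = field(default_factory=lambda: np.zeros((B, HIDDEN), np.float16))

# 80 decoder layers alternate CCA attention and top-1 MoE; only the 40
# attention layers carry CCA state. Which parity is attention is an
# assumption here, not something this bundle specifies.
cache = [CCALayerState() if i % 2 == 0 else None for i in range(80)]
```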
@@ -63,9 +66,9 @@ ZAYA CCA state contract and the converted pre-stacked expert layout.

  Passthrough floor for first release prep:

- - `conv_qk.*`, `temp`, norms, residual scaling, router path, biases, and
- balancing biases are preserved as float tensors.
+ - `conv_qk.*`, `temp`, norms, residual scaling, router path, biases, and balancing biases are preserved as float tensors.
  - Embeddings and `lm_head` use 8-bit affine in the prepared bundles.
+ - Text-only ZAYA1-8B has no vision_tower or LoRA tensors.
  - `jangtq_runtime.safetensors` is not applicable to MXFP4.

  `mxtq_bits`:
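Reviewer note on the passthrough floor: it can be audited without a ZAYA runtime by reading the safetensors JSON header (an 8-byte little-endian length prefix followed by UTF-8 JSON), which records every tensor's dtype and shape without loading weights. The shard name and the key substrings below are assumptions, not part of the bundle contract.

```python
# Sketch: audit which tensors stayed float, without loading any weights,
# by parsing the safetensors header directly. Assumes a single local
# shard named model.safetensors; multi-shard bundles would iterate over
# every *.safetensors file. FLOAT_MARKERS are assumed key substrings.
import json, struct

FLOAT_MARKERS = ("conv_qk", "temp", "norm", "router", "bias")

with open("model.safetensors", "rb") as f:
    (hdr_len,) = struct.unpack("<Q", f.read(8))
    header = json.loads(f.read(hdr_len))

for key, meta in header.items():
    if key == "__metadata__":
        continue
    if any(m in key for m in FLOAT_MARKERS):
        # Passthrough tensors should report a float dtype here.
        print(key, meta["dtype"], meta["shape"])
```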
@@ -81,21 +84,19 @@ null
  - Converted bundles checked for `local_experts` removal.
  - Converted expert tensors checked for pre-stacked `switch_mlp` layout.
  - JANGTQ sidecars checked for the Swift runtime contract.
+ - Capabilities verified: family=zaya, supports_thinking=False, tool_parser=zaya_xml.
  - Runtime coherence status recorded above.

  ## Runtime Smoke Tests

- Before production use, run short deterministic prompts through the exact target
- runtime:
+ Before production use, run short deterministic prompts through the exact target runtime:

  - `What is 2+2? Answer with only the number.`
  - `What is the capital of France? Answer with one word.`
  - One chat-template prompt with thinking disabled.
- - One chat-template prompt with thinking enabled and enough output budget for
- the final answer.
+ - One chat-template prompt with thinking enabled and enough output budget for the final answer.

- The first public bundle release records bundle integrity and runtime contract
- checks. Full generation quality depends on a ZAYA-aware runtime implementation.
+ The first public bundle release records bundle integrity and runtime contract checks. Full generation quality depends on a ZAYA-aware runtime implementation.

  ## Korean Summary

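Reviewer note on the smoke tests: the two deterministic prompts are easy to mechanize. A sketch of a harness follows; `generate` is a placeholder for whatever entry point the target runtime exposes (nothing like it ships in this bundle), and the two chat-template runs still need manual inspection.

```python
# Sketch of a smoke-test harness for the deterministic prompts above.
# `generate` is a stand-in for the real runtime entry point, e.g. a
# subprocess call into a vmlx-swift-lm CLI; it is not part of the bundle.
from typing import Callable

SMOKE_TESTS = [
    ("What is 2+2? Answer with only the number.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]

def run_smoke_tests(generate: Callable[[str], str]) -> bool:
    all_passed = True
    for prompt, expected in SMOKE_TESTS:
        out = generate(prompt).strip()
        passed = expected.lower() in out.lower()
        print(f"{'PASS' if passed else 'FAIL'} {prompt!r} -> {out!r}")
        all_passed = all_passed and passed
    # The chat-template runs (thinking disabled/enabled) still need a
    # manual check that the final answer fits in the output budget.
    return all_passed
```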
@@ -103,8 +104,7 @@ checks. Full generation quality depends on a ZAYA-aware runtime implementation.

  ## Files

- - `config.json` carries `weight_format=mxfp4` and
- `zaya_expert_layout=split_switch_mlp`.
+ - `config.json` carries `weight_format=mxfp4`, `zaya_expert_layout=split_switch_mlp`.
  - `jang_config.json` carries `cache_subtype=zaya_cca`.
- - Tokenizer files and `chat_template.jinja` are preserved from the upstream
- source snapshot.
+ - Tokenizer files and chat template are preserved from the upstream source snapshot.
+
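Reviewer note on the checklist and Files hunks: the recorded invariants are mechanically checkable. A sketch, assuming a single local shard and that `weight_format`, `zaya_expert_layout`, and `cache_subtype` are top-level keys in their respective configs (the README names the keys but not their nesting):

```python
# Sketch: check the invariants this commit records, i.e. no residual
# `local_experts` keys, pre-stacked `switch_mlp` experts, the README's
# tensor-key count, and the config flags from the Files section.
# Single-shard layout and top-level config keys are assumptions.
import json, struct

with open("model.safetensors", "rb") as f:
    (hdr_len,) = struct.unpack("<Q", f.read(8))
    keys = [k for k in json.loads(f.read(hdr_len)) if k != "__metadata__"]

assert not any("local_experts" in k for k in keys), "un-stacked expert keys remain"
assert any("switch_mlp" in k for k in keys), "pre-stacked expert keys missing"
assert len(keys) == 1965, f"expected 1965 tensor keys, found {len(keys)}"

with open("config.json") as f:
    cfg = json.load(f)
assert cfg.get("weight_format") == "mxfp4"
assert cfg.get("zaya_expert_layout") == "split_switch_mlp"

with open("jang_config.json") as f:
    jang = json.load(f)
assert jang.get("cache_subtype") == "zaya_cca"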
config.json CHANGED
@@ -58,9 +58,9 @@
  "tool_parser": "zaya_xml",
  "think_in_template": true,
  "supports_tools": true,
- "supports_thinking": true,
+ "supports_thinking": false,
  "family": "zaya",
  "modality": "text",
  "cache_type": "hybrid"
  }
- }
+ }
jang_config.json CHANGED
@@ -20,9 +20,9 @@
  "tool_parser": "zaya_xml",
  "think_in_template": true,
  "supports_tools": true,
- "supports_thinking": true,
+ "supports_thinking": false,
  "family": "zaya",
  "modality": "text",
  "cache_type": "hybrid"
  }
- }
+ }
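Reviewer note on the two config hunks: they flip the same flag in two files, which is easy to let drift apart in a later commit. A sketch that asserts both files agree on the capability block after this change; since the diff shows only the keys and not their nesting, it searches the parsed JSON instead of hard-coding a path:

```python
# Sketch: after this commit, config.json and jang_config.json must agree
# on the capability flags. The nesting of these keys inside each file is
# not visible in the diff, so we search the parsed JSON tree for them.
import json

EXPECTED = {
    "tool_parser": "zaya_xml",
    "think_in_template": True,
    "supports_tools": True,
    "supports_thinking": False,  # flipped from true in this commit
    "family": "zaya",
    "modality": "text",
    "cache_type": "hybrid",
}

def find_key(node, key):
    """Depth-first search for `key` anywhere in a parsed JSON tree."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        for v in node.values():
            found = find_key(v, key)
            if found is not None:
                return found
    elif isinstance(node, list):
        for v in node:
            found = find_key(v, key)
            if found is not None:
                return found
    return None

for path in ("config.json", "jang_config.json"):
    cfg = json.load(open(path))
    for key, want in EXPECTED.items():
        assert find_key(cfg, key) == want, f"{path}: {key} != {want!r}"
print("capability flags consistent")
```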