Initial upload: ZAYA1-8B-MXFP4 from Zyphra/ZAYA1-8B
- README.md +20 -20
- config.json +2 -2
- jang_config.json +2 -2
README.md
CHANGED

@@ -35,25 +35,28 @@ Quantized **Zyphra/ZAYA1-8B** for Apple Silicon runtimes.
 | Source | [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B) |
 | License | Apache-2.0, inherited from upstream |
 | Format | MXFP4 |
+| Modality | text |
 | Bundle size | 5.48 GiB |
 | Tensor keys | 1965 |
 | Expert layout | Pre-stacked `zaya_block.experts.switch_mlp` |
-| Runtime status | Generation coherence: NOT INDEPENDENTLY PASSED for the quantized runtime bundle (coherence report
+| Runtime status | Generation coherence: NOT INDEPENDENTLY PASSED for the quantized runtime bundle (missing coherence report); published as a format/runtime bundle pending downstream ZAYA runtime validation. |

 ## Important Runtime Note

 This bundle requires a ZAYA-aware MLX/JANG runtime that implements CCA attention state and the converted pre-stacked expert layout.

-ZAYA is not a stock `mlx_lm` architecture. It alternates CCA attention layers
-and top-1 MoE layers. Use this bundle only with a runtime that implements the
-ZAYA CCA state contract and the converted pre-stacked expert layout.
+ZAYA is not a stock `mlx_lm` architecture. It alternates CCA attention layers and top-1 MoE layers. Use this bundle only with a runtime that implements the ZAYA CCA state contract and the converted pre-stacked expert layout.
+
+## Runtime Pin Required
+
+Use a `vmlx-swift-lm` build that includes the ZAYA Swift runtime (`Libraries/MLXLLM/Models/Zaya.swift` + `MLXLMCommon/Cache/ZayaCCACache.swift` + `BatchEngine/BatchZayaCCACache.swift`). The first verified pin is commit `b9da180` or newer.
+

 ## Architecture Summary

-- 80 decoder layers:
-- Hidden size 2048, 16 query heads, 2 KV heads, head dim
-- CCA state per attention layer: standard KV plus `conv_state [B,1280,2]`
-  and `prev_hs [B,2048]`
+- 80 decoder layers: alternating CCA attention and top-1 MoE
+- Hidden size 2048, 16 query heads, 2 KV heads, head dim ?
+- CCA state per attention layer: standard KV plus `conv_state [B,1280,2]` and `prev_hs [B,2048]`
 - 16 routed experts per MoE layer, top-1 routing with MOD skip route
 - Context length 131072, `rope_theta=5000000`

@@ -63,9 +66,9 @@ ZAYA CCA state contract and the converted pre-stacked expert layout.

 Passthrough floor for first release prep:

-- `conv_qk.*`, `temp`, norms, residual scaling, router path, biases, and
-  balancing biases are preserved as float tensors.
+- `conv_qk.*`, `temp`, norms, residual scaling, router path, biases, and balancing biases are preserved as float tensors.
 - Embeddings and `lm_head` use 8-bit affine in the prepared bundles.
+- Text-only ZAYA1-8B has no vision_tower or LoRA tensors.
 - `jangtq_runtime.safetensors` is not applicable to MXFP4.

 `mxtq_bits`:
@@ -81,21 +84,19 @@ null
 - Converted bundles checked for `local_experts` removal.
 - Converted expert tensors checked for pre-stacked `switch_mlp` layout.
 - JANGTQ sidecars checked for the Swift runtime contract.
+- Capabilities verified: family=zaya, supports_thinking=False, tool_parser=zaya_xml.
 - Runtime coherence status recorded above.

 ## Runtime Smoke Tests

-Before production use, run short deterministic prompts through the exact target
-runtime:
+Before production use, run short deterministic prompts through the exact target runtime:

 - `What is 2+2? Answer with only the number.`
 - `What is the capital of France? Answer with one word.`
 - One chat-template prompt with thinking disabled.
-- One chat-template prompt with thinking enabled and enough output budget for
-  the final answer.
+- One chat-template prompt with thinking enabled and enough output budget for the final answer.

-The first public bundle release records bundle integrity and runtime contract
-checks. Full generation quality depends on a ZAYA-aware runtime implementation.
+The first public bundle release records bundle integrity and runtime contract checks. Full generation quality depends on a ZAYA-aware runtime implementation.

 ## Korean Summary

@@ -103,8 +104,7 @@ checks. Full generation quality depends on a ZAYA-aware runtime implementation.

 ## Files

-- `config.json` carries `weight_format=mxfp4`
-  `zaya_expert_layout=split_switch_mlp`.
+- `config.json` carries `weight_format=mxfp4`, `zaya_expert_layout=split_switch_mlp`.
 - `jang_config.json` carries `cache_subtype=zaya_cca`.
-- Tokenizer files and
-  chat template are preserved from the upstream source snapshot.
+- Tokenizer files and chat template are preserved from the upstream source snapshot.
+
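The Architecture Summary in the diff above pins a concrete per-layer cache contract: standard KV entries plus `conv_state [B,1280,2]` and `prev_hs [B,2048]`. As a reading aid, here is a minimal Swift sketch of that state under a plain nested-array representation; the type and property names are illustrative stand-ins, not the actual `ZayaCCACache` API from `vmlx-swift-lm`.

```swift
/// Illustrative per-attention-layer CCA state. Only the shapes
/// (`conv_state [B,1280,2]`, `prev_hs [B,2048]`) come from the README;
/// the names and the nested-array layout are hypothetical.
struct ZayaCCALayerState {
    /// Short-convolution carry, shape [B, 1280, 2], zero-initialized.
    var convState: [[[Float]]]
    /// Previous hidden state for the CCA recurrence, shape [B, 2048].
    var prevHiddenState: [[Float]]

    init(batchSize: Int) {
        convState = Array(
            repeating: Array(repeating: [Float](repeating: 0, count: 2), count: 1280),
            count: batchSize)
        prevHiddenState = Array(
            repeating: [Float](repeating: 0, count: 2048),
            count: batchSize)
    }
}
```

A real runtime would hold this state next to the KV cache for each CCA attention layer and update it on every decode step; the sketch only fixes the shape contract the bundle assumes.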
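The Runtime Smoke Tests section above is straightforward to script. Below is a sketch of the two deterministic checks; the prompts and expected answers come from the README, while `generate` is a placeholder closure for whatever entry point the pinned ZAYA-aware runtime exposes, not a real `vmlx-swift-lm` API.

```swift
import Foundation

/// Runs the two deterministic smoke prompts from the README against a
/// caller-supplied `generate` function (hypothetical; plug in the real
/// runtime's text-generation entry point).
func runSmokeTests(generate: (String) -> String) -> Bool {
    let cases: [(prompt: String, expected: String)] = [
        ("What is 2+2? Answer with only the number.", "4"),
        ("What is the capital of France? Answer with one word.", "Paris"),
    ]
    for testCase in cases {
        let output = generate(testCase.prompt)
            .trimmingCharacters(in: .whitespacesAndNewlines)
        guard output == testCase.expected else { return false }
    }
    return true
}
```

The two chat-template prompts (thinking disabled and enabled) still call for manual inspection, since their outputs are free-form.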
config.json
CHANGED

@@ -58,9 +58,9 @@
     "tool_parser": "zaya_xml",
     "think_in_template": true,
     "supports_tools": true,
-    "supports_thinking":
+    "supports_thinking": false,
     "family": "zaya",
     "modality": "text",
     "cache_type": "hybrid"
   }
-}
+}
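The `supports_thinking` fix is more than a semantic change: the old line, as shown in the diff, appears to carry no value at all, which would make the file invalid JSON for any strict parser. As a sketch of how a consumer might read the capability block, assuming a plain `Codable` decode (the struct name is invented; the keys and values match the diff above):

```swift
import Foundation

/// Hypothetical decoder for the capability block shown in the diff.
struct ZayaCapabilities: Codable {
    let toolParser: String      // "zaya_xml"
    let thinkInTemplate: Bool   // true
    let supportsTools: Bool     // true
    let supportsThinking: Bool  // false, the value this commit fills in
    let family: String          // "zaya"
    let modality: String        // "text"
    let cacheType: String       // "hybrid"

    enum CodingKeys: String, CodingKey {
        case toolParser = "tool_parser"
        case thinkInTemplate = "think_in_template"
        case supportsTools = "supports_tools"
        case supportsThinking = "supports_thinking"
        case family, modality
        case cacheType = "cache_type"
    }
}

// Against the pre-fix file, `JSONDecoder().decode(...)` would throw,
// because `"supports_thinking":` with no value is a JSON syntax error.
```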
jang_config.json
CHANGED

@@ -20,9 +20,9 @@
     "tool_parser": "zaya_xml",
     "think_in_template": true,
     "supports_tools": true,
-    "supports_thinking":
+    "supports_thinking": false,
     "family": "zaya",
     "modality": "text",
     "cache_type": "hybrid"
   }
-}
+}
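`jang_config.json` gets the identical fix, and per the README's Files section it also carries `cache_subtype=zaya_cca`. A minimal sketch of the load-time guard such a key enables, assuming the config is read into a generic dictionary; the function and error type are illustrative, not part of the actual Swift runtime:

```swift
enum ZayaBundleError: Error {
    case wrongCacheSubtype(found: String?)
}

/// Hypothetical load-time check: refuse the bundle unless the config
/// advertises the `zaya_cca` cache contract this bundle requires.
func validateCacheContract(jangConfig: [String: Any]) throws {
    let subtype = jangConfig["cache_subtype"] as? String
    guard subtype == "zaya_cca" else {
        throw ZayaBundleError.wrongCacheSubtype(found: subtype)
    }
}
```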