	---
	license: apache-2.0
	library_name: mlx
	base_model: Zyphra/ZAYA1-8B
	base_model_relation: quantized
	pipeline_tag: text-generation
	tags:
	- zaya
	- mixture-of-experts
	- hybrid-attention
	- cca-attention
	- mlx
	- apple-silicon
	- reasoning
	- tool-use
	- quantized
	- jang
	- jangtq
	- mxtq
	- jangtq-prestack
	- osaurus
	quantization_config:
	  family: jangtq
	  profile: JANGTQ4
	  group_size: 32
	  expert_layout: split_switch_mlp
	---

	<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

	# ZAYA1-8B-JANGTQ4

	Quantized Zyphra/ZAYA1-8B for Apple Silicon runtimes.

	| Field | Value |
	|---|---|
	| Source | [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B) |
	| License | Apache-2.0, inherited from upstream |
	| Format | JANGTQ4 |
	| Modality | text |
	| Bundle size | 4.65 GiB |
	| Tensor keys | 1965 |
	| Expert layout | Pre-stacked `zaya_block.experts.switch_mlp` |
	| Runtime status | Generation coherence: NOT INDEPENDENTLY PASSED for the quantized runtime bundle (missing coherence report); published as a format/runtime bundle pending downstream ZAYA runtime validation. |

	## Important Runtime Note

	This bundle requires a ZAYA-aware JANGTQ runtime that implements CCA attention state plus pre-stacked `switch_mlp` TurboQuant experts.

	ZAYA is not a stock `mlx_lm` architecture: it alternates CCA attention layers and top-1 MoE layers, and the runtime must honor both the ZAYA CCA state contract and the converted pre-stacked expert layout.

	## Runtime Pin Required

	Use a `vmlx-swift-lm` build that includes the ZAYA Swift runtime (`Libraries/MLXLLM/Models/Zaya.swift` + `MLXLMCommon/Cache/ZayaCCACache.swift` + `BatchEngine/BatchZayaCCACache.swift`). The first verified commit is `b9da180`; pin to it or to a newer build.


	## Architecture Summary

	- 80 decoder layers: alternating CCA attention and top-1 MoE
	- Hidden size 2048, 16 query heads, 2 KV heads, head dim ?
	- CCA state per attention layer: standard KV plus `conv_state [B,1280,2]` and `prev_hs [B,2048]`
	- 16 routed experts per MoE layer, top-1 routing with MOD skip route
	- Context length 131072, `rope_theta=5000000`
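The CCA state shapes listed above imply a small fixed-size cache on top of the standard KV cache. The sketch below works through that arithmetic. It assumes that exactly half of the 80 layers are CCA attention layers (a strict 1:1 alternation) and that the state is held in 16-bit floats; neither detail is stated in this card, and the standard KV cache is left out because the head dimension is not given.

```python
# Sketch: size of the extra CCA state (conv_state + prev_hs) per batch.
# Assumptions (not confirmed by the model card): half of the 80 layers are
# CCA attention layers, and the state is stored as 16-bit floats.

NUM_LAYERS = 80
NUM_ATTENTION_LAYERS = NUM_LAYERS // 2   # assumed strict 1:1 alternation
CONV_STATE_SHAPE = (1280, 2)             # per sequence, from conv_state [B,1280,2]
PREV_HS_SHAPE = (2048,)                  # per sequence, from prev_hs [B,2048]
BYTES_PER_ELEMENT = 2                    # assumed float16 / bfloat16


def num_elements(shape):
    """Product of the dimensions in a shape tuple."""
    n = 1
    for dim in shape:
        n *= dim
    return n


def cca_extra_state_bytes(batch_size: int) -> int:
    """Bytes used by conv_state and prev_hs across all attention layers."""
    per_layer = num_elements(CONV_STATE_SHAPE) + num_elements(PREV_HS_SHAPE)
    return batch_size * NUM_ATTENTION_LAYERS * per_layer * BYTES_PER_ELEMENT


if __name__ == "__main__":
    for batch in (1, 8):
        kib = cca_extra_state_bytes(batch) / 1024
        print(f"batch={batch}: {kib:.0f} KiB of extra CCA state")
```

Neither shape has a sequence dimension, so under these assumptions this part of the cache does not grow with context length, unlike the KV portion.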

	## Quantization

	4-bit MXTQ routed experts + 8-bit affine non-routed tensors.

	Passthrough floor used when preparing the first release:

	- `conv_qk.*`, `temp`, norms, residual scaling, router path, biases, and balancing biases are preserved as float tensors.
	- Embeddings and `lm_head` use 8-bit affine in the prepared bundles.
	- Text-only ZAYA1-8B has no vision_tower or LoRA tensors.
	- `jangtq_runtime.safetensors` is included in the bundle.

	`mxtq_bits`:

	```json
	{
	"routed_expert": 4,
	"attention": 8,
	"router": 16,
	"embed_tokens": 8,
	"lm_head": 8,
	"cca_conv": 16,
	"norms_residual": 16
	}
	```
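As a reading aid, the map above can be split into quantized categories and float passthrough categories. The sketch below assumes that a value of 16 marks a tensor group kept as a float tensor, which matches the passthrough list in this section but is not spelled out by the card. The function name `split_by_precision` is hypothetical.

```python
import json

# The mxtq_bits map as published in this model card.
MXTQ_BITS_JSON = """
{
  "routed_expert": 4,
  "attention": 8,
  "router": 16,
  "embed_tokens": 8,
  "lm_head": 8,
  "cca_conv": 16,
  "norms_residual": 16
}
"""


def split_by_precision(bits, passthrough_bits=16):
    """Separate quantized categories from float passthrough categories.

    Assumption: entries at `passthrough_bits` or above are kept as floats.
    """
    quantized = {k: v for k, v in bits.items() if v < passthrough_bits}
    passthrough = sorted(k for k, v in bits.items() if v >= passthrough_bits)
    return quantized, passthrough


if __name__ == "__main__":
    quantized, passthrough = split_by_precision(json.loads(MXTQ_BITS_JSON))
    for name, width in sorted(quantized.items(), key=lambda kv: kv[1]):
        print(f"{name}: {width}-bit")
    print("float passthrough:", ", ".join(passthrough))
```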

	## Bundle Verification

	- Safetensor headers scanned.
	- Source tensor coverage checked.
	- Converted bundles checked for `local_experts` removal.
	- Converted expert tensors checked for pre-stacked `switch_mlp` layout.
	- JANGTQ sidecars checked for the Swift runtime contract.
	- Capabilities verified: `family=zaya`, `supports_thinking=False`, `tool_parser=zaya_xml`.
	- Runtime coherence status recorded above.
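The header scan and layout checks above can be reproduced locally without loading any weights. A `.safetensors` file begins with an 8-byte little-endian header length followed by a JSON header that maps tensor names to their metadata. The sketch below reads only that header. Matching on the substrings `local_experts` and `switch_mlp` is an assumption about how the keys are named, and this is not the tooling used to prepare the bundle.

```python
import json
import struct
from pathlib import Path


def read_safetensors_keys(path):
    """Return tensor names from a .safetensors header without reading weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [name for name in header if name != "__metadata__"]


def check_expert_layout(bundle_dir):
    """Scan every .safetensors file in a directory and summarize expert keys.

    Note: the glob also picks up sidecar files stored as .safetensors.
    """
    keys = []
    for shard in sorted(Path(bundle_dir).glob("*.safetensors")):
        keys.extend(read_safetensors_keys(shard))
    return {
        "total_keys": len(keys),
        # Assumed naming: unconverted experts contain "local_experts".
        "leftover_local_experts": [k for k in keys if "local_experts" in k],
        # Assumed naming: pre-stacked experts contain "switch_mlp".
        "switch_mlp_keys": sum(1 for k in keys if "switch_mlp" in k),
    }


if __name__ == "__main__":
    import sys

    report = check_expert_layout(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(json.dumps(report, indent=2))
```

A converted bundle would be expected to report an empty `leftover_local_experts` list and a non-zero `switch_mlp_keys` count.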

	## Runtime Smoke Tests

	Before production use, run short deterministic prompts through the exact target runtime:

	- `What is 2+2? Answer with only the number.`
	- `What is the capital of France? Answer with one word.`
	- One chat-template prompt with thinking disabled.
	- One chat-template prompt with thinking enabled and enough output budget for the final answer.

	The first public bundle release records bundle integrity and runtime contract checks. Full generation quality depends on a ZAYA-aware runtime implementation.
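The two plain prompts above can be wrapped in a small harness. In this sketch, `generate` is a hypothetical callable that sends one prompt to the target runtime and returns the decoded text; how it is wired to the runtime is left open. The case-insensitive substring check is a deliberately loose pass criterion, and the chat-template prompts are not covered.

```python
from typing import Callable, List, Tuple

# Prompts taken from the list above, paired with a lowercase expected substring.
SMOKE_PROMPTS = [
    ("What is 2+2? Answer with only the number.", "4"),
    ("What is the capital of France? Answer with one word.", "paris"),
]


def run_smoke_tests(generate: Callable[[str], str]) -> List[Tuple[str, bool, str]]:
    """Run each prompt through `generate` and record (prompt, passed, output).

    `generate` is a hypothetical adapter around the target runtime; it should
    use deterministic (greedy) decoding so that results are repeatable.
    """
    results = []
    for prompt, expected in SMOKE_PROMPTS:
        output = generate(prompt)
        passed = expected in output.strip().lower()
        results.append((prompt, passed, output))
    return results


if __name__ == "__main__":
    # Stand-in generator for demonstration only; replace with a real adapter.
    canned = {SMOKE_PROMPTS[0][0]: "4", SMOKE_PROMPTS[1][0]: "Paris"}
    for prompt, passed, output in run_smoke_tests(lambda p: canned[p]):
        print(("PASS" if passed else "FAIL"), repr(output), "<-", prompt)
```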

	## Summary (translated from Korean)

	This bundle is Zyphra/ZAYA1-8B quantized for Apple Silicon MLX/JANG runtimes. It must only be used with a runtime that correctly implements ZAYA's CCA attention state and MoE routing.

	## Files

	- `config.json` carries `weight_format=mxtq`, `zaya_expert_layout=split_switch_mlp`.
	- `jang_config.json` carries `cache_subtype=zaya_cca`.
	- Tokenizer files and chat template are preserved from the upstream source snapshot.
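The fields listed above can be checked before a bundle is handed to a runtime. The sketch below assumes that each field is a top-level key in its JSON file, which this card does not confirm; `check_bundle_configs` is a hypothetical helper name.

```python
import json
from pathlib import Path

# Expected values as stated in the Files section.
# Assumption: each key sits at the top level of its JSON file.
EXPECTED = {
    "config.json": {
        "weight_format": "mxtq",
        "zaya_expert_layout": "split_switch_mlp",
    },
    "jang_config.json": {
        "cache_subtype": "zaya_cca",
    },
}


def check_bundle_configs(bundle_dir):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for filename, fields in EXPECTED.items():
        path = Path(bundle_dir) / filename
        if not path.is_file():
            problems.append(f"{filename}: missing")
            continue
        data = json.loads(path.read_text(encoding="utf-8"))
        for key, want in fields.items():
            got = data.get(key)
            if got != want:
                problems.append(f"{filename}: {key}={got!r}, expected {want!r}")
    return problems


if __name__ == "__main__":
    import sys

    issues = check_bundle_configs(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("\n".join(issues) if issues else "config fields match")
```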