	---
	license: mit
	license_name: deepseek-license
	library_name: mlx
	base_model: deepseek-ai/DeepSeek-V4-Flash
	base_model_relation: quantized
	pipeline_tag: text-generation
	tags:
	- mlx
	- jang
	- jangtq
	- jangtq2
	- jangtq-prestack
	- mxtq
	- deepseek
	- deepseek-v4
	- deepseek-v4-flash
	- moe
	- mla
	- hash-layers
	- mtp
	- apple-silicon
	- osaurus
	---

	<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

	# DeepSeek-V4-Flash-JANGTQ2

	DeepSeek-V4-Flash at 79.6 GB on disk (down from the 149 GB FP4+FP8 source):
	uniform 2-bit JANGTQ quantization on the routed experts, 8-bit affine on
	the other quantized weights, and the MTP head preserved.

	- Source: [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)
	(43 transformer layers + 1 MTP head, **256 routed experts top-6 + 1
	shared expert, 3 hash layers**, MLA + mHC residuals, ~284 B total)
	- Quantization: uniform 2-bit MXTQ on routed-expert MLP +
	8-bit affine on attention (`wq_a/wq_b/wkv/wo_a/wo_b`) / shared
	expert / Compressor / Indexer / embed / lm_head / MTP. RMSNorms,
	router gate, mHC fn matrices, attn_sink, ape stay fp16/fp32
	passthrough.
	- Variant: `std` (preserves MTP layer 43; one-token-per-forward
	until a JANG runtime ships the accept/reject speculative-decode loop).
	The companion `DeepSeek-V4-Flash-JANGTQ-K` variant drops MTP for a
	smaller bundle.
	- Routed-expert layout: pre-stacked along axis 0 under
	`ffn.experts.switch_mlp.{gate_proj, up_proj, down_proj}` per the
	JANGTQ-PRESTACK STANDARD. Sidecar `jangtq_runtime.safetensors`
	(~24 KB) ships both `(in=2048, bits=2)` and `(in=4096, bits=2)`
	codebooks + sign-flip vectors for Swift runtimes.
	- Bundle size: ~79.6 GB on-disk
	- Runs on: M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+
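
As a rough sanity check on the figures above, the sketch below derives the compression ratio and the average stored bits per weight. It assumes "GB" means 10^9 bytes and that "~284 B" is the total parameter count; the function name is illustrative and not part of `jang-tools`.

```python
# Back-of-envelope size check using only figures quoted in this card.
# Assumptions: "GB" = 10**9 bytes; "~284 B" = total parameter count.
SOURCE_GB = 149.0      # FP4+FP8 source checkpoint
BUNDLE_GB = 79.6       # this JANGTQ2 bundle
TOTAL_PARAMS = 284e9   # approximate total parameters

def effective_bits_per_weight(size_gb: float, n_params: float) -> float:
    """Average stored bits per parameter, metadata overhead included."""
    return size_gb * 1e9 * 8 / n_params

compression = SOURCE_GB / BUNDLE_GB
bits = effective_bits_per_weight(BUNDLE_GB, TOTAL_PARAMS)
print(f"compression vs source: {compression:.2f}x")   # 1.87x
print(f"effective bits/weight: {bits:.2f}")           # 2.24
```

The result sits slightly above 2 bits, which is what one would expect when the bulk of the weights are 2-bit routed experts and the remainder is stored at 8 bits or in fp16/fp32.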

	## Why top-6 + 2-bit holds

	DSV4-Flash routes each token through 6 of 256 experts plus 1 always-on
	shared expert and 3 hash layers, so the per-token output averages
	codebook noise across 7+ pathways. That is a much looser quality
	constraint than in top-1 architectures, where every token rides a single
	expert's quant error. MiniMax (top-2) and Hy3-preview (top-8) both
	ship coherent uniform JANGTQ2; DSV4, at top-6, sits between them.
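
The averaging argument can be illustrated with a toy simulation. This is only a sketch under strong assumptions: each pathway contributes independent, zero-mean, unit-variance error and all pathways are weighted equally. Real router gate weights are unequal and quantization errors can be correlated, so the actual benefit is likely smaller. The function name is hypothetical.

```python
import math
import random
import statistics

def noise_std_after_mixing(num_paths: int, trials: int = 20000, seed: int = 0) -> float:
    """Std of the equal-weight mean of `num_paths` independent unit-variance errors."""
    rng = random.Random(seed)
    mixed = [
        sum(rng.gauss(0.0, 1.0) for _ in range(num_paths)) / num_paths
        for _ in range(trials)
    ]
    return statistics.pstdev(mixed)

# Compare the simulated spread with the 1/sqrt(k) prediction for independent noise.
for k in (1, 2, 7, 8):
    simulated = noise_std_after_mixing(k)
    print(f"paths={k}: simulated={simulated:.3f}  1/sqrt(k)={1 / math.sqrt(k):.3f}")
```

Under these idealized assumptions, 7 pathways cut the error spread to roughly 38% of the single-expert case.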

	## Loading (Python)

	```bash
	pip install jang-tools mlx-lm
	```

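	```python
	from jang_tools.load_jangtq import load_jangtq_model
	from mlx_lm import generate

	# Load the quantized bundle and its tokenizer.
	model, tokenizer = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ2")

	# Render the chat template to a prompt string ending at the assistant turn.
	chat = tokenizer.apply_chat_template(
	    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
	    tokenize=False,
	    add_generation_prompt=True,
	)

	# Assumption: the returned model/tokenizer pair works with mlx-lm's
	# standard `generate` helper (mlx-lm is installed in the step above).
	print(generate(model, tokenizer, prompt=chat, max_tokens=64))
	```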

	`load_jangtq_model` auto-registers `model_type=deepseek_v4` via
	`jang_tools.dsv4` before building the MLX skeleton. The loader applies
	the DSV4-specific MLA absorb + fp32 SDPA + mHC + Compressor + Indexer
	patches automatically.

	## Runtime support matrix

	| Surface | Status |
	|---|---|
	| `jang-tools` Python (`load_jangtq_model`) | ✅ working |
	| `vmlx-swift-lm` Swift | ✅ working (`DeepseekV4JANGTQ` family path) |
	| MTP speculative decode | preserved-disabled: weights present (variant=std); accept/reject loop not yet in any JANG runtime |

	## Validated runtime contract

	- 43 transformer layers + 1 MTP head materialize; 40 sparse-MoE layers
	hydrate routed experts via TurboQuantLinear (2-bit MXTQ).
	- 33,792 MXTQ tensors / 522 affine / 706 passthrough.
	- Capabilities: `family=deepseek_v4`, `reasoning_parser=deepseek_r1`,
	`tool_parser=dsml`, `think_in_template=True`, `cache_type=mla`.

	## Reasoning + tools

	- Reasoning parser: `deepseek_r1`
	- Tool parser: `dsml` (DeepSeek Markup Language — distinct from
	`deepseek_tool_parser`; see `~/jang/research/DSV4-EVAL-NUANCES.md`)
	- Reasoning template: `<｜thinking_begin｜>...<｜thinking_end｜>` blocks
	via `enable_thinking=True` (default off — pass-through chat mode).
	Greedy `T=0` with `enable_thinking=True` collapses into repetition on
	DSV4; use `T=0.6` for pass@1 like the original DeepSeek release.
	- Cache: `mla` (Multi-head Latent Attention with kv_lora_rank=512)
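
When thinking is enabled, downstream code usually needs the reasoning separated from the final answer. The sketch below is a minimal stand-in that splits on the template delimiters quoted above. It is not the `deepseek_r1` parser, and the helper name is hypothetical.

```python
# Delimiters copied from the reasoning template described in this card.
THINK_BEGIN = "<｜thinking_begin｜>"
THINK_END = "<｜thinking_end｜>"

def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no complete block exists."""
    start = text.find(THINK_BEGIN)
    end = text.find(THINK_END, start + len(THINK_BEGIN)) if start != -1 else -1
    if start == -1 or end == -1:
        return "", text.strip()
    reasoning = text[start + len(THINK_BEGIN):end].strip()
    # Keep any text outside the thinking block as the answer.
    answer = (text[:start] + text[end + len(THINK_END):]).strip()
    return reasoning, answer

sample = f"{THINK_BEGIN}2 + 2 is basic addition.{THINK_END}4"
print(split_thinking(sample))  # ('2 + 2 is basic addition.', '4')
```

Output without a complete thinking block is returned unchanged as the answer, which matches the default pass-through chat mode.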

	## Credits

	- Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
	- Source model: DeepSeek AI
	- License: MIT, inherited from upstream