---
license: mit
license_name: deepseek-license
library_name: mlx
base_model: deepseek-ai/DeepSeek-V4-Flash
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- mlx
- jang
- jangtq
- jangtq2
- jangtq-prestack
- mxtq
- deepseek
- deepseek-v4
- deepseek-v4-flash
- moe
- mla
- hash-layers
- mtp
- apple-silicon
- osaurus
---

OsaurusAI

# DeepSeek-V4-Flash-JANGTQ2

**DeepSeek-V4-Flash in 79.6 GB on disk** (down from the 149 GB FP4+FP8 source): uniform **2-bit JANGTQ** quantization on routed experts, 8-bit affine on everything else, and a preserved MTP head.

- **Source:** [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) (43 transformer layers + 1 MTP head, **256 routed experts top-6 + 1 shared expert**, **3 hash layers**, MLA + mHC residuals, ~284 B total parameters)
- **Quantization:** uniform **2-bit MXTQ** on the routed-expert MLPs; 8-bit affine on attention (`wq_a/wq_b/wkv/wo_a/wo_b`), shared expert, Compressor, Indexer, embed, lm_head, and MTP. RMSNorms, router gate, mHC fn matrices, attn_sink, and ape stay as fp16/fp32 passthrough.
- **Variant:** `std` (preserves MTP layer 43; decoding stays one token per forward pass until a JANG runtime ships the accept/reject speculative-decode loop). The companion `DeepSeek-V4-Flash-JANGTQ-K` variant drops MTP for a smaller bundle.
- **Routed-expert layout:** **pre-stacked** along axis 0 under `ffn.experts.switch_mlp.{gate_proj, up_proj, down_proj}`, per the JANGTQ-PRESTACK STANDARD. The sidecar `jangtq_runtime.safetensors` (~24 KB) ships both the `(in=2048, bits=2)` and `(in=4096, bits=2)` codebooks plus sign-flip vectors for Swift runtimes.
- **Bundle size:** **~79.6 GB on disk**
- **Runs on:** M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+

## Why top-6 + 2-bit holds

DSV4-Flash routes each token through **6 of 256 experts** plus 1 always-on shared expert and 3 hash layers, so per-token output averages codebook noise across 7+ pathways. That is a much weaker quality constraint than in top-1 architectures, where every token rides a single expert's quantization error. MiniMax (top-2) and Hy3-preview (top-8) both ship coherent uniform JANGTQ2 builds; DSV4 sits between them.
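
The averaging argument can be illustrated with a toy calculation. This is only a sketch under strong assumptions that the card does not state: each pathway contributes independent, zero-mean quantization noise of equal variance, and the pathways are combined with the given mixing weights. Real router weights are non-uniform, the shared expert is 8-bit rather than 2-bit, and the hash layers are not modeled. `relative_noise` is a hypothetical helper written for this example, not part of `jang-tools`.

```python
import math

def relative_noise(weights):
    """Noise std of a weighted average of independent unit-variance noise terms.

    Toy model (assumption): every pathway carries the same quantization
    noise, and the noise terms are statistically independent.
    """
    total = sum(weights)
    return math.sqrt(sum(w * w for w in weights)) / total

# Equal mixing weights, purely for illustration.
top1 = relative_noise([1.0])             # single routed expert
top2 = relative_noise([1.0] * 2)         # two routed experts
seven_paths = relative_noise([1.0] * 7)  # 6 routed + 1 shared pathway

print(f"top-1: {top1:.3f}  top-2: {top2:.3f}  7 pathways: {seven_paths:.3f}")
# prints "top-1: 1.000  top-2: 0.707  7 pathways: 0.378"
```

Under these assumptions the residual noise shrinks as 1/sqrt(k) with k equally weighted pathways. Skewed routing weights reduce the benefit: if one expert dominates the mix, the result moves back toward the top-1 case.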

## Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

# Load the quantized weights and the tokenizer.
model, tokenizer = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ2")

# Render a single-turn conversation into a prompt string.
chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
)
```

`load_jangtq_model` auto-registers `model_type=deepseek_v4` via `jang_tools.dsv4` before building the MLX skeleton. The loader applies the DSV4-specific MLA absorb, fp32 SDPA, mHC, Compressor, and Indexer patches automatically.

## Runtime support matrix

| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working |
| `vmlx-swift-lm` Swift | ✅ working (`DeepseekV4JANGTQ` family path) |
| MTP speculative decode | preserved but disabled: weights are present (variant=std); the accept/reject loop is not yet in any JANG runtime |

## Validated runtime contract

- 43 transformer layers + 1 MTP head materialize; 40 sparse-MoE layers hydrate routed experts via TurboQuantLinear (2-bit MXTQ).
- 33,792 MXTQ tensors / 522 affine / 706 passthrough.
- Capabilities: `family=deepseek_v4`, `reasoning_parser=deepseek_r1`, `tool_parser=dsml`, `think_in_template=True`, `cache_type=mla`.

## Reasoning + tools

- **Reasoning parser:** `deepseek_r1`
- **Tool parser:** `dsml` (DeepSeek Markup Language, distinct from `deepseek_tool_parser`; see `~/jang/research/DSV4-EVAL-NUANCES.md`)
- **Reasoning template:** `<|thinking_begin|>...<|thinking_end|>` blocks via `enable_thinking=True` (off by default, which gives pass-through chat mode). Greedy decoding (`T=0`) with `enable_thinking=True` collapses into repetition on DSV4; sample at `T=0.6` for pass@1, as in the original DeepSeek release.
- **Cache:** `mla` (Multi-head Latent Attention with kv_lora_rank=512)

## Credits

- **Quantization + MLX runtime:** Jinho Jang (eric@osaurus.ai)
- **Source model:** DeepSeek AI
- **License:** MIT, inherited from upstream