| license: mit | |
| license_name: deepseek-license | |
| library_name: mlx | |
| base_model: deepseek-ai/DeepSeek-V4-Flash | |
| base_model_relation: quantized | |
| pipeline_tag: text-generation | |
| tags: | |
| - mlx | |
| - jang | |
| - jangtq | |
| - jangtq2 | |
| - jangtq-prestack | |
| - mxtq | |
| - deepseek | |
| - deepseek-v4 | |
| - deepseek-v4-flash | |
| - moe | |
| - mla | |
| - hash-layers | |
| - mtp | |
| - apple-silicon | |
| - osaurus | |
| <p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p> | |
| # DeepSeek-V4-Flash-JANGTQ2 | |
| **DeepSeek-V4-Flash at 79.6 GB on disk** (down from the 149 GB FP4+FP8 source): | |
| uniform **2-bit JANGTQ** quantization on the routed experts, 8-bit affine on | |
| the remaining quantized weights, and a preserved MTP head. | |
| - **Source:** [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | |
| (43 transformer layers + 1 MTP head, **256 routed experts top-6 + 1 | |
| shared expert**, **3 hash layers**, MLA + mHC residuals, ~284 B total) | |
| - **Quantization:** uniform **2-bit MXTQ** on routed-expert MLP + | |
| 8-bit affine on attention (`wq_a/wq_b/wkv/wo_a/wo_b`) / shared | |
| expert / Compressor / Indexer / embed / lm_head / MTP. RMSNorms, | |
| router gate, mHC fn matrices, attn_sink, ape stay fp16/fp32 | |
| passthrough. | |
| - **Variant:** `std` (preserves MTP layer 43; decoding stays at one token | |
| per forward pass until a JANG runtime ships the accept/reject | |
| speculative-decode loop). The companion `DeepSeek-V4-Flash-JANGTQ-K` | |
| variant drops MTP for a smaller bundle. | |
| - **Routed-expert layout:** **pre-stacked** along axis 0 under | |
| `ffn.experts.switch_mlp.{gate_proj, up_proj, down_proj}` per the | |
| JANGTQ-PRESTACK STANDARD. Sidecar `jangtq_runtime.safetensors` | |
| (~24 KB) ships both `(in=2048, bits=2)` and `(in=4096, bits=2)` | |
| codebooks + sign-flip vectors for Swift runtimes. | |
| - **Bundle size:** **~79.6 GB on-disk** | |
| - **Runs on:** M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+ | |
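As a rough sanity check on the sizes above, the on-disk figures can be converted into an average bits-per-parameter number. This is a minimal sketch that assumes decimal gigabytes (1 GB = 10^9 bytes) and takes the "~284 B total" figure at face value. The helper name `bits_per_param` is illustrative and not part of any JANG tooling.

```python
# Rough average bits per parameter implied by the sizes quoted above.
# Assumptions: 1 GB = 1e9 bytes, and ~284e9 total parameters.

def bits_per_param(size_gb: float, params_billion: float) -> float:
    """Average storage cost per parameter, in bits."""
    return (size_gb * 1e9 * 8) / (params_billion * 1e9)

TOTAL_PARAMS_B = 284.0   # "~284 B total" from the source model description
JANGTQ2_GB = 79.6        # this bundle
SOURCE_GB = 149.0        # FP4+FP8 source checkpoint

jangtq2_bpp = bits_per_param(JANGTQ2_GB, TOTAL_PARAMS_B)
source_bpp = bits_per_param(SOURCE_GB, TOTAL_PARAMS_B)
size_ratio = JANGTQ2_GB / SOURCE_GB

print(f"JANGTQ2: {jangtq2_bpp:.2f} bits/param")
print(f"source:  {source_bpp:.2f} bits/param")
print(f"bundle is {size_ratio:.0%} of the source size")
```

Under these assumptions the bundle averages about 2.24 bits per parameter against roughly 4.20 for the source. The average sits above 2 bits because it also covers the 8-bit affine tensors and the fp16/fp32 passthrough tensors.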
| ## Why top-6 + 2-bit holds | |
| DSV4-Flash routes each token through **6 of 256 experts** plus 1 always-on | |
| shared expert (and 3 hash layers), so the per-token output averages | |
| codebook noise across 7+ pathways. That is a much weaker quality | |
| constraint than in top-1 architectures, where every token rides a single | |
| expert's quantization error. MiniMax (top-2) and Hy3-preview (top-8) both | |
| ship coherent uniform JANGTQ2; DSV4 sits between them. | |
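The averaging argument can be illustrated with a toy simulation. It is a sketch under strong assumptions, not a model of the real network: each pathway contributes independent, zero-mean, unit-variance noise, and pathways are combined with equal weights. Real routers weight experts unequally and quantization errors need not be independent. The function name `averaged_noise_std` is hypothetical.

```python
import random
import statistics

def averaged_noise_std(num_paths: int, trials: int = 20000, seed: int = 0) -> float:
    """Std-dev of the mean of `num_paths` independent unit-variance noise terms."""
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(0.0, 1.0) for _ in range(num_paths)) / num_paths
        for _ in range(trials)
    ]
    return statistics.pstdev(means)

# Compare a single pathway with 2, 7 and 8 equally weighted pathways.
for k in (1, 2, 7, 8):
    print(f"{k} pathway(s): residual noise ~ {averaged_noise_std(k):.3f} "
          f"(ideal 1/sqrt(k) = {k ** -0.5:.3f})")
```

In this idealized setting the residual noise shrinks roughly as 1/sqrt(k), which is the intuition behind top-6 plus a shared expert tolerating 2-bit experts better than a top-1 design would.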
| ## Loading (Python) | |
| ```bash | |
| pip install jang-tools mlx-lm | |
| ``` | |
| ```python | |
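| from jang_tools.load_jangtq import load_jangtq_model | |
| from mlx_lm import generate | |
| # Load the quantized model and its tokenizer | |
| model, tokenizer = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ2") | |
| # Render the conversation into a single prompt string | |
| chat = tokenizer.apply_chat_template( | |
|     [{"role": "user", "content": "What is 2 + 2? Answer briefly."}], | |
|     tokenize=False, | |
|     add_generation_prompt=True, | |
| ) | |
| # Assumption: the loaded model works with mlx_lm's generate() | |
| text = generate(model, tokenizer, prompt=chat, max_tokens=64, verbose=True) | |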
| ``` | |
| `load_jangtq_model` auto-registers `model_type=deepseek_v4` via | |
| `jang_tools.dsv4` before building the MLX skeleton. The loader applies | |
| the DSV4-specific MLA absorb + fp32 SDPA + mHC + Compressor + Indexer | |
| patches automatically. | |
| ## Runtime support matrix | |
| | Surface | Status | | |
| |---|---| | |
| | `jang-tools` Python (`load_jangtq_model`) | ✅ working | | |
| | `vmlx-swift-lm` Swift | ✅ working — `DeepseekV4JANGTQ` family path | | |
| | MTP speculative decode | preserved-disabled — weights present (variant=std); accept/reject loop not yet in any JANG runtime | | |
| ## Validated runtime contract | |
| - 43 transformer layers + 1 MTP head materialize; 40 sparse-MoE layers | |
| hydrate routed experts via TurboQuantLinear (2-bit MXTQ). | |
| - 33,792 MXTQ tensors / 522 affine / 706 passthrough. | |
| - Capabilities: `family=deepseek_v4`, `reasoning_parser=deepseek_r1`, | |
| `tool_parser=dsml`, `think_in_template=True`, `cache_type=mla`. | |
| ## Reasoning + tools | |
| - **Reasoning parser:** `deepseek_r1` | |
| - **Tool parser:** `dsml` (DeepSeek Markup Language — distinct from | |
| `deepseek_tool_parser`; see `~/jang/research/DSV4-EVAL-NUANCES.md`) | |
| - **Reasoning template:** `<|thinking_begin|>...<|thinking_end|>` blocks | |
| via `enable_thinking=True` (off by default, which gives pass-through chat mode). | |
| Greedy decoding (`T=0`) with `enable_thinking=True` collapses into repetition on | |
| DSV4; sample at `T=0.6` for pass@1, as in the original DeepSeek release. | |
| - **Cache:** `mla` (Multi-head Latent Attention with kv_lora_rank=512) | |
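When thinking is enabled, downstream code usually needs to separate the reasoning block from the final answer. The sketch below shows one way to do that with a regular expression over the tag strings quoted above. It is an illustration only, not the `deepseek_r1` parser used by the runtimes, and it assumes the model emits complete, properly closed blocks. The name `split_reasoning` is hypothetical.

```python
import re

# Tag strings are taken from the reasoning template described above.
THINK_RE = re.compile(r"<\|thinking_begin\|>(.*?)<\|thinking_end\|>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no block is present."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer

sample = "<|thinking_begin|>2 + 2 is basic addition.<|thinking_end|>The answer is 4."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

An unterminated block (for example, output cut off by `max_tokens`) is left in the answer text by this sketch, so a production parser would need to handle that case separately.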
| ## Credits | |
| - **Quantization + MLX runtime:** Jinho Jang (eric@osaurus.ai) | |
| - **Source model:** DeepSeek AI | |
| - **License:** MIT, inherited from upstream | |