# Gemma 4 26B-A4B-it — JANG_4M (MoE, 4-bit)

**JANG** (Jang Adaptive N-bit Grading): mixed-precision quantization for Apple Silicon.

Osaurus natively supports JANG models. Download at [osaurus.ai](https://osaurus.ai).
## Results (200-question MMLU, no-thinking)
| Model | MMLU | Size | Speed |
|---|---|---|---|
| JANG_4M (4-bit) | 69.5% | 15 GB | 26.7 tok/s |
| MLX 4-bit | 70.5% | 15 GB | 25.7 tok/s |
| JANG_2L (2-bit) | 58.0% | 9.9 GB | 30.8 tok/s |
| MLX 2-bit | Broken — completely incoherent output | ~7 GB | — |
JANG_4M matches MLX 4-bit quality (69.5% vs 70.5%) at identical size, with slightly faster inference. At 4-bit, JANG's mixed-precision strategy keeps attention and routing at 8-bit while quantizing the experts to 4-bit, so accuracy is preserved and the pathways most sensitive to quantization error stay protected.
Note: Standard MLX 2-bit quantization on Gemma 4 produces completely incoherent, unusable output. Only JANG's mixed-precision approach makes 2-bit viable on this architecture.
## Per-Subject Breakdown
| Subject | JANG_4M | MLX 4-bit | JANG_2L |
|---|---|---|---|
| Abstract Algebra | 9/20 | 8/20 | 6/20 |
| Anatomy | 13/20 | 13/20 | 13/20 |
| Astronomy | 17/20 | 17/20 | 14/20 |
| College CS | 13/20 | 14/20 | 9/20 |
| College Physics | 14/20 | 14/20 | 11/20 |
| HS Biology | 19/20 | 18/20 | 18/20 |
| HS Chemistry | 14/20 | 15/20 | 7/20 |
| HS Mathematics | 6/20 | 7/20 | 7/20 |
| Logical Fallacies | 17/20 | 19/20 | 16/20 |
| World Religions | 17/20 | 16/20 | 15/20 |
| Total | 139/200 | 141/200 | 116/200 |
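The totals and headline percentages can be sanity-checked by summing the per-subject scores (the numbers below are copied directly from the table above):

```python
# Per-subject correct answers out of 20, copied from the breakdown table.
scores = {
    "JANG_4M":   [9, 13, 17, 13, 14, 19, 14, 6, 17, 17],
    "MLX 4-bit": [8, 13, 17, 14, 14, 18, 15, 7, 19, 16],
    "JANG_2L":   [6, 13, 14, 9, 11, 18, 7, 7, 16, 15],
}

for model, per_subject in scores.items():
    total = sum(per_subject)
    pct = 100 * total / 200  # 10 subjects x 20 questions each
    print(f"{model}: {total}/200 = {pct:.1f}%")
```

This reproduces 139/200 (69.5%), 141/200 (70.5%), and 116/200 (58.0%), matching the headline results table.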
## Model Details
| Metric | Value |
|---|---|
| Source | google/gemma-4-26b-a4b-it |
| Architecture | MoE (128 experts, top-8 active) + Hybrid Sliding/Global Attention |
| Profile | JANG_4M (CRITICAL=8-bit, IMPORTANT=4-bit, COMPRESS=4-bit) |
| Actual avg bits | 4.26 |
| Model size | 15 GB (vs ~50 GB bf16) |
| Vision | Yes (multimodal, float16 passthrough) |
| Format | JANG v2 (MLX-native safetensors, instant load) |
| Parameters | 26B total, ~4B active per token |
## Architecture Highlights
- 128 MoE experts with top-8 routing + parallel shared dense MLP
- Hybrid attention: 25 sliding-window layers + 5 full-attention layers
- Dual head dimensions: 256 (sliding) / 512 (global)
- K=V weight sharing on global attention layers
- Vision encoder preserved in float16 for multimodal inference
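As a rough illustration of the top-8-of-128 routing mentioned above, here is a minimal sketch of top-k expert selection with softmax-renormalized gate weights. This is pure-Python pedagogy under assumed conventions (softmax over only the selected logits), not JANG's or Gemma's actual implementation:

```python
import math

def top_k_route(router_logits, k=8):
    """Pick the top-k experts and softmax-normalize their gate weights.

    router_logits: one logit per expert (128 for this model).
    Returns (expert_indices, gate_weights) with the weights summing to 1.
    """
    # Indices of the k largest logits
    indices = sorted(range(len(router_logits)),
                     key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over just the selected logits (numerically stabilized)
    m = max(router_logits[i] for i in indices)
    exps = [math.exp(router_logits[i] - m) for i in indices]
    z = sum(exps)
    return indices, [e / z for e in exps]

# Toy example: 128 experts with ramp logits, so experts 120..127 win
logits = [i / 100 for i in range(128)]
experts, weights = top_k_route(logits, k=8)
print(experts)                  # [127, 126, 125, 124, 123, 122, 121, 120]
print(round(sum(weights), 6))   # 1.0
```

Only the 8 selected experts run for a given token, which is why a 26B-parameter model activates only ~4B parameters per step.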
## JANG_4M Bit Allocation
| Tier | Components | Bits |
|---|---|---|
| CRITICAL | Attention (Q/K/V/O), router, shared MLP, embeddings | 8 |
| IMPORTANT | Gate proj, up proj | 4 |
| COMPRESS | Expert MLP (down proj), remaining weights | 4 |
JANG_4M offers a balanced quality-size tradeoff: keeping attention and routing at 8-bit precision ensures coherent generation, while 4-bit experts keep the model compact enough for 16 GB MacBooks.
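A tier assignment like the one in the table can be sketched as a pattern match on weight names. The name patterns below are illustrative guesses, not JANG's actual rules, and since IMPORTANT and COMPRESS both resolve to 4 bits in this profile they collapse into one branch:

```python
def jang_4m_bits(name):
    """Assign a bit width to a weight by name, mirroring the tier table above."""
    # CRITICAL tier: attention projections, router, shared MLP, embeddings
    CRITICAL = ("q_proj", "k_proj", "v_proj", "o_proj",
                "router", "shared", "embed")
    if any(tag in name for tag in CRITICAL):
        return 8
    # IMPORTANT + COMPRESS tiers: expert MLPs and everything else
    return 4

print(jang_4m_bits("model.layers.3.self_attn.k_proj.weight"))      # 8
print(jang_4m_bits("model.layers.3.mlp.experts.5.up_proj.weight")) # 4
```

Because the expert MLPs hold the bulk of the parameters, the handful of 8-bit tensors raises the average only slightly, consistent with the 4.26 avg bits reported above.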
## Install

```bash
pip install "jang[mlx]"
```

For vision:

```bash
pip install "jang[vlm]"
```
## Quick Start

```python
from jang_tools.loader import load_jang_model
from mlx_lm.sample_utils import make_sampler
from mlx_lm.generate import generate_step
import mlx.core as mx

model, tokenizer = load_jang_model("OsaurusAI/Gemma-4-26B-A4B-it-JANG_4M")
sampler = make_sampler(temp=0.7)
tokens = tokenizer.encode("Explain quantum computing in simple terms.")

# Stream tokens one at a time, stopping at end-of-sequence
for tok, _ in generate_step(prompt=mx.array(tokens), model=model, max_tokens=200, sampler=sampler):
    t = tok.item() if hasattr(tok, "item") else int(tok)
    print(tokenizer.decode([t]), end="", flush=True)
    if t == tokenizer.eos_token_id:
        break
```
## VLM Inference

```python
from jang_tools.loader import load_jang_vlm_model
from mlx_vlm import generate

model, processor = load_jang_vlm_model("OsaurusAI/Gemma-4-26B-A4B-it-JANG_4M")

# Build a chat prompt that interleaves the image and the text question
prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": [
        {"type": "image", "image": "photo.jpg"},
        {"type": "text", "text": "Describe this image."},
    ]}],
    add_generation_prompt=True,
    tokenize=False,
)
result = generate(model, processor, prompt, ["photo.jpg"], max_tokens=200)
print(result.text)
```
## Links

Created by Jinho Jang — [jangq.ai](https://jangq.ai) · [osaurus.ai](https://osaurus.ai)