Osaurus AI

Gemma 4 26B-A4B-it — JANG_4M (MoE, 4-bit)

JANG — Jang Adaptive N-bit Grading | Mixed-Precision Quantization for Apple Silicon



Osaurus natively supports JANG models. Download at osaurus.ai.


Results (200-question MMLU, no-thinking)

| Model | MMLU | Size | Speed |
|---|---|---|---|
| JANG_4M (4-bit) | 69.5% | 15 GB | 26.7 tok/s |
| MLX 4-bit | 70.5% | 15 GB | 25.7 tok/s |
| JANG_2L (2-bit) | 58.0% | 9.9 GB | 30.8 tok/s |
| MLX 2-bit | broken (completely incoherent output) | ~7 GB | — |

JANG_4M matches MLX 4-bit quality (69.5% vs 70.5%) at identical size, with slightly faster inference. At 4-bit, JANG's mixed-precision strategy keeps attention and routing at 8-bit while quantizing the experts to 4-bit, delivering comparable accuracy while protecting the pathways most sensitive to quantization error.

Note: Standard MLX 2-bit quantization on Gemma 4 produces completely incoherent, unusable output. Only JANG's mixed-precision approach makes 2-bit viable on this architecture.

Per-Subject Breakdown

| Subject | JANG_4M | MLX 4-bit | JANG_2L |
|---|---|---|---|
| Abstract Algebra | 9/20 | 8/20 | 6/20 |
| Anatomy | 13/20 | 13/20 | 13/20 |
| Astronomy | 17/20 | 17/20 | 14/20 |
| College CS | 13/20 | 14/20 | 9/20 |
| College Physics | 14/20 | 14/20 | 11/20 |
| HS Biology | 19/20 | 18/20 | 18/20 |
| HS Chemistry | 14/20 | 15/20 | 7/20 |
| HS Mathematics | 6/20 | 7/20 | 7/20 |
| Logical Fallacies | 17/20 | 19/20 | 16/20 |
| World Religions | 17/20 | 16/20 | 15/20 |
| Total | 139/200 | 141/200 | 116/200 |
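As a sanity check, the per-subject rows reproduce the totals and headline MMLU percentages:

```python
# Cross-check: per-subject scores (in table order above) should sum to the
# reported totals and match the headline MMLU percentages.
scores = {
    "JANG_4M": [9, 13, 17, 13, 14, 19, 14, 6, 17, 17],
    "MLX 4-bit": [8, 13, 17, 14, 14, 18, 15, 7, 19, 16],
    "JANG_2L": [6, 13, 14, 9, 11, 18, 7, 7, 16, 15],
}
for name, s in scores.items():
    total = sum(s)
    print(f"{name}: {total}/200 = {100 * total / 200:.1f}%")
```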

Model Details

| Metric | Value |
|---|---|
| Source | google/gemma-4-26b-a4b-it |
| Architecture | MoE (128 experts, top-8 active) + hybrid sliding/global attention |
| Profile | JANG_4M (CRITICAL=8-bit, IMPORTANT=4-bit, COMPRESS=4-bit) |
| Actual avg bits | 4.26 |
| Model size | 15 GB (vs ~50 GB bf16) |
| Vision | Yes (multimodal, float16 passthrough) |
| Format | JANG v2 (MLX-native safetensors, instant load) |
| Parameters | ~26B total, ~4B active per token |
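The 4.26 average bit-width falls out of the tier mix. A minimal sketch, assuming roughly 6.5% of weights land in the 8-bit CRITICAL tier and the rest at 4-bit (an illustrative split, not the actual layer census):

```python
def avg_bits(tiers):
    """tiers: iterable of (fraction_of_params, bit_width) pairs."""
    return sum(frac * bits for frac, bits in tiers)

# Assumed split: ~6.5% of weights at 8-bit (attention/routing/shared MLP),
# the remaining ~93.5% at 4-bit (experts). Fractions are illustrative.
tiers = [(0.065, 8), (0.935, 4)]
print(round(avg_bits(tiers), 2))  # 4.26
```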

Architecture Highlights

  • 128 MoE experts with top-8 routing + parallel shared dense MLP
  • Hybrid attention: 25 sliding-window layers + 5 full-attention layers
  • Dual head dimensions: 256 (sliding) / 512 (global)
  • K=V weight sharing on global attention layers
  • Vision encoder preserved in float16 for multimodal inference
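The 25/5 sliding/global split suggests a periodic layer schedule. A hypothetical sketch, assuming one full-attention layer closes every block of six (the actual placement is not documented here):

```python
# Hypothetical 30-layer hybrid attention schedule: 25 sliding-window
# layers and 5 full-attention ("global") layers. The 6-layer period is
# an assumption, not the model's verified layout.
def layer_kinds(num_layers=30, global_every=6):
    return ["global" if (i + 1) % global_every == 0 else "sliding"
            for i in range(num_layers)]

kinds = layer_kinds()
print(kinds.count("sliding"), kinds.count("global"))  # 25 5
```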

JANG_4M Bit Allocation

| Tier | Components | Bits |
|---|---|---|
| CRITICAL | Attention (Q/K/V/O), router, shared MLP, embeddings | 8 |
| IMPORTANT | Gate proj, up proj | 4 |
| COMPRESS | Expert MLP (down proj), remaining weights | 4 |

JANG_4M provides a balanced quality-size tradeoff: attention and routing at 8-bit precision ensure coherent generation, while 4-bit experts keep the model compact enough for 16 GB MacBooks.
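In principle the tier table can be applied by matching parameter names. A hypothetical sketch; the name fragments below are illustrative assumptions, not JANG's actual matching rules:

```python
# Assumed name fragments per tier; the real JANG classifier may differ.
CRITICAL = ("q_proj", "k_proj", "v_proj", "o_proj", "router", "shared_mlp", "embed")
IMPORTANT = ("gate_proj", "up_proj")

def tier_bits(param_name: str) -> int:
    """Return the bit-width a parameter would receive under JANG_4M."""
    if any(key in param_name for key in CRITICAL):
        return 8  # attention, routing, shared MLP, embeddings
    if any(key in param_name for key in IMPORTANT):
        return 4  # gate/up projections
    return 4      # COMPRESS: expert down_proj and remaining weights

print(tier_bits("model.layers.0.self_attn.q_proj.weight"))        # 8
print(tier_bits("model.layers.3.mlp.experts.7.down_proj.weight"))  # 4
```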

Install

```bash
pip install "jang[mlx]"
```

For vision:

```bash
pip install "jang[vlm]"
```

Quick Start

```python
from jang_tools.loader import load_jang_model
from mlx_lm.sample_utils import make_sampler
from mlx_lm.generate import generate_step
import mlx.core as mx

model, tokenizer = load_jang_model("OsaurusAI/Gemma-4-26B-A4B-it-JANG_4M")
sampler = make_sampler(temp=0.7)

tokens = tokenizer.encode("Explain quantum computing in simple terms.")
for tok, _ in generate_step(prompt=mx.array(tokens), model=model, max_tokens=200, sampler=sampler):
    # generate_step may yield scalar arrays or ints depending on version
    t = tok.item() if hasattr(tok, "item") else int(tok)
    print(tokenizer.decode([t]), end="", flush=True)
    if t == tokenizer.eos_token_id:
        break
```

VLM Inference

```python
from jang_tools.loader import load_jang_vlm_model
from mlx_vlm import generate

model, processor = load_jang_vlm_model("OsaurusAI/Gemma-4-26B-A4B-it-JANG_4M")

prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": [
        {"type": "image", "image": "photo.jpg"},
        {"type": "text", "text": "Describe this image."}
    ]}], add_generation_prompt=True, tokenize=False)

result = generate(model, processor, prompt, ["photo.jpg"], max_tokens=200)
print(result.text)
```

Links


Created by Jinho Jang — jangq.ai · osaurus.ai
