Osaurus AI

Gemma 4 26B-A4B-it — JANG_4M (MoE, 4-bit)

JANG — Jang Adaptive N-bit Grading | Mixed-Precision Quantization for Apple Silicon



Osaurus natively supports JANG models. Download at osaurus.ai.


Results (200-question MMLU, no-thinking)

| Model | MMLU | Size | Speed |
|---|---|---|---|
| JANG_4M (4-bit) | 69.5% | 15 GB | 26.7 tok/s |
| MLX 4-bit | 70.5% | 15 GB | 25.7 tok/s |
| JANG_2L (2-bit) | 58.0% | 9.9 GB | 30.8 tok/s |
| MLX 2-bit | broken (completely incoherent output) | ~7 GB | — |

JANG_4M matches MLX 4-bit quality (69.5% vs 70.5%) at identical size, with slightly faster inference. At 4-bit, JANG's mixed-precision strategy keeps attention and routing at 8-bit while quantizing the experts to 4-bit, delivering comparable accuracy while protecting the pathways most sensitive to quantization error.

Note: Standard MLX 2-bit quantization on Gemma 4 produces completely incoherent, unusable output. Only JANG's mixed-precision approach makes 2-bit viable on this architecture.

Per-Subject Breakdown

| Subject | JANG_4M | MLX 4-bit | JANG_2L |
|---|---|---|---|
| Abstract Algebra | 9/20 | 8/20 | 6/20 |
| Anatomy | 13/20 | 13/20 | 13/20 |
| Astronomy | 17/20 | 17/20 | 14/20 |
| College CS | 13/20 | 14/20 | 9/20 |
| College Physics | 14/20 | 14/20 | 11/20 |
| HS Biology | 19/20 | 18/20 | 18/20 |
| HS Chemistry | 14/20 | 15/20 | 7/20 |
| HS Mathematics | 6/20 | 7/20 | 7/20 |
| Logical Fallacies | 17/20 | 19/20 | 16/20 |
| World Religions | 17/20 | 16/20 | 15/20 |
| Total | 139/200 | 141/200 | 116/200 |
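As a sanity check, the per-subject rows reproduce the totals and headline MMLU percentages:

```python
# Cross-check: per-subject scores (in table order above) should sum to the
# reported totals and match the headline MMLU percentages.
scores = {
    "JANG_4M": [9, 13, 17, 13, 14, 19, 14, 6, 17, 17],
    "MLX 4-bit": [8, 13, 17, 14, 14, 18, 15, 7, 19, 16],
    "JANG_2L": [6, 13, 14, 9, 11, 18, 7, 7, 16, 15],
}
for name, s in scores.items():
    total = sum(s)
    print(f"{name}: {total}/200 = {100 * total / 200:.1f}%")
```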

Model Details

| Metric | Value |
|---|---|
| Source | google/gemma-4-26b-a4b-it |
| Architecture | MoE (128 experts, top-8 active) + hybrid sliding/global attention |
| Profile | JANG_4M (CRITICAL=8-bit, IMPORTANT=4-bit, COMPRESS=4-bit) |
| Actual avg bits | 4.26 |
| Model size | 15 GB (vs ~50 GB bf16) |
| Vision | Yes (multimodal, float16 passthrough) |
| Format | JANG v2 (MLX-native safetensors, instant load) |
| Parameters | ~26B total, ~4B active per token |
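The 4.26 average bit-width falls out of the tier mix. A minimal sketch, assuming roughly 6.5% of weights land in the 8-bit CRITICAL tier and the rest at 4-bit (an illustrative split, not the actual layer census):

```python
def avg_bits(tiers):
    """tiers: iterable of (fraction_of_params, bit_width) pairs."""
    return sum(frac * bits for frac, bits in tiers)

# Assumed split: ~6.5% of weights at 8-bit (attention/routing/shared MLP),
# the remaining ~93.5% at 4-bit (experts). Fractions are illustrative.
tiers = [(0.065, 8), (0.935, 4)]
print(round(avg_bits(tiers), 2))  # 4.26
```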

Architecture Highlights

  • 128 MoE experts with top-8 routing + parallel shared dense MLP
  • Hybrid attention: 25 sliding-window layers + 5 full-attention layers
  • Dual head dimensions: 256 (sliding) / 512 (global)
  • K=V weight sharing on global attention layers
  • Vision encoder preserved in float16 for multimodal inference
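The 25/5 sliding/global split suggests a periodic layer schedule. A hypothetical sketch, assuming one full-attention layer closes every block of six (the actual placement is not documented here):

```python
# Hypothetical 30-layer hybrid attention schedule: 25 sliding-window
# layers and 5 full-attention ("global") layers. The 6-layer period is
# an assumption, not the model's verified layout.
def layer_kinds(num_layers=30, global_every=6):
    return ["global" if (i + 1) % global_every == 0 else "sliding"
            for i in range(num_layers)]

kinds = layer_kinds()
print(kinds.count("sliding"), kinds.count("global"))  # 25 5
```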

JANG_4M Bit Allocation

| Tier | Components | Bits |
|---|---|---|
| CRITICAL | Attention (Q/K/V/O), router, shared MLP, embeddings | 8 |
| IMPORTANT | Gate proj, up proj | 4 |
| COMPRESS | Expert MLP (down proj), remaining weights | 4 |

JANG_4M provides a balanced quality-size tradeoff: attention and routing at 8-bit precision ensure coherent generation, while 4-bit experts keep the model compact enough for 16 GB MacBooks.
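In principle the tier table can be applied by matching parameter names. A hypothetical sketch; the name fragments below are illustrative assumptions, not JANG's actual matching rules:

```python
# Assumed name fragments per tier; the real JANG classifier may differ.
CRITICAL = ("q_proj", "k_proj", "v_proj", "o_proj", "router", "shared_mlp", "embed")
IMPORTANT = ("gate_proj", "up_proj")

def tier_bits(param_name: str) -> int:
    """Return the bit-width a parameter would receive under JANG_4M."""
    if any(key in param_name for key in CRITICAL):
        return 8  # attention, routing, shared MLP, embeddings
    if any(key in param_name for key in IMPORTANT):
        return 4  # gate/up projections
    return 4      # COMPRESS: expert down_proj and remaining weights

print(tier_bits("model.layers.0.self_attn.q_proj.weight"))        # 8
print(tier_bits("model.layers.3.mlp.experts.7.down_proj.weight"))  # 4
```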

Install

```bash
pip install "jang[mlx]"
```

For vision:

```bash
pip install "jang[vlm]"
```

Quick Start

```python
from jang_tools.loader import load_jang_model
from mlx_lm.sample_utils import make_sampler
from mlx_lm.generate import generate_step
import mlx.core as mx

model, tokenizer = load_jang_model("OsaurusAI/Gemma-4-26B-A4B-it-JANG_4M")
sampler = make_sampler(temp=0.7)

tokens = tokenizer.encode("Explain quantum computing in simple terms.")
for tok, _ in generate_step(prompt=mx.array(tokens), model=model, max_tokens=200, sampler=sampler):
    # generate_step may yield scalar arrays or ints depending on version
    t = tok.item() if hasattr(tok, "item") else int(tok)
    print(tokenizer.decode([t]), end="", flush=True)
    if t == tokenizer.eos_token_id:
        break
```

VLM Inference

```python
from jang_tools.loader import load_jang_vlm_model
from mlx_vlm import generate

model, processor = load_jang_vlm_model("OsaurusAI/Gemma-4-26B-A4B-it-JANG_4M")

prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": [
        {"type": "image", "image": "photo.jpg"},
        {"type": "text", "text": "Describe this image."}
    ]}], add_generation_prompt=True, tokenize=False)

result = generate(model, processor, prompt, ["photo.jpg"], max_tokens=200)
print(result.text)
```

Links


Created by Jinho Jang — jangq.ai · osaurus.ai
