j1-micro-1.7B (MLX 4-bit Quantized)

MLX 4-bit quantized version of Haize Labs' j1-micro, a 1.7B judge/reward model that matches Claude-3-Opus and GPT-4o-mini on RewardBench (80.7%) despite being 100x smaller.

This repo contains the MLX 4-bit quantized weights for fast inference on Apple Silicon Macs, plus the original LoRA adapter for GPU inference via vLLM.

What This Model Does

j1-micro is a pairwise preference judge: given two responses, it generates a structured rubric, reasons through it, and scores each response. It was trained with GRPO (Group Relative Policy Optimization) plus SPCT (Self-Principled Critique Tuning) on Skywork-Reward-Preference-80K.

The model invents its own evaluation criteria for each query, then scores both responses against them. This rubric-then-score structure is how a 1.7B model matches or beats models two orders of magnitude larger.
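As a sketch of the pairwise setup, the judge prompt wraps the conversation context and both candidate responses in the delimiters the model was trained on (the template shown in the Quick Start below). `build_judge_prompt` is an illustrative helper, not part of this repo:

```python
# Illustrative helper: format a pairwise comparison using the delimiter
# style the model expects (see the Quick Start prompt template).
def build_judge_prompt(context: str, response_a: str, response_b: str) -> str:
    return (
        "You are a skilled little expert at scoring responses...\n"
        "#### Conversation Context ####\n"
        f"{context}\n"
        "#### Responses to be Scored ####\n"
        "[The Begin of Response A]\n"
        f"{response_a}\n"
        "[The End of Response A]\n"
        "[The Begin of Response B]\n"
        f"{response_b}\n"
        "[The End of Response B]"
    )
```

The helper only does string assembly; the system prompt (which pins the XML output format) is passed separately, as in the Quick Start.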

Performance

Model                   Params   RewardBench
Tulu-2-70b              70B      77.2%
Llama-3-70B-Instruct    70B      77.0%
Claude-3-Opus           200B+    80.1%
GPT-4o-mini             ~8B      80.1%
j1-micro (LoRA, FP16)   1.7B     80.7%
j1-micro (MLX 4-bit)    1.7B     75.0%

MLX 4-bit quantized performance, measured on a 100-sample RewardBench subset:

  • Accuracy: 75.0% (0% format error rate)
  • Latency: ~3.0s avg, 2.9s p50, 3.8s p95 (M-series Mac)
  • Memory: 2.0 GB peak

Files

mlx/                     # MLX 4-bit quantized (Apple Silicon)
  model.safetensors      # 968 MB
  config.json
  tokenizer.json
  tokenizer_config.json
  ...
lora/                    # LoRA adapter (GPU via vLLM/PEFT)
  adapter_model.safetensors  # 67 MB
  adapter_config.json
  tokenizer.json
  ...

Quick Start (MLX on Mac)

pip install mlx-lm huggingface_hub

# mlx_lm's load() takes a local path or a repo id, but not a subfolder,
# so fetch the mlx/ subfolder first and load it from disk.
from huggingface_hub import snapshot_download
from mlx_lm import load, generate

local_path = snapshot_download("rachittshah/j1-micro", allow_patterns=["mlx/*"])
model, tokenizer = load(f"{local_path}/mlx")

SYSTEM = """You are an expert XML wrangler. You must respond in the following format:
<specific_criteria>...</specific_criteria>
<analysis>...</analysis>
<scores>\\boxed{..., ...}</scores>
Please only respond in English."""

prompt = """You are a skilled little expert at scoring responses...
#### Conversation Context ####
What is the capital of France?
#### Responses to be Scored ####
[The Begin of Response A]
The capital of France is Paris, located in northern France along the Seine River.
[The End of Response A]
[The Begin of Response B]
France's capital is Lyon, a major city in southeastern France.
[The End of Response B]"""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": prompt},
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=formatted, max_tokens=2048)
print(response)

Quick Start (vLLM with LoRA)

# Download the LoRA adapter locally, then serve with vLLM
# (--lora-modules expects a local adapter path, not a repo subfolder)
huggingface-cli download rachittshah/j1-micro --include "lora/*" --local-dir j1-micro
vllm serve Qwen/Qwen3-1.7B \
  --enable-lora \
  --lora-modules j1-micro=j1-micro/lora

# Or load the adapter with PEFT
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "rachittshah/j1-micro", subfolder="lora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

Output Format

The model outputs structured XML:

<specific_criteria>
1. Factual accuracy (weight: 0.35) — correctness of stated facts
2. Specificity (weight: 0.25) — concrete details vs vague claims
3. Completeness (weight: 0.2) — coverage of the topic
4. Clarity (weight: 0.2) — clear, well-organized explanation
</specific_criteria>
<analysis>
Response A: Factual accuracy 9/10 — correctly identifies Paris...
Response B: Factual accuracy 2/10 — incorrectly states Lyon...
</analysis>
<scores>
\boxed{8, 3}
</scores>
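Downstream code usually needs the two numeric scores out of the `<scores>` block. A minimal stdlib parser (a hypothetical helper, not shipped with the repo) might look like:

```python
import re

def parse_scores(output: str):
    """Extract the (score_a, score_b) pair from a <scores>\\boxed{a, b}</scores> block.

    Returns None when the model fails to emit the expected format.
    """
    m = re.search(r"<scores>\s*\\boxed\{([^}]*)\}\s*</scores>", output)
    if m is None:
        return None  # format error: no well-formed scores block
    a, b = (float(x) for x in m.group(1).split(","))
    return a, b

parse_scores("<scores>\n\\boxed{8, 3}\n</scores>")  # -> (8.0, 3.0)
```

A `None` return can be counted toward the format error rate reported in the Performance section.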

Training Details

  • Base model: Qwen/Qwen3-1.7B (Apache 2.0)
  • Method: GRPO + SPCT (Self-Principled Critique Tuning)
  • Data: Skywork-Reward-Preference-80K-v0.2
  • LoRA: rank=16, alpha=32, dropout=0.1, all attention + MLP projections
  • Hardware: 1x A100 80GB, <24h training
  • Cost: ~$25
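To see why the adapter file is only ~67 MB: a rank-r LoRA adds r * (d_in + d_out) trainable parameters per adapted weight matrix. A back-of-the-envelope sketch (the dimensions below are illustrative, not the exact Qwen3-1.7B shapes):

```python
# LoRA inserts A (r x d_in) and B (d_out x r) beside each adapted weight,
# adding r * (d_in + d_out) trainable parameters per matrix.
def lora_params(d_in: int, d_out: int, r: int = 16) -> int:
    return r * (d_in + d_out)

# A square 2048x2048 projection (illustrative size) at rank 16:
per_matrix = lora_params(2048, 2048)  # 16 * 4096 = 65,536 params

# The 67 MB safetensors file at 2 bytes/param (FP16/BF16) implies
# roughly 33-34M adapter parameters summed across all layers.
approx_total_params = 67_000_000 // 2
```

Adapting all attention and MLP projections across every layer, as listed above, accounts for the total.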

Citation

Original model by Haize Labs:

@misc{j1micro2025,
    title = {j1-micro and j1-nano: Tiny Generalist Reward Models via Inference-Time Rubric Proposal},
    author = {Haize Labs},
    url = {https://github.com/haizelabs/j1-micro},
    month = {May},
    year = {2025}
}

License

Apache 2.0 (both base model Qwen3-1.7B and LoRA adapter).

