---
base_model: Qwen/Qwen3-1.7B
library_name: mlx
datasets:
  - Skywork/Skywork-Reward-Preference-80K-v0.2
  - allenai/reward-bench
language:
  - en
license: apache-2.0
pipeline_tag: text-generation
tags:
  - mlx
  - reward-model
  - judge-model
  - grpo
  - lora
  - spct
  - apple-silicon
model-index:
  - name: j1-micro
    results:
      - task:
          type: reward-modeling
          name: Reward Modeling
        dataset:
          name: RewardBench
          type: allenai/reward-bench
        metrics:
          - type: accuracy
            value: 80.7
            name: RewardBench Accuracy (reported by Haize Labs)
      - task:
          type: reward-modeling
          name: Reward Modeling (MLX 4-bit)
        dataset:
          name: RewardBench (100-sample subset)
          type: allenai/reward-bench
        metrics:
          - type: accuracy
            value: 75.0
            name: RewardBench Accuracy (MLX 4-bit quantized)
---

# j1-micro-1.7B (MLX 4-bit Quantized)

MLX 4-bit quantized version of Haize Labs' j1-micro, a 1.7B judge/reward model that matches Claude-3-Opus and GPT-4o-mini on RewardBench (80.7%) despite being 100x smaller.

This repo contains the MLX 4-bit quantized weights for fast inference on Apple Silicon Macs, plus the original LoRA adapter for GPU inference via vLLM.

## What This Model Does

j1-micro is a pairwise preference judge: given two responses, it generates a structured rubric, reasons through it, and scores each response. It was trained with GRPO (Group Relative Policy Optimization) plus SPCT (Self-Principled Critique Tuning) on Skywork-Reward-Preference-80K-v0.2.

The model invents its own evaluation criteria for each query, then scores the responses against them. This structured reasoning is how a 1.7B model matches or beats judges more than 100x its size.

## Performance

| Model | Params | RewardBench |
| --- | --- | --- |
| Tulu-2-70b | 70B | 77.2% |
| Llama-3-70B-Instruct | 70B | 77.0% |
| Claude-3-Opus | 200B+ | 80.1% |
| GPT-4o-mini | ~8B | 80.1% |
| j1-micro (LoRA, FP16) | 1.7B | 80.7% |
| j1-micro (MLX 4-bit) | 1.7B | 75.0% |

MLX 4-bit quantized performance on 100-sample RewardBench subset:

- Accuracy: 75.0% (0% format error rate)
- Latency: ~3.0s avg, 2.9s p50, 3.8s p95 (M-series Mac)
- Memory: 2.0 GB peak

## Files

```text
mlx/                     # MLX 4-bit quantized (Apple Silicon)
  model.safetensors      # 968 MB
  config.json
  tokenizer.json
  tokenizer_config.json
  ...
lora/                    # LoRA adapter (GPU via vLLM/PEFT)
  adapter_model.safetensors  # 67 MB
  adapter_config.json
  tokenizer.json
  ...
```

## Quick Start (MLX on Mac)

```shell
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("rachittshah/j1-micro", model_config={"subfolder": "mlx"})

SYSTEM = """You are an expert XML wrangler. You must respond in the following format:
<specific_criteria>...</specific_criteria>
<analysis>...</analysis>
<scores>\\boxed{..., ...}</scores>
Please only respond in English."""

prompt = """You are a skilled little expert at scoring responses...
#### Conversation Context ####
What is the capital of France?
#### Responses to be Scored ####
[The Begin of Response A]
The capital of France is Paris, located in northern France along the Seine River.
[The End of Response A]
[The Begin of Response B]
France's capital is Lyon, a major city in southeastern France.
[The End of Response B]"""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": prompt},
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=formatted, max_tokens=2048)
print(response)
```
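To judge other pairs, only the user prompt changes. A small helper (hypothetical, not part of this repo) can fill the template above with an arbitrary question and response pair; note that the full scoring instruction is truncated ("...") in this card, so it stays a placeholder here:

```python
# Hypothetical helper: assembles the pairwise judge prompt in the template
# shown above. The leading instruction line is truncated in the model card,
# so the "..." is kept verbatim as a placeholder.
JUDGE_TEMPLATE = """You are a skilled little expert at scoring responses...
#### Conversation Context ####
{question}
#### Responses to be Scored ####
[The Begin of Response A]
{response_a}
[The End of Response A]
[The Begin of Response B]
{response_b}
[The End of Response B]"""


def build_judge_prompt(question: str, response_a: str, response_b: str) -> str:
    """Return a user prompt asking the judge to score response_a vs response_b."""
    return JUDGE_TEMPLATE.format(
        question=question, response_a=response_a, response_b=response_b
    )
```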

## Quick Start (vLLM with LoRA)

```shell
# Download and serve with vLLM
vllm serve Qwen/Qwen3-1.7B \
  --enable-lora \
  --lora-modules j1-micro=rachittshah/j1-micro/lora
```

```python
# Or load the adapter with PEFT
from peft import PeftModel
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
model = PeftModel.from_pretrained(model, "rachittshah/j1-micro", subfolder="lora")
```
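Once the server is up, requests go through vLLM's OpenAI-compatible chat completions endpoint, with the `--lora-modules` name (`j1-micro`) as the model id. A minimal sketch, assuming the default `http://localhost:8000` address:

```python
import json
import urllib.request

# Sketch, assuming vLLM's defaults: the server listens on localhost:8000 and
# exposes /v1/chat/completions; "j1-micro" is the LoRA module name registered
# at serve time.
def build_chat_payload(system: str, user: str, model: str = "j1-micro") -> dict:
    """Build an OpenAI-style chat completions request body for the judge."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "max_tokens": 2048,
    }

# To send the request (network call, shown but not executed here):
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(build_chat_payload(SYSTEM, prompt)).encode(),
#     headers={"Content-Type": "application/json"},
# )
# body = json.loads(urllib.request.urlopen(req).read())
```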

## Output Format

The model outputs structured XML:

```text
<specific_criteria>
1. Factual accuracy (weight: 0.35) — correctness of stated facts
2. Specificity (weight: 0.25) — concrete details vs vague claims
3. Completeness (weight: 0.2) — coverage of the topic
4. Clarity (weight: 0.2) — clear, well-organized explanation
</specific_criteria>
<analysis>
Response A: Factual accuracy 9/10 — correctly identifies Paris...
Response B: Factual accuracy 2/10 — incorrectly states Lyon...
</analysis>
<scores>
\boxed{8, 3}
</scores>
```
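Since the final scores always land in a `<scores>\boxed{a, b}</scores>` block, they can be pulled out with a regex. A hypothetical parser (not part of this repo) might look like:

```python
import re

# Matches the final score block, e.g. "<scores>\n\boxed{8, 3}\n</scores>",
# tolerating whitespace around the numbers and braces.
SCORE_RE = re.compile(
    r"<scores>\s*\\boxed\{\s*(\d+)\s*,\s*(\d+)\s*\}\s*</scores>"
)


def parse_scores(output: str) -> tuple[int, int]:
    """Extract (score_a, score_b) from the judge's XML output."""
    m = SCORE_RE.search(output)
    if m is None:
        raise ValueError("no <scores>\\boxed{a, b}</scores> block found")
    return int(m.group(1)), int(m.group(2))
```

Comparing the two integers then gives the pairwise preference (A wins, B wins, or tie).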

## Training Details

- Base model: Qwen/Qwen3-1.7B (Apache 2.0)
- Method: GRPO + SPCT (Self-Principled Critique Tuning)
- Data: Skywork-Reward-Preference-80K-v0.2
- LoRA: rank=16, alpha=32, dropout=0.1, all attention + MLP projections
- Hardware: 1x A100 80GB, <24h training
- Cost: ~$25

## Citation

Original model by Haize Labs:

```bibtex
@misc{j1micro2025,
    title = {j1-micro and j1-nano: Tiny Generalist Reward Models via Inference-Time Rubric Proposal},
    author = {Haize Labs},
    url = {https://github.com/haizelabs/j1-micro},
    month = {May},
    year = {2025}
}
```

## License

Apache 2.0 (both base model Qwen3-1.7B and LoRA adapter).