Official Qwen qwen/qwen3-4b-2507 (accuracy %, three runs per benchmark)

  • MMLU-Pro: 44.70, 45.20, 44.20
  • GPQA Diamond: 37.37, 36.36, 37.37

SDFT model Qwen3-4B-2507-Heretic-SDFT-v1

  • MMLU-Pro: 44.70, 44.70, 44.10
  • GPQA Diamond: 37.37, 37.37, 36.87

Heretic Qwen3-4B-Instruct-2507-heretic-v1.2

  • MMLU-Pro: 44.20, 45.20, 44.40
  • GPQA Diamond: 40.40, 39.39, 36.36

Summary Table

| Model | MMLU-Pro mean | MMLU-Pro std | GPQA mean | GPQA std |
|---|---|---|---|---|
| qwen/qwen3-4b-2507 | 44.70% | 0.50 pp | 37.04% | 0.58 pp |
| Qwen3-4B-2507-Heretic-SDFT-v1 | 44.50% | 0.35 pp | 37.21% | 0.29 pp |
| Qwen3-4B-Instruct-2507-heretic-v1.2 | 44.60% | 0.53 pp | 38.72% | 2.10 pp |
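The MMLU-Pro columns of the table can be reproduced directly from the per-run scores with Python's standard `statistics` module (`stdev` is the sample standard deviation, which is what the table uses; the GPQA means appear to have been computed from unrounded per-run accuracies, so recomputing them from the rounded values above shifts them by about 0.01 pp):

```python
from statistics import mean, stdev

# Per-run MMLU-Pro accuracies (%) from the lists above
runs = {
    "qwen/qwen3-4b-2507": [44.70, 45.20, 44.20],
    "Qwen3-4B-2507-Heretic-SDFT-v1": [44.70, 44.70, 44.10],
    "Qwen3-4B-Instruct-2507-heretic-v1.2": [44.20, 45.20, 44.40],
}

for name, scores in runs.items():
    # stdev() uses ddof=1 (sample std), matching the "pp" columns
    print(f"{name}: mean={mean(scores):.2f}%  std={stdev(scores):.2f} pp")
```

Running this prints means of 44.70%, 44.50%, and 44.60% with stds of 0.50, 0.35, and 0.53 pp, matching the summary table.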

Interpretation

MMLU-Pro

The three models are effectively tied on the MMLU-Pro sample.

  • official: 44.70%
  • SDFT: 44.50%
  • heretic: 44.60%

The between-model differences are under one percentage point and on the same scale as run-to-run variation. On this sample, there is no evidence of a meaningful capability gap in general knowledge and reasoning between the three models.

GPQA Diamond

GPQA Diamond separates the models more clearly.

  • official: 37.04%
  • SDFT: 37.21%
  • heretic: 38.72%

The heretic model is highest on mean GPQA accuracy, but it also has by far the largest variance across runs: with a standard deviation of 2.10 pp over only three runs, its roughly 1.7 pp lead over the official quant is within run-to-run noise, so the apparent advantage should not be over-read. The SDFT model and the official quant are nearly indistinguishable on mean GPQA score.
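The noise claim can be sanity-checked with a rough Welch-style comparison of the heretic and official GPQA means, treating each run as an independent sample (a simplifying assumption, since runs score the same question set):

```python
import math

n = 3  # runs per model
mean_official, std_official = 37.04, 0.58  # from the summary table
mean_heretic, std_heretic = 38.72, 2.10

diff = mean_heretic - mean_official
# Standard error of the difference between two means (Welch)
se_diff = math.sqrt(std_official**2 / n + std_heretic**2 / n)
print(f"difference = {diff:.2f} pp, SE = {se_diff:.2f} pp, t ~= {diff / se_diff:.2f}")
```

The difference of about 1.68 pp against a standard error of about 1.26 pp gives a t-statistic near 1.3, well short of conventional significance on so few runs.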

