Official Qwen qwen/qwen3-4b-2507
- MMLU-Pro: 44.70, 45.20, 44.20
- GPQA Diamond: 37.37, 36.36, 37.37
SDFT model Qwen3-4B-2507-Heretic-SDFT-v1
- MMLU-Pro: 44.70, 44.70, 44.10
- GPQA Diamond: 37.37, 37.37, 36.87
Heretic Qwen3-4B-Instruct-2507-heretic-v1.2
- MMLU-Pro: 44.20, 45.20, 44.40
- GPQA Diamond: 40.40, 39.39, 36.36
Summary Table
| Model | MMLU-Pro mean | MMLU-Pro std | GPQA mean | GPQA std |
|---|---|---|---|---|
| qwen/qwen3-4b-2507 | 44.70% | 0.50 pp | 37.04% | 0.58 pp |
| Qwen3-4B-2507-Heretic-SDFT-v1 | 44.50% | 0.35 pp | 37.20% | 0.29 pp |
| Qwen3-4B-Instruct-2507-heretic-v1.2 | 44.60% | 0.53 pp | 38.72% | 2.10 pp |
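The means and standard deviations in the table can be reproduced from the per-run scores listed above. A minimal sketch using Python's standard `statistics` module (assuming the table uses the sample standard deviation, which matches the reported values):

```python
from statistics import mean, stdev

# Per-run accuracies (%) from the three evaluation runs above
runs = {
    "qwen/qwen3-4b-2507": {
        "MMLU-Pro": [44.70, 45.20, 44.20],
        "GPQA Diamond": [37.37, 36.36, 37.37],
    },
    "Qwen3-4B-2507-Heretic-SDFT-v1": {
        "MMLU-Pro": [44.70, 44.70, 44.10],
        "GPQA Diamond": [37.37, 37.37, 36.87],
    },
    "Qwen3-4B-Instruct-2507-heretic-v1.2": {
        "MMLU-Pro": [44.20, 45.20, 44.40],
        "GPQA Diamond": [40.40, 39.39, 36.36],
    },
}

for model, benches in runs.items():
    for bench, scores in benches.items():
        # stdev() is the Bessel-corrected sample standard deviation
        print(f"{model} | {bench}: mean={mean(scores):.2f}%, std={stdev(scores):.2f} pp")
```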
Interpretation
MMLU-Pro
The three models are effectively tied on the MMLU-Pro sample.
- official: 44.70%
- SDFT: 44.50%
- heretic: 44.60%
The between-model differences are within 0.2 percentage points and are on the same scale as run-to-run variation. On this sample, there is no evidence of a meaningful capability gap in general knowledge/reasoning between the three models.
GPQA Diamond
GPQA Diamond separates the models more clearly.
- official: 37.04%
- SDFT: 37.20%
- heretic: 38.72%
The heretic model is highest on mean GPQA accuracy, but it also has the largest variance across runs. The SDFT model and the official quant are nearly indistinguishable on mean GPQA score.
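One way to check whether the heretic model's GPQA edge over the official model exceeds run-to-run noise is a rough Welch's t-test on the three runs. This is only indicative with n=3 per model, but it sketches the comparison:

```python
from math import sqrt
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(a) - mean(b)) / se

heretic = [40.40, 39.39, 36.36]   # GPQA Diamond runs, heretic-v1.2
official = [37.37, 36.36, 37.37]  # GPQA Diamond runs, official model

print(f"t = {welch_t(heretic, official):.2f}")  # prints "t = 1.34"
```

With so few runs and the heretic model's large variance, t ≈ 1.34 is well below any conventional significance threshold, consistent with treating the GPQA gap as suggestive rather than established.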
Model: Ilya626/Qwen3-4B-2507-Heretic-SDFT-v1
Base model: Qwen/Qwen3-4B-Instruct-2507