---
license: apache-2.0
---

### In-class comparison against open-source reasoning models

| Category | Benchmark | Zaya1-8B<br>(0.7B / 8.0B) | Qwen3-4B-Thinking-2507<br>(4.0B / 4.0B) | Qwen3.5-4B<br>(4.0B / 4.0B) | Gemma-4-E4B-it<br>(4.0B / 8.0B*) |
|---|---|---:|---:|---:|---:|
| Math | AIME'26 | 89.1 | 77.5 | 84.5 | 50.3 |
| Math | HMMT Feb.'26 | 71.6 | 60.8 | 63.6 | 32.1 |
| Math | IMO-AnswerBench | 59.3 | 50.9 | 48.7 | 27.3 |
| Math | APEX-shortlist | 32.2 | 16.9 | -- | 6.1 |
| Code | LiveCodeBench-v6 | 65.8 | 54.2 | -- | 54.2 |
| Knowledge | GPQA-Diamond | 71.0 | 66.5 | 76.2 | 57.4 |
| Knowledge | MMLU-Pro | 74.2 | 74.3 | 79.1 | 70.2 |
| Instruction | IFEval | 85.58 | 86.8 | 89.8 | 88.50 |
| Instruction | IFBench | 52.56 | 52.9 | 59.2 | 42.67 |
| Style & chat | EQBench | 72.95 | 79.6 | 79.5 | 80.15 |
| Style & chat | Creative Writing v3 | 62.97 | 58.6 | 72.9 | 83.75 |
| Agentic | BFCL-v4 | 39.22 | 49.7 | 45.2 | 31.7 |
| Agentic | τ² | 43.12 | 52.9 | 82.1 | 37.7 |

\* Gemma-4-E4B-it includes 4B additional embedding parameters as part of its total.
> **RW:** Zaya1-8B numbers in this draft table come from the math+code+TTC soup checkpoint, before final behavioral RL; refresh them once the final checkpoint is selected.