BerenMillidge committed on
Commit e9dcba3 · verified · 1 Parent(s): 6e6246b

Update README.md

Files changed (1)
  1. README.md +28 -3
README.md CHANGED
@@ -1,3 +1,28 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+
+
+
+
+ ### In-class comparison against open-source reasoning models
+
+ | Category | Benchmark | Zaya1-8B<br>(0.7B / 8.0B) | Qwen3-4B-Thinking-2507<br>(4.0B / 4.0B) | Qwen3.5-4B<br>(4.0B / 4.0B) | Gemma-4-E4B-it<br>(4.0B / 8.0B*) |
+ |---|---|---:|---:|---:|---:|
+ | Math | AIME'26 | 89.1 | 77.5 | 84.5 | 50.3 |
+ | Math | HMMT Feb.'26 | 71.6 | 60.8 | 63.6 | 32.1 |
+ | Math | IMO-AnswerBench | 59.3 | 50.9 | 48.7 | 27.3 |
+ | Math | APEX-shortlist | 32.2 | 16.9 | -- | 6.1 |
+ | Code | LiveCodeBench-v6 | 65.8 | 54.2 | -- | 54.2 |
+ | Knowledge | GPQA-Diamond | 71.0 | 66.5 | 76.2 | 57.4 |
+ | Knowledge | MMLU-Pro | 74.2 | 74.3 | 79.1 | 70.2 |
+ | Instruction | IFEval | 85.58 | 86.8 | 89.8 | 88.50 |
+ | Instruction | IFBench | 52.56 | 52.9 | 59.2 | 42.67 |
+ | Style & chat | EQBench | 72.95 | 79.6 | 79.5 | 80.15 |
+ | Style & chat | Creative Writing v3 | 62.97 | 58.6 | 72.9 | 83.75 |
+ | Agentic | BFCL-v4 | 39.22 | 49.7 | 45.2 | 31.7 |
+ | Agentic | τ² | 43.12 | 52.9 | 82.1 | 37.7 |
+
+ \* Gemma-4-E4B-it includes 4B additional embedding parameters as part of its total.
+
+ > RW: Zaya1-8B numbers in this draft table are from the math+code+TTC soup checkpoint before final behavioral RL and should be refreshed after the final checkpoint is selected.