BerenMillidge committed · Commit 6cd227b · verified · 1 Parent(s): e9dcba3

Update README.md

Files changed (1): README.md +21 -2
README.md CHANGED
@@ -3,7 +3,13 @@ license: apache-2.0
 ---
 
 
+## Performance
 
+Zaya1-8B performs strongly, especially on challenging mathematical, reasoning, and coding benchmarks, and is competitive with models several times its size.
+
+![zaya1_scaling_barchart_v3](https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/K7EnqZ1nYX_OJBVqvs8tM.png)
+
+We first compare Zaya1-8B to the SOTA Qwen3 and Qwen3.5 model series at approximately the same parameter count, as well as the recently released Gemma4 models, and then to a variety of larger open-weight models.
 
 ### In-class comparison against open-source reasoning models
 
@@ -23,6 +29,19 @@ license: apache-2.0
 | Agentic | BFCL-v4 | 39.22 | 49.7 | 45.2 | 31.7 |
 | Agentic | τ² | 43.12 | 52.9 | 82.1 | 37.7 |
 
-\* Gemma-4-E4B-it includes 4B additional embedding parameters as part of its total.
-
-> RW: Zaya1-8B numbers in this draft table are from the math+code+TTC soup checkpoint before final behavioral RL and should be refreshed after the final checkpoint is selected.
+
+### Scaling comparison against larger open-source reasoning models
+
+| Model | Active | Total | AIME'26 | HMMT'26 | LCB-v6 | IFEval | GPQA-D | MMLU-Pro |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|
+| Zaya1-8B | 0.7B | 8B | 89.1 | 71.6 | 63.8 | 85.8 | 71.0 | 74.2 |
+| Arcee-Trinity-Mini | 3B | 26B | 59.6 | 36.9 | 33.3 | 62.0 | 46.8 | 70.6 |
+| N3-Nano-30B | 3B | 30B | 90.1 | 75.5 | 64.6 | 92.8 | 75.1 | 78.9 |
+| OLMo-3.1-32B-Think | 32B | 32B | 78.9 | 50.6 | 58.3 | 93.2 | 59.6 | 75.8 |
+| Qwen3-Next-80B-A3B-Think | 3B | 80B | 90.2 | 79.3 | 67.8 | 88.5 | 76.7 | 82.6 |
+| Intellect-3 | 12B | 106B | 86.3 | 72.2 | 66.8 | 81.2 | 74.6 | 82.3 |
+| Mistral-Small-4-119B | 6B | 119B | 86.4 | 70.6 | 57.9 | 84.0 | 77.2 | 81.6 |
+
+All numbers were obtained with the Zyphra evaluation harness. Models are ordered by total parameter count.
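As a quick illustration of the ordering claim above, the following minimal Python sketch checks that the scaling-comparison rows ascend by total parameter count (model names and counts are copied from the table; the `param_count_to_billions` helper is hypothetical and not part of any Zyphra tooling):

```python
# Sketch: verify the scaling-comparison table is ordered by total parameter count.
# Row data is copied from the table above; the parsing helper is illustrative only.

def param_count_to_billions(s: str) -> float:
    """Convert a string like '0.7B' or '119B' to a float number of billions."""
    return float(s.rstrip("B"))

rows = [
    ("Zaya1-8B", "0.7B", "8B"),
    ("Arcee-Trinity-Mini", "3B", "26B"),
    ("N3-Nano-30B", "3B", "30B"),
    ("OLMo-3.1-32B-Think", "32B", "32B"),
    ("Qwen3-Next-80B-A3B-Think", "3B", "80B"),
    ("Intellect-3", "12B", "106B"),
    ("Mistral-Small-4-119B", "6B", "119B"),
]

totals = [param_count_to_billions(total) for _, _, total in rows]
assert totals == sorted(totals)  # rows ascend by total parameter count
```

Note that the ordering is by total parameters, not active parameters: sorting by the active column would place Zaya1-8B (0.7B active) first but reorder the rest.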