EnricoFermi committed on
Commit 44c62d8 · verified · 1 Parent(s): 73d687b

Add HumanEval benchmark results (57.3% pass@1)

Files changed (1)
  1. README.md +16 -0
README.md CHANGED
@@ -54,6 +54,22 @@ The architecture co-evolves with training: heads that contribute to the domain s
  | Cycles | 3 |
  | Steps/Cycle | 500 |
 
+ ## Benchmarks
+
+ | Model | Size | HumanEval | HumanEval+ |
+ |-------|------|-----------|------------|
+ | StarCoder2-3B | 3B | 31.7% | — |
+ | Qwen2.5-Coder-3B | 3B | ~31% | — |
+ | Phi-2 | 2.7B | 47.6% | — |
+ | **qwen3.5-4b-code-forged** | **3.4B** | **57.3%** | **49.4%** |
+
+ **+20% above Phi-2, +82% above StarCoder2-3B** in the sub-5B class.
+
+ - **HumanEval**: 57.3% pass@1 (94/164 base problems)
+ - **HumanEval+**: 49.4% pass@1 (81/164 base + extra tests)
+ - **Method**: Greedy decoding (temperature 0), single sample, EvalPlus framework
+ - **Hardware**: Evaluated as fp16 HuggingFace transformers on RTX 5090
+
  ## Runs On
 
  | Device | Format | Verified |
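
The percentages added in this commit can be checked from the raw counts it reports. The sketch below is not part of the commit or of EvalPlus; `pass_at_1` and `relative_gain` are hypothetical helper names used only for illustration. It assumes one greedy sample per problem, so pass@1 reduces to the fraction of problems solved, and it reads the headline "+N% above" figures as relative gains rather than percentage-point differences.

```python
def pass_at_1(solved: int, total: int) -> float:
    """Percent of problems whose single greedy sample passed every test."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * solved / total


def relative_gain(score: float, baseline: float) -> float:
    """Relative improvement of `score` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (score - baseline) / baseline


if __name__ == "__main__":
    # Counts taken from the README lines added in this commit.
    humaneval = pass_at_1(94, 164)
    humaneval_plus = pass_at_1(81, 164)
    print(f"HumanEval  pass@1: {humaneval:.1f}%")
    print(f"HumanEval+ pass@1: {humaneval_plus:.1f}%")

    # Baselines are the rounded scores shown in the comparison table.
    print(f"vs Phi-2:         +{relative_gain(57.3, 47.6):.1f}%")
    print(f"vs StarCoder2-3B: +{relative_gain(57.3, 31.7):.1f}%")
```

The counts reproduce the reported 57.3% and 49.4%. With the rounded table values, the relative gains come out at about 20.4% over Phi-2 and 80.8% over StarCoder2-3B, so the README's "+82%" may rest on a slightly different baseline figure than the one shown in the table.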