squ11z1 committed on
Commit 6c604c0 · verified · 1 Parent(s): 8a3e5e1

Add CyberGym and IFBench images to Results section

Files changed (1)
  1. README.md +13 -1
README.md CHANGED
@@ -65,13 +65,19 @@ coefficient calibration during GRPO training.

  ---

- ## CyberGym Results (arXiv:2506.02548)
+ ## Results
+
+ ### CyberGym (arXiv:2506.02548)

  **CyberGym** — UC Berkeley's large-scale cybersecurity benchmark,
  1,507 real-world vulnerabilities from Google OSS-Fuzz across 188
  projects. No partial credit, no LLM judge — pass requires a valid
  PoC that crashes the pre-patch build.

+ <p align="center">
+ <img src="CyberGym.jpeg" alt="CyberGym Results" width="100%">
+ </p>
+
  | Level | Scaffold | pass@4 |
  |---|---|---|
  | Level 0 | Full scaffolding | 62% |
@@ -83,6 +89,12 @@ PoC that crashes the pre-patch build.
  > at 83.1% pass@1 (overall, closed model).
  > Mythoseek is a 10B open-weight alternative.

+ ### IFBench
+
+ <p align="center">
+ <img src="IFBench.jpeg" alt="IFBench Results" width="100%">
+ </p>
+
  ---

  ## Intended Use